I have a data frame that looks like:
d<-data.frame(id=(1:9),
grp_id=(c(rep(1,3), rep(2,3), rep(3,3))),
a=rep(NA, 9),
b=c("No", rep(NA, 3), "Yes", rep(NA, 4)),
c=c(rep(NA,2), "No", rep(NA,6)),
d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)),
e=c(rep(NA, 7), "No", NA),
f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
id grp_id a b c d e f
1 1 1 NA No <NA> <NA> <NA> <NA>
2 2 1 NA <NA> <NA> <NA> <NA> No
3 3 1 NA <NA> No <NA> <NA> <NA>
4 4 2 NA <NA> <NA> Yes <NA> <NA>
5 5 2 NA Yes <NA> <NA> <NA> <NA>
6 6 2 NA <NA> <NA> <NA> <NA> No
7 7 3 NA <NA> <NA> No <NA> <NA>
8 8 3 NA <NA> <NA> <NA> No <NA>
9 9 3 NA <NA> <NA> <NA> <NA> No
Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.
I'd like to create a single row for each grp_id to get a data frame that looks like the following:
grp_id a b c d e f
1 NA No No <NA> <NA> No
2 NA Yes <NA> Yes <NA> No
3 NA <NA> <NA> No No No
I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be
d %>%
group_by(grp_id) %>%
summarise()
I would appreciate help with the commands within summarise, or any solution really. Thanks.
You've received some good answers but neither of them actually uses the
tidyr
package. (Thesummarize()
andsummarize_at()
family of functions is fromdplyr
.)In fact, a
tidyr
-only solution for your problem is very doable.The only hard part is ensuring that you get the
a
column in your output. For your example data, it is entirelyNA
. The trick is thefactor_key=TRUE
argument togather()
and thedrop=FALSE
argument tospread()
. Without those two arguments being set, the output would not have ana
column, and would only have columns with at least one non-NA
entry.Here's a description of how it works:
This tidies your data -- it effectively replaces columns
a
-f
with new columnscol
andvalue
, forming a long-formated "tidy" data frame. The entries in thecol
column are lettersa
-f
. And because we've usedfactor_key=TRUE
, this column is a factor with levels, not just a character vector.This removes all the
NA
values from the long data.This eliminates the
id
column.This re-widens the data, using the values in the
col
column to define new column names, and the values in thevalue
column to fill in the entries of the new columns. When data is missing, a value offill
(hereNA
) is used instead. And thedrop=FALSE
means that whencol
is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with settingcol
to be a factor, is what getsa
as an output column.I personally find this approach more readable than the approaches requiring subsetting or
lapply
stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columnsa
-f
are not factors, but character vectors. If you need factor output you should be able to do (untested)anywhere between the
gather()
andspread()
functions to ensure factor output.We can use
summarise_at
and subset the first non-NA elementIn the example dataset, columns 'a' to 'f' are all
factors
with some having only 'No' levels. If it needs to be standardized with all the columns having the samelevels
, then we may need to call thefactor
withlevels
specified asc('Yes', 'No')
in thesummarise_at
i.e.summarise_at(2:7, funs(factor(.[!is.na(.)][1], levels = c('Yes', 'No'))))
We can use
aggregate
. No packages are used.giving:
The above gives character columns. If you prefer factor columns then use this:
Note: Other alternate implementations of YN are: