I have a data frame that looks like:
d<-data.frame(id=(1:9),
grp_id=(c(rep(1,3), rep(2,3), rep(3,3))),
a=rep(NA, 9),
b=c("No", rep(NA, 3), "Yes", rep(NA, 4)),
c=c(rep(NA,2), "No", rep(NA,6)),
d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)),
e=c(rep(NA, 7), "No", NA),
f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
id grp_id a b c d e f
1 1 1 NA No <NA> <NA> <NA> <NA>
2 2 1 NA <NA> <NA> <NA> <NA> No
3 3 1 NA <NA> No <NA> <NA> <NA>
4 4 2 NA <NA> <NA> Yes <NA> <NA>
5 5 2 NA Yes <NA> <NA> <NA> <NA>
6 6 2 NA <NA> <NA> <NA> <NA> No
7 7 3 NA <NA> <NA> No <NA> <NA>
8 8 3 NA <NA> <NA> <NA> No <NA>
9 9 3 NA <NA> <NA> <NA> <NA> No
Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.
I'd like to create a single row for each grp_id to get a data frame that looks like the following:
grp_id a b c d e f
1 NA No No <NA> <NA> No
2 NA Yes <NA> Yes <NA> No
3 NA <NA> <NA> No No No
I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be
d %>%
group_by(grp_id) %>%
summarise()
I would appreciate help with the commands within summarise, or any solution really. Thanks.
We can use summarise_at
and subset the first non-NA element
library(dplyr)
d %>%
group_by(grp_id) %>%
summarise_at(2:7, funs(.[!is.na(.)][1]))
# A tibble: 3 x 7
# grp_id a b c d e f
# <dbl> <lgl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1 1 NA No No <NA> <NA> No
#2 2 NA Yes <NA> Yes <NA> No
#3 3 NA <NA> <NA> No No No
In the example dataset, columns 'a' to 'f' are all factors
with some having only 'No' levels. If it needs to be standardized with all the columns having the same levels
, then we may need to call the factor
with levels
specified as c('Yes', 'No')
in the summarise_at
i.e. summarise_at(2:7, funs(factor(.[!is.na(.)][1], levels = c('Yes', 'No'))))
We can use aggregate
. No packages are used.
YN <- function(x) c(na.omit(as.character(x)), NA)[1]
aggregate(d[3:8], d["grp_id"], YN)
giving:
## grp_id a b c d e f
## 1 1 <NA> No No <NA> <NA> No
## 2 2 <NA> Yes <NA> Yes <NA> No
## 3 3 <NA> <NA> <NA> No No No
The above gives character columns. If you prefer factor columns then use this:
YNfac <- function(x) factor(YN(x), c("No", "Yes"))
aggregate(d[3:8], d["grp_id"], YNfac)
Note: Other alternate implementations of YN are:
YN <- function(x) sort(as.character(x), na.last = TRUE)[1]
YN <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]
library(zoo)
YN <- function(x) na.locf0(as.character(x), fromLast = TRUE)[1]
You've received some good answers but neither of them actually uses the tidyr
package. (The summarize()
and summarize_at()
family of functions is from dplyr
.)
In fact, a tidyr
-only solution for your problem is very doable.
d %>%
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
na.omit() %>%
select(-id) %>%
spread(col, value, fill=NA, drop=FALSE)
The only hard part is ensuring that you get the a
column in your output. For your example data, it is entirely NA
. The trick is the factor_key=TRUE
argument to gather()
and the drop=FALSE
argument to spread()
. Without those two arguments being set, the output would not have an a
column, and would only have columns with at least one non-NA
entry.
Here's a description of how it works:
gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
This tidies your data -- it effectively replaces columns a
- f
with new columns col
and value
, forming a long-formated "tidy" data frame. The entries in the col
column are letters a
- f
. And because we've used factor_key=TRUE
, this column is a factor with levels, not just a character vector.
na.omit() %>%
This removes all the NA
values from the long data.
select(-id) %>%
This eliminates the id
column.
spread(col, value, fill=NA, drop=FALSE)
This re-widens the data, using the values in the col
column to define new column names, and the values in the value
column to fill in the entries of the new columns. When data is missing, a value of fill
(here NA
) is used instead. And the drop=FALSE
means that when col
is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col
to be a factor, is what gets a
as an output column.
I personally find this approach more readable than the approaches requiring subsetting or lapply
stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a
- f
are not factors, but character vectors. If you need factor output you should be able to do (untested)
mutate(value = factor(value, levels=c('Yes', 'No', NA))) %>%
anywhere between the gather()
and spread()
functions to ensure factor output.