R: Cleaning up a wide and untidy dataframe

I have a data frame that looks like:

d<-data.frame(id=(1:9), 
                  grp_id=(c(rep(1,3), rep(2,3), rep(3,3))), 
                  a=rep(NA, 9), 
                  b=c("No", rep(NA, 3), "Yes", rep(NA, 4)), 
                  c=c(rep(NA,2), "No", rep(NA,6)), 
                  d=c(rep(NA,3), "Yes", rep(NA,2), "No", rep(NA,2)), 
                  e=c(rep(NA, 7), "No", NA), 
                  f=c(NA, "No", rep(NA,3), "No", rep(NA,2), "No"))
>d
  id grp_id  a    b    c    d    e    f
1  1      1 NA   No <NA> <NA> <NA> <NA>
2  2      1 NA <NA> <NA> <NA> <NA>   No
3  3      1 NA <NA>   No <NA> <NA> <NA>
4  4      2 NA <NA> <NA>  Yes <NA> <NA>
5  5      2 NA  Yes <NA> <NA> <NA> <NA>
6  6      2 NA <NA> <NA> <NA> <NA>   No
7  7      3 NA <NA> <NA>   No <NA> <NA>
8  8      3 NA <NA> <NA> <NA>   No <NA>
9  9      3 NA <NA> <NA> <NA> <NA>   No

Within each group (grp_id) there is only 1 "Yes" or "No" value associated with each of the columns a:f.

I'd like to create a single row for each grp_id to get a data frame that looks like the following:

grp_id  a    b    c    d    e    f
     1 NA   No   No <NA> <NA>   No
     2 NA  Yes <NA>  Yes <NA>   No
     3 NA <NA> <NA>   No   No   No

I recognize that the tidyr package is probably the best tool and the 1st steps are likely to be

d %>% 
   group_by(grp_id) %>%
     summarise()

I would appreciate help with the commands within summarise, or any solution really. Thanks.

标签： r tidyr data-cleaning summarize

3条回答

放我归山

2楼-- · 2019-07-11 07:22

You've received some good answers but neither of them actually uses the tidyr package. (The summarize() and summarize_at() family of functions is from dplyr.)

In fact, a tidyr-only solution for your problem is very doable.

d %>%
    gather(col, value, -id, -grp_id, factor_key=TRUE) %>%
    na.omit() %>%
    select(-id) %>%
    spread(col, value, fill=NA, drop=FALSE)

The only hard part is ensuring that you get the a column in your output. For your example data, it is entirely NA. The trick is the factor_key=TRUE argument to gather() and the drop=FALSE argument to spread(). Without those two arguments being set, the output would not have an a column, and would only have columns with at least one non-NA entry.

Here's a description of how it works:

gather(col, value, -id, -grp_id, factor_key=TRUE) %>%

This tidies your data -- it effectively replaces columns a - f with new columns col and value, forming a long-formated "tidy" data frame. The entries in the col column are letters a - f. And because we've used factor_key=TRUE, this column is a factor with levels, not just a character vector.

na.omit() %>%

This removes all the NA values from the long data.

select(-id) %>%

This eliminates the id column.

spread(col, value, fill=NA, drop=FALSE)

This re-widens the data, using the values in the col column to define new column names, and the values in the value column to fill in the entries of the new columns. When data is missing, a value of fill (here NA) is used instead. And the drop=FALSE means that when col is a factor, there will be one column per level of the factor, no matter whether that level appears in the data or not. This, along with setting col to be a factor, is what gets a as an output column.

I personally find this approach more readable than the approaches requiring subsetting or lapply stuff. Additionally, this approach will fail if your data is not actually one-hot, whereas other approaches may "work" and give you unexpected output. The downside of this approach is that the output columns a - f are not factors, but character vectors. If you need factor output you should be able to do (untested)

mutate(value = factor(value, levels=c('Yes', 'No', NA))) %>%

anywhere between the gather() and spread() functions to ensure factor output.

0人赞添加讨论(0) 举报

beautiful°

3楼-- · 2019-07-11 07:26

We can use summarise_at and subset the first non-NA element

library(dplyr)
d %>%
   group_by(grp_id) %>%
   summarise_at(2:7, funs(.[!is.na(.)][1]))
# A tibble: 3 x 7
#   grp_id     a      b      c      d      e      f
#    <dbl> <lgl> <fctr> <fctr> <fctr> <fctr> <fctr>
#1      1    NA     No     No   <NA>   <NA>     No
#2      2    NA    Yes   <NA>    Yes   <NA>     No
#3      3    NA   <NA>   <NA>     No     No     No

In the example dataset, columns 'a' to 'f' are all factors with some having only 'No' levels. If it needs to be standardized with all the columns having the same levels, then we may need to call the factor with levels specified as c('Yes', 'No') in the summarise_at i.e. summarise_at(2:7, funs(factor(.[!is.na(.)][1], levels = c('Yes', 'No'))))

0人赞添加讨论(0) 举报

Melony?

4楼-- · 2019-07-11 07:46

We can use aggregate. No packages are used.

 YN <- function(x) c(na.omit(as.character(x)), NA)[1]
 aggregate(d[3:8], d["grp_id"], YN)

giving:

##   grp_id    a    b    c    d    e  f
## 1      1 <NA>   No   No <NA> <NA> No
## 2      2 <NA>  Yes <NA>  Yes <NA> No
## 3      3 <NA> <NA> <NA>   No   No No

The above gives character columns. If you prefer factor columns then use this:

YNfac <- function(x) factor(YN(x), c("No", "Yes"))
aggregate(d[3:8], d["grp_id"], YNfac)

Note: Other alternate implementations of YN are:

YN <- function(x) sort(as.character(x), na.last = TRUE)[1]

YN <- function(x) if (all(is.na(x))) NA_character_ else na.omit(as.character(x))[1]

library(zoo)
YN <- function(x) na.locf0(as.character(x), fromLast = TRUE)[1]

0人赞添加讨论(0) 举报

R: Cleaning up a wide and untidy dataframe

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间