combine duplicates, do not publish blanks, dplyr::

2019-09-22 02:42发布

问题:

I trying to using dplyr distinct to combine rows, delete duplicates, and delete blanks as well. Here is my data frame:

unique_id   school  subject  grade  sex
    1       great   Math      88    
    1       great   English   78    
    1       great   History   98    male
    2       spring  Math      65    
    2       spring  English   72    female
    2       spring  History   84    

When I run (thank you Akrun):

(r2 <- df %>%
  group_by(unique_id) %>% 
  summarise_each(funs(toString(unique(.)))))

I get:

unique_id   school  subject                     grade       sex
    1       great   Math, English, History      88,78,98     , male 
    2       spring  English, English, History   65,72,84     , female

I don't want blanks to be included or commas in the last variable, sex. Instead, I'd like it to look as follows:

unique_id   school  subject                     grade       sex
    1       great   Math, English, History      88,78,98     male   
    2       spring  English, English, History   65,72,84     female

Any tried adding NA on the import, then removing it after condensing and that didn't work. Any ideas how to condense rows, but only keep the value in the row and ignore blanks? Thank you.

回答1:

Perhaps the reason that you are having problems is that you are using empty strings when you should be using NAs. This is what I would assume is the idiomatic code.

df <- data.frame(unique_id = c(rep(1,3),rep(2,3)),
                school = c(rep('great',3),rep('spring',3)),
                           subject = rep(c("Math", "English", "History"),2),
                           grade = c(88,78,98,65,72,84),
                           sex = c(NA,NA, "male", NA, "female", NA))

r2 <- df %>%
  group_by(unique_id) %>% 
  summarise_each(funs(toString(unique(.))))

which returns

# A tibble: 2 x 5
  unique_id school                subject      grade        sex
      <dbl>  <chr>                  <chr>      <chr>      <chr>
1         1  great Math, English, History 88, 78, 98   NA, male
2         2 spring Math, English, History 65, 72, 84 NA, female

You can always

 r2$sex <- sapply(stringr::str_split(r2$sex, ", "),"[",2)

afterwards if you really want to remove those NAs, but I see them as informative.

You can write your own function to supply to summarize_each, which will allow you to take care of NAs in any column. Note, that you only need to do this because unique, rightfully so, does not have an na.rm argument.

rm_na_unique <- function(vec){
  unique(vec[!is.na(vec)])
}

r2 <- df %>%
       group_by(unique_id) %>% 
       summarise_each(funs(toString(rm_na_unique(.))))

Gives you the same result

# A tibble: 2 x 5
  unique_id school                subject      grade    sex
      <dbl>  <chr>                  <chr>      <chr>  <chr>
1         1  great Math, English, History 88, 78, 98   male
2         2 spring Math, English, History 65, 72, 84 female


回答2:

You can fill the sex column so that there will be no empty strings in it and then your solution will work:

library(dplyr)
df %>% group_by(unique_id) %>% mutate(sex = unique(sex[sex != ""])) %>% 
       summarise_each(funs(toString(unique(.))))

# Source: local data frame [2 x 5]
# 
#   unique_id school                subject      grade    sex
#       <chr>  <chr>                  <chr>      <chr>  <chr>
# 1         1  great Math, English, History 88, 78, 98   male
# 2         2 spring Math, English, History 65, 72, 84 female


标签: r dplyr