Unpacking and merging lists in a column in data.frame

Published 2020-02-10 03:42

Question:

I have the following data.frame:

id     name   altNames
1001   Joan   character(0)      
1002   Jane   c("Janie", "Janet", "Jan")
1003   John   Jon
1004   Bill   Will
1005   Tom    character(0)      

The column altNames could be empty (i.e. character(0)), have just one name, or a list of names. What I want is a data.frame (or a list) where each entry from name and/or altNames appears just once along with the corresponding id, like this:

id     name
1001   Joan
1002   Jane
1002   Janie
1002   Janet
1002   Jan
1003   John
1003   Jon
1004   Bill
1004   Will
1005   Tom

What's the most efficient way of doing it? Even better if dplyr is used. Thanks.

Edit: Here's the data:

df <- data_frame(
  id = c("1001", "1002","1003", "1004", "1005"), 
  name = c("Joan", "Jane", "John", "Bill", "Tom"), 
  altNames = list(character(0), c("Janie", "Janet", "Jan"), "Jon", "Will", character(0))
)

Answer 1:

Here's a possible data.table approach

library(data.table)
# df is the data frame from the question; setDT() converts it by reference
setDT(df)[, .(name = c(name, unlist(altNames))), by = id]
#       id  name
#  1: 1001  Joan
#  2: 1002  Jane
#  3: 1002 Janie
#  4: 1002 Janet
#  5: 1002   Jan
#  6: 1003  John
#  7: 1003   Jon
#  8: 1004  Bill
#  9: 1004  Will
# 10: 1005   Tom


Answer 2:

A base R version (using the df added by @rawr)

with(df, {
    # SIMPLIFY = FALSE keeps the result a list even when all lengths match
    ns <- mapply(c, name, altNames, SIMPLIFY = FALSE)
    data.frame(id = rep(id, times = lengths(ns)), name = unlist(ns), row.names = NULL)
})
#     id  name
#1  1001  Joan
#2  1002  Jane
#3  1002 Janie
#4  1002 Janet
#5  1002   Jan
#6  1003  John
#7  1003   Jon
#8  1004  Bill
#9  1004  Will
#10 1005   Tom


Answer 3:

Here's a full dplyr + tidyr solution, the way I'd tackle it:

library(dplyr)
library(tidyr)

df <- data_frame(
  id = c("1001", "1002","1003", "1004", "1005"), 
  name = c("Joan", "Jane", "John", "Bill", "Tom"), 
  altNames = list(character(0), c("Janie", "Janet", "Jan"), "Jon", "Will", character(0))
)

# Need some way to concatenate a list of vectors with a vector
# in a "rowwise" way
vector_c <- function(...) {
  Map(c, ...)
}

df %>% 
  mutate(
    names = vector_c(name, altNames),
    altNames = NULL,
    name = NULL
  ) %>% 
  unnest(names)
#> Source: local data frame [10 x 2]
#> 
#>      id names
#> 1  1001  Joan
#> 2  1002  Jane
#> 3  1002 Janie
#> 4  1002 Janet
#> 5  1002   Jan
#> 6  1003  John
#> 7  1003   Jon
#> 8  1004  Bill
#> 9  1004  Will
#> 10 1005   Tom

Most of the hard work is done by tidyr::unnest(): it's designed to take a data frame with a list-column and unnest it, repeating the other columns as needed.
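
To make the "repeating the other columns as needed" part concrete, here is a minimal base R sketch of the same idea. It is not how tidyr implements unnest(); the toy data and the names toy, vals, n and flat are made up for illustration.

```r
# Toy data (hypothetical): row 2 holds three values, row 3 holds none.
toy <- data.frame(id = c("1001", "1002", "1003"), stringsAsFactors = FALSE)
toy$vals <- list("Joan", c("Jane", "Janie", "Jan"), character(0))

# Number of values stored in each row of the list-column.
n <- lengths(toy$vals)

# Repeat each id once per value, then flatten the list-column alongside it.
flat <- data.frame(
  id = rep(toy$id, times = n),
  vals = unlist(toy$vals),
  stringsAsFactors = FALSE
)
print(flat)
```

In this sketch a row whose list entry is empty contributes zero rows, so id 1003 disappears from the result. Concatenating name with altNames before unnesting, as done above, means every id keeps at least one value.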



Answer 4:

Using tidyr, after cleaning the data with data.table:

First, fix the data:

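library(data.table)
dat <- setDT(df)  # df is the data frame from the question
# Replace empty entries with NA so those rows survive unnest()
dat$altNames[sapply(dat$altNames, length) == 0] <- NA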

Now unnest from tidyr and some dplyr:

library(dplyr)
library(tidyr)
dat %>% unnest(altNames) %>%
        group_by(id) %>%
        do(unique(c(.[["name"]],.[["altNames"]])))

     id    V1
1  1001  Joan
2  1001    NA
3  1002  Jane
4  1002 Janie
5  1002 Janet
6  1002   Jan
7  1003  John
8  1003   Jon
9  1004  Bill
10 1004  Will
11 1005   Tom
12 1005    NA

The result contains NAs, but they are easily removed with %>% na.omit.
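
As a small illustration of that last step, the sketch below applies na.omit() to a hand-built stand-in for part of the result. The data frame res and its column names are assumptions for the example, not the actual output object.

```r
# Hypothetical stand-in for a few rows of the result above.
res <- data.frame(
  id = c("1001", "1001", "1002", "1002"),
  name = c("Joan", NA, "Jane", "Janie"),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that contains an NA in any column.
clean <- na.omit(res)
print(clean)
```

Because na.omit() checks all columns, it would also drop rows where some other column is legitimately NA. That is harmless here, since only the placeholder entries are missing.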

I think data.table is the winner on this one.