I have the following data.frame
:
id name altNames
1001 Joan character(0)
1002 Jane c("Janie", "Janet", "Jan")
1003 John Jon
1004 Bill Will
1005 Tom character(0)
The column altNames
could be empty (i.e. character(0)), have just one name, or a list of names. What I want is a data.frame
(or a list) where each entry from name
and/or altNames
appears just once along with the corresponding id
, like this:
id name
1001 Joan
1002 Jane
1002 Janie
1002 Janet
1002 Jan
1003 John
1003 Jon
1004 Bill
1004 Will
1005 Tom
What's the most efficient way of doing it? Even better is dplyr
is utilized.
Thanks
Edit: Here's the data:
df <- data_frame(
id = c("1001", "1002","1003", "1004", "1005"),
name = c("Joan", "Jane", "John", "Bill", "Tom"),
altNames = list(character(0), c("Janie", "Janet", "Jan"), "Jon", "Will", character(0))
)
Here's a possible data.table
approach
library(data.table)
setDT(dat)[, .(name = c(name, unlist(altNames))), by = id]
# id name
# 1: 1001 Joan
# 2: 1002 Jane
# 3: 1002 Janie
# 4: 1002 Janet
# 5: 1002 Jan
# 6: 1003 John
# 7: 1003 Jon
# 8: 1004 Bill
# 9: 1004 Will
# 10: 1005 Tom
A base R version (using the df
added by @rawr)
with(df, {
ns <- mapply(c, name, altNames)
data.frame(id = rep(id, times=lengths(ns)), name=unlist(ns), row.names=NULL)
})
# id name
#1 1001 Joan
#2 1002 Jane
#3 1002 Janie
#4 1002 Janet
#5 1002 Jan
#6 1003 John
#7 1003 Jon
#8 1004 Bill
#9 1004 Will
#10 1005 Tom
Here's a full dplyr + tidyr solution, the way I'd tackle it:
library(dplyr)
library(tidyr)
df <- data_frame(
id = c("1001", "1002","1003", "1004", "1005"),
name = c("Joan", "Jane", "John", "Bill", "Tom"),
altNames = list(character(0), c("Janie", "Janet", "Jan"), "Jon", "Will", character(0))
)
# Need some way to concatenate a list of vectors with a vectors
# in a "rowwise" way
vector_c <- function(...) {
Map(c, ...)
}
df %>%
mutate(
names = vector_c(name, altNames),
altNames = NULL,
name = NULL
) %>%
unnest(names)
#> Source: local data frame [10 x 2]
#>
#> id names
#> 1 1001 Joan
#> 2 1002 Jane
#> 3 1002 Janie
#> 4 1002 Janet
#> 5 1002 Jan
#> 6 1003 John
#> 7 1003 Jon
#> 8 1004 Bill
#> 9 1004 Will
#> 10 1005 Tom
Most of the hard work is done by tidyr::unnest()
: it's designed to take data frame with a list-column and unnest it, repeating the other columns as needed.
Using tidyr
, after cleaning the data with data.table
:
First, fix the data:
library(data.table)
dat<-setDT(dat)
dat$altNames[sapply(dat$altNames, length) == 0] <- NA
Now unnest
from tidyr
and some dplyr
:
library(dplyr)
library(tidyr)
dat %>% unnest(altNames) %>%
group_by(id) %>%
do(unique(c(.[["name"]],.[["altNames"]])))
id V1
1 1001 Joan
2 1001 NA
3 1002 Jane
4 1002 Janie
5 1002 Janet
6 1002 Jan
7 1003 John
8 1003 Jon
9 1004 Bill
10 1004 Will
11 1005 Tom
12 1005 NA
it has the NAs, but they are easily removed with %>% na.omit
.
I think data.table is the winner on this one.