Strsplit on a column of a data frame [duplicate]

2019-02-27 03:11发布

问题:

This question already has an answer here:

  • Split comma-separated strings in a column into separate rows 4 answers

I have a data.frame where one of the variables is a vector (or a list), like this:

MyColumn <- c("A, B,C", "D,E", "F","G")
MyDF <- data.frame(group_id=1:4, val=11:14, cat=MyColumn)

#   group_id val    cat
# 1        1  11 A, B,C
# 2        2  12    D,E
# 3        3  13      F
# 4        4  14      G

I'd like to have a new data frame with as many rows as the vector

FlatColumn <- unlist(strsplit(MyColumn,split=","))

which looks like this:

MyNewDF <- data.frame(group_id=c(rep(1,3),rep(2,2),3,4), val=c(rep(11,3),rep(12,2),13,14), cat=FlatColumn)

#   group_id val cat
# 1        1  11   A
# 2        1  11   B
# 3        1  11   C
# 4        2  12   D
# 5        2  12   E
# 6        3  13   F
# 7        4  14   G

In essence, for every factor which is an element of the list of MyColumn (the letters A to G), I want to assign the corresponding values of the list. Every factor appears only once in MyColumn.

Is there a neat way for this kind of reshaping/unlisting/merging? I've come up with a very cumbersome for-loop over the rows of MyDF and the length of the corresponding element of strsplit(MyColumn,split=","). I'm very sure that there has to be a more elegant way.

回答1:

How about

lst <- strsplit(MyColumn, split = ",")
k <- lengths(lst)    ## expansion size
FlatColumn <- unlist(lst, use.names = FALSE)
MyNewDF <- data.frame(group_id = rep.int(MyDF$group_id, k),
                      val = rep.int(MyDF$val, k),
                      cat = FlatColumn)

#  group_id val cat
#1        1  11   A
#2        1  11   B
#3        1  11   C
#4        2  12   D
#5        2  12   E
#6        3  13   F
#7        4  14   G


回答2:

You can use separate_rows from tidyr:

tidyr::separate_rows(MyDF, cat)

#   group_id val cat
# 1        1  11   A
# 2        1  11   B
# 3        1  11   C
# 4        2  12   D
# 5        2  12   E
# 6        3  13   F
# 7        4  14   G


回答3:

We can use cSplit from splitstackshape

library(splitstackshape)
cSplit(MyDF, "cat", ",", "long")
#    group_id val cat
#1:        1  11   A
#2:        1  11   B
#3:        1  11   C
#4:        2  12   D
#5:        2  12   E
#6:        3  13   F
#7:        4  14   G

We can also use do with base R with strsplit to split the 'cat' column into a list, replicate the sequence of rows of 'MyDF' with the lengths of 'lst', and create the 'cat' column by unlisting the 'lst'.

lst <- strsplit(as.character(MyDF$cat), ",")
transform(MyDF[rep(1:nrow(MyDF), lengths(lst)),-3], cat = unlist(lst))