Split different lengths values and bind to columns

I've got a rather large (around 100k observations) data set, similar to this:

data <- data.frame(
                 ID = seq(1, 5, 1),
                 Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"), 
                 stringsAsFactors=F)
data
  ID        Values
1  1         1,2,3
2  2             4
3  3              
4  4 4,1,6,5,1,1,6
5  5           0,0

I want to split the Values column by "," with NA for missed cells:

ID v1 v2 v3 v4 v5 v6 v7
1  1  2  3  NA NA NA NA
2  4  NA NA NA NA NA NA
3  NA NA NA NA NA NA NA
4  4  1  6  5  1  1  6
5  0  0  NA NA NA NA NA
...

Best attempt was strsplit + rbind:

df <- data.frame(do.call(
                        "rbind",
                        strsplit(as.character(data$Values), split = "," , fixed = FALSE)
                        ))

But rbind function just recycles all 'short' rows instead to set an "NA". Have found similar problem

Many thanks, Leo

标签： r data.table rbind strsplit

2条回答

迷人小祖宗

2楼-- · 2019-01-15 03:49

Here's a data.table combined with reshape2 approach (should be very efficient)

library(data.table) # Loading `data.table` package
data2 <- setDT(data)[, list(Values = unlist(strsplit(Values, ","))), by = ID] # splitting the values by `,` for each `ID`
data2[, Var := paste0("v", seq_len(.N)), by = ID] # Adding the `Var` variable

library(reshape2) # Loading `reshape2` package
dcast.data.table(data2, ID ~ Var, fill = NA_character_, value.var = "Values") # decasting

#    ID v1 v2 v3 v4 v5 v6 v7
# 1:  1  1  2  3 NA NA NA NA
# 2:  2  4 NA NA NA NA NA NA
# 3:  3    NA NA NA NA NA NA
# 4:  4  4  1  6  5  1  1  6
# 5:  5  0  0 NA NA NA NA NA

0人赞添加讨论(0) 举报

走好不送

3楼-- · 2019-01-15 04:06

I would suggest looking at my cSplit function or approaching the problem manually.

The cSplit approach would simply be:

cSplit(data, "Values", ",")
#    ID Values_1 Values_2 Values_3 Values_4 Values_5 Values_6 Values_7
# 1:  1        1        2        3       NA       NA       NA       NA
# 2:  2        4       NA       NA       NA       NA       NA       NA
# 3:  3                NA       NA       NA       NA       NA       NA
# 4:  4        4        1        6        5        1        1        6
# 5:  5        0        0       NA       NA       NA       NA       NA

Approaching the problem manually would look like:

## Split up the values
Split <- strsplit(data$Values, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),
            ncol = max(Ncol), 
            dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol), 
        sequence(Ncol))] <- unlist(Split, use.names = FALSE)
## Bind the values back together, here as a "data.table" (faster)
data.table(ID = data$ID, M)

^^ That's pretty much what goes on in cSplit, but the function has a few other options and some basic error checking and so on that might make it a little bit slower than a purely manual approach (or a function written to address your specific problem).

Both of these approaches would be faster than a "data.table" + "reshape2" approach. Also, since each row is treated individually, you shouldn't have any problems even if you have duplicated ID values--your output should have the same number of rows as your input.

Benchmarks

I've done benchmarks on more rows and on data that would give "wider" results (since that's implied in your comments to David's answer).

Here is the sample data:

set.seed(1)
a <- sample(0:100, 100000, TRUE)
Values <- vapply(a, function(x) 
  paste(sample(0:100, x, TRUE), collapse = ","), character(1L))
Values[sample(length(Values), length(Values) * .15)] <- ""
ID <- c(1:80000, 1:20000)
data <- data.frame(ID, Values, stringsAsFactors = FALSE)
DT <- as.data.table(data)

Here are the functions to test:

fun1a <- function(inDT) {
  data2 <- DT[, list(Values = unlist(
    strsplit(Values, ","))), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID] 
  dcast.data.table(data2, ID ~ Var, 
                   fill = NA_character_, 
                   value.var = "Values")
}

fun1b <- function(inDT) {
  data2 <- DT[, list(Values = unlist(
    strsplit(Values, ",", fixed = TRUE), 
    use.names = FALSE)), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID] 
  dcast.data.table(data2, ID ~ Var, 
                   fill = NA_character_, 
                   value.var = "Values")
}

fun2 <- function(inDT) {
  cSplit(DT, "Values", ",")
}

fun3 <- function(inDF) {
  Split <- strsplit(inDF$Values, ",", fixed = TRUE)
  Ncol <- vapply(Split, length, 1L)
  M <- matrix(NA_character_, nrow = nrow(inDF),
              ncol = max(Ncol), 
              dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
  M[cbind(rep(1:nrow(inDF), Ncol), 
          sequence(Ncol))] <- unlist(Split, use.names = FALSE)
  data.table(ID = inDF$ID, M)
}

Here are the results:

library(microbenchmark)
microbenchmark(fun2(DT), fun3(data), times = 20)
# Unit: seconds
#        expr      min       lq   median       uq      max neval
#    fun2(DT) 4.810942 5.173103 5.498279 5.622279 6.003339    20
#  fun3(data) 3.847228 3.929311 4.058728 4.160082 4.664568    20

## Didn't want to microbenchmark here...
system.time(fun1a(DT))
#    user  system elapsed 
#   16.92    0.50   17.59
system.time(fun1b(DT))  # fixed = TRUE & use.names = FALSE
#    user  system elapsed 
#   11.54    0.42   12.01

NOTE: The results of fun1a and fun1b would not be the same as those of fun2 and fun3 because of the duplicated IDs.

0人赞添加讨论(0) 举报

Split different lengths values and bind to columns

Benchmarks

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间