cSplit library(splitstackshape) is always dropping

2019-06-17 05:07发布

问题:

I was searching for a way to split the column content by a separator and converting a table into a long format. I found cSplit from the splitstackshape package and it is almost doing what I was looking for.

Problem is now with the drop option. I expected my split column to be copied in a way, but this does not happen. Am I doing it wrong? Somebody experienced the problem?

I am not sure if I do something wrong, but the drop = FALSE option is not working in my case.

Here is an example:

library(splitstackshape)
jnk <- data.table(a = '1,2,3,4,5', b = 5)
jnk
#            a b
# 1: 1,2,3,4,5 5

cSplit(jnk, 'a', ',', 'long', drop = FALSE)
#    a b
# 1: 1 5
# 2: 2 5
# 3: 3 5
# 4: 4 5
# 5: 5 5

What I expected was something like this:

cSplit(jnk, 'a', ',', 'long', drop = FALSE)
#    a b    a.orig
# 1: 1 5 1,2,3,4,5
# 2: 2 5 1,2,3,4,5
# 3: 3 5 1,2,3,4,5
# 4: 4 5 1,2,3,4,5
# 5: 5 5 1,2,3,4,5

I am using version 1.4.2

回答1:

The "long" format modifies the column in place using list(unlist(...)) within "data.table", assigned with :=. Hence, if drop were used, you would be splitting the column and then removing it!

I'll try to make it explicit in the documentation that drop is only for the wide format, or add a message if a user attempts to use drop in the long format. Feel free to file a FR or submit a PR.

The workaround would be to assign another column (say, "a_orig") and then do the splitting:

jnk <- data.table(a=c('1,2,3,4,5','1,2,3','2,3'),b=c(5,4,3))
cSplit(jnk[, a_orig := a], "a", ",", "long")
#     a b    a_orig
#  1: 1 5 1,2,3,4,5
#  2: 2 5 1,2,3,4,5
#  3: 3 5 1,2,3,4,5
#  4: 4 5 1,2,3,4,5
#  5: 5 5 1,2,3,4,5
#  6: 1 4     1,2,3
#  7: 2 4     1,2,3
#  8: 3 4     1,2,3
#  9: 2 3       2,3
# 10: 3 3       2,3

I haven't tested extensively, but a possible fix could be:

cSplit2 <- function(indt, splitCols, sep = ",", direction = "wide", 
                    fixed = TRUE, drop = TRUE, stripWhite = TRUE, 
                    makeEqual = NULL, type.convert = TRUE) {
  if (direction == "long" & !drop) {
    indt <- as.data.table(indt)
    indt[, `:=`(eval(paste(splitCols, "orig", sep = "_")),
                lapply(splitCols, function(x) indt[[x]]))]
  }
  cSplit(indt, splitCols, sep, direction, fixed, drop, stripWhite, 
         makeEqual, type.convert)
}

The basic idea is to only change the input dataset if direction == "wide" and drop = FALSE. This is similar to the idea that you had, but can possibly be the solution integrated into the actual package, somewhere around line 94. In this case, only the indt[, `:=`(eval(paste(splitCols, "orig", sep = "_")), lapply(splitCols, function(x) indt[[x]]))] part should be necessary.



回答2:

Thanks for the feedback, I wrote a little function as a workaround. I had to change the data.table into data.frame to get it to work properly. In case of the data.table I need to set an additional parameter but it crashes with data.frame. In my case I need most of the time data.frame so I optimised it for it.

library(splitstackshape)
jnk <- data.frame(a = c('1,2,3,4,5','1,2,3','2,3'),
                  b = c('a,b,c,d,e','a,b,c','a,b'),
                  c = c(5,4,3))
jnk

myCSplit <- function(data_set, splitCols, sep = ',', direction = 'long', drop = TRUE, ...) {
  if(direction == 'long' & !drop) {
    orig_names <- sub('$', '_orig', splitCols)
    df <- as.data.frame(data_set[,splitCols])
    names(df) <- orig_names
    df2 <- cbind(data_set, df)
    return(cSplit(df2, splitCols, sep, 'long'))
  } else {
    return(cSplit(data_set, splitCols, sep, direction, drop = drop,...))
  }
}
myCSplit(jnk, 'a', ',')
myCSplit(jnk, 'a', ',', drop = FALSE)
myCSplit(jnk, 'a', ',', 'wide')
myCSplit(jnk, 'a', ',', 'wide', drop = FALSE)
myCSplit(jnk, c('a','b'), ',', 'long', drop = FALSE)