I was searching for a way to split the column content by a separator and converting a table into a long format. I found cSplit
from the splitstackshape
package and it is almost doing what I was looking for.
Problem is now with the drop
option. I expected my split column to be copied in a way, but this does not happen. Am I doing it wrong? Somebody experienced the problem?
I am not sure if I do something wrong, but the drop = FALSE
option is not working in my case.
Here is an example:
library(splitstackshape)
jnk <- data.table(a = '1,2,3,4,5', b = 5)
jnk
# a b
# 1: 1,2,3,4,5 5
cSplit(jnk, 'a', ',', 'long', drop = FALSE)
# a b
# 1: 1 5
# 2: 2 5
# 3: 3 5
# 4: 4 5
# 5: 5 5
What I expected was something like this:
cSplit(jnk, 'a', ',', 'long', drop = FALSE)
# a b a.orig
# 1: 1 5 1,2,3,4,5
# 2: 2 5 1,2,3,4,5
# 3: 3 5 1,2,3,4,5
# 4: 4 5 1,2,3,4,5
# 5: 5 5 1,2,3,4,5
I am using version 1.4.2
The "long" format modifies the column in place using list(unlist(...))
within "data.table", assigned with :=
. Hence, if drop
were used, you would be splitting the column and then removing it!
I'll try to make it explicit in the documentation that drop
is only for the wide
format, or add a message
if a user attempts to use drop
in the long format. Feel free to file a FR or submit a PR.
The workaround would be to assign another column (say, "a_orig") and then do the splitting:
jnk <- data.table(a=c('1,2,3,4,5','1,2,3','2,3'),b=c(5,4,3))
cSplit(jnk[, a_orig := a], "a", ",", "long")
# a b a_orig
# 1: 1 5 1,2,3,4,5
# 2: 2 5 1,2,3,4,5
# 3: 3 5 1,2,3,4,5
# 4: 4 5 1,2,3,4,5
# 5: 5 5 1,2,3,4,5
# 6: 1 4 1,2,3
# 7: 2 4 1,2,3
# 8: 3 4 1,2,3
# 9: 2 3 2,3
# 10: 3 3 2,3
I haven't tested extensively, but a possible fix could be:
cSplit2 <- function(indt, splitCols, sep = ",", direction = "wide",
fixed = TRUE, drop = TRUE, stripWhite = TRUE,
makeEqual = NULL, type.convert = TRUE) {
if (direction == "long" & !drop) {
indt <- as.data.table(indt)
indt[, `:=`(eval(paste(splitCols, "orig", sep = "_")),
lapply(splitCols, function(x) indt[[x]]))]
}
cSplit(indt, splitCols, sep, direction, fixed, drop, stripWhite,
makeEqual, type.convert)
}
The basic idea is to only change the input dataset if direction == "wide"
and drop = FALSE
. This is similar to the idea that you had, but can possibly be the solution integrated into the actual package, somewhere around line 94. In this case, only the indt[, `:=`(eval(paste(splitCols, "orig", sep = "_")), lapply(splitCols, function(x) indt[[x]]))]
part should be necessary.
Thanks for the feedback, I wrote a little function as a workaround.
I had to change the data.table
into data.frame
to get it to work properly. In case of the data.table
I need to set an additional parameter but it crashes with data.frame
. In my case I need most of the time data.frame
so I optimised it for it.
library(splitstackshape)
jnk <- data.frame(a = c('1,2,3,4,5','1,2,3','2,3'),
b = c('a,b,c,d,e','a,b,c','a,b'),
c = c(5,4,3))
jnk
myCSplit <- function(data_set, splitCols, sep = ',', direction = 'long', drop = TRUE, ...) {
if(direction == 'long' & !drop) {
orig_names <- sub('$', '_orig', splitCols)
df <- as.data.frame(data_set[,splitCols])
names(df) <- orig_names
df2 <- cbind(data_set, df)
return(cSplit(df2, splitCols, sep, 'long'))
} else {
return(cSplit(data_set, splitCols, sep, direction, drop = drop,...))
}
}
myCSplit(jnk, 'a', ',')
myCSplit(jnk, 'a', ',', drop = FALSE)
myCSplit(jnk, 'a', ',', 'wide')
myCSplit(jnk, 'a', ',', 'wide', drop = FALSE)
myCSplit(jnk, c('a','b'), ',', 'long', drop = FALSE)