I've got a rather large (around 100k observations) data set, similar to this:
data <- data.frame(
ID = seq(1, 5, 1),
Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"),
stringsAsFactors=F)
data
ID Values
1 1 1,2,3
2 2 4
3 3
4 4 4,1,6,5,1,1,6
5 5 0,0
I want to split the Values column by ","
with NA
for missed cells:
ID v1 v2 v3 v4 v5 v6 v7
1 1 2 3 NA NA NA NA
2 4 NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 4 1 6 5 1 1 6
5 0 0 NA NA NA NA NA
...
Best attempt was strsplit
+ rbind
:
df <- data.frame(do.call(
"rbind",
strsplit(as.character(data$Values), split = "," , fixed = FALSE)
))
But rbind
function just recycles all 'short' rows instead to set an "NA".
Have found similar problem
Many thanks, Leo
Here's a
data.table
combined withreshape2
approach (should be very efficient)I would suggest looking at my
cSplit
function or approaching the problem manually.The
cSplit
approach would simply be:Approaching the problem manually would look like:
^^ That's pretty much what goes on in
cSplit
, but the function has a few other options and some basic error checking and so on that might make it a little bit slower than a purely manual approach (or a function written to address your specific problem).Both of these approaches would be faster than a "data.table" + "reshape2" approach. Also, since each row is treated individually, you shouldn't have any problems even if you have duplicated ID values--your output should have the same number of rows as your input.
Benchmarks
I've done benchmarks on more rows and on data that would give "wider" results (since that's implied in your comments to David's answer).
Here is the sample data:
Here are the functions to test:
Here are the results:
NOTE: The results of
fun1a
andfun1b
would not be the same as those offun2
andfun3
because of the duplicated IDs.