I have a data table containing 20000+ rows and one column. The string in each row has a different number of words. I want to split the words and put each of them in a new column. I know how to do it word by word:
Data[, Word1 := as.character(lapply(strsplit(as.character(Data$complaint), split = " "), "[", 1))]
(Data is my data table and complaint is the name of the column.)
Obviously, this is not efficient, because each row has a different number of words.
Could you please tell me about a more efficient way to do this?
Example data would be nice, but if I understand what you want, it cannot be done properly in a data frame. Given that there are different numbers of words in each row, you will need a list. Even so, it is very simple to split the words in the whole object.
If you run
strsplit(as.character(Data[[1]]), " ")
you will get a list with each element corresponding to a row in your data frame. (Data[[1]] extracts the first column as a vector for both a data.frame and a data.table, whereas Data[,1] does not return a vector on a data.table.) From that, there are several ways to rearrange this object, but the best approach will depend on your objective.
OK for both data.table and data.frame
Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).
Assuming KFB's sample data is at least slightly representative of your actual data, you can try:
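The code that originally followed is not available here, so the following is only a minimal sketch of a cSplit call. The sample data is hypothetical (it is not KFB's), and the column name complaint is borrowed from the question.

```r
library(data.table)
library(splitstackshape)

# Hypothetical sample data, not the data used in the original answer
Data <- data.table(complaint = c("the quick brown fox", "hello world", "one"))

# Split the column on spaces. The default direction = "wide" creates
# one new column per word position and pads shorter rows with NA.
wide <- cSplit(Data, "complaint", sep = " ")
print(wide)
```

The input can just as well be a plain data.frame; the result is a data.table either way.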
Another (blazing) option is to use stri_split_fixed with simplify = TRUE (from "stringi"), which is obviously destined to enter the "splitstackshape" code soon:
Here is a solution based on rbind.fill.matrix(...) in the plyr package. On a dataset with 20,000 rows it runs in about 3.6 sec.
EDIT: Added some benchmarking based on @Ananda's comment below.
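The original code and benchmark output are not reproduced here. As a rough sketch of how the two approaches could be written and timed, assuming hypothetical sample data and a made-up 20,000-element test vector (big):

```r
library(plyr)
library(stringi)

# Hypothetical sample data
complaint <- c("the quick brown fox", "hello world", "one")

# plyr: turn each word vector into a one-row matrix, then stack the
# matrices; rbind.fill.matrix pads the shorter rows with NA
words  <- strsplit(complaint, " ", fixed = TRUE)
m_plyr <- rbind.fill.matrix(lapply(words, function(w) matrix(w, nrow = 1)))

# stringi: simplify = TRUE returns a character matrix directly,
# padding shorter rows with empty strings (simplify = NA pads with NA)
m_stri <- stri_split_fixed(complaint, " ", simplify = TRUE)

# Timing on a larger, made-up input; results depend on machine and data
big <- rep(complaint, length.out = 20000)
t_plyr <- system.time(
  rbind.fill.matrix(lapply(strsplit(big, " ", fixed = TRUE),
                           function(w) matrix(w, nrow = 1)))
)
t_stri <- system.time(stri_split_fixed(big, " ", simplify = TRUE))
print(t_plyr)
print(t_stri)
```

Either matrix can be turned into a data.table with as.data.table() if columns rather than a matrix are wanted.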
So it looks like stri_split_fixed(...) is the winner.
Two functions, transpose() and tstrsplit(), are available in data.table since version 1.9.6 on CRAN.
With this we can do:
tstrsplit is a wrapper for transpose(strsplit(...)).
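A minimal sketch of how tstrsplit() might be applied to the question's setup. The sample data and the Word1, Word2, ... column names are assumptions for illustration, not taken from the original answer.

```r
library(data.table)

# Hypothetical sample data using the column name from the question
Data <- data.table(complaint = c("the quick brown fox", "hello world", "one"))

# tstrsplit() returns a list with one vector per word position;
# rows with fewer words are padded with NA
parts <- tstrsplit(Data$complaint, " ", fixed = TRUE)

# Add one column per word position by reference
cols <- paste0("Word", seq_along(parts))
Data[, (cols) := parts]
print(Data)
```

Because the number of new columns is taken from the split result, nothing needs to be known in advance about the longest row.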