Question:
I have a data table with 20,000+ rows and one column. The string in each row contains a different number of words. I want to split each string into words and put each word in a new column. I know how to do it one word at a time:
Data[, Word1 := as.character(lapply(strsplit(as.character(Data$complaint), split = " "), "[", 1))]
(Data is my data table and complaint is the name of the column.)
Obviously, this is not efficient, because each row has a different number of words and the call would have to be repeated for every word position.
Could you please tell me about a more efficient way to do this?
Answer 1:
Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).
Assuming KFB's sample data is at least slightly representative of your actual data, you can try:
library(splitstackshape)
cSplit(df, "x", " ")
#     x_1      x_2         x_3 x_4
# 1: This       is interesting  NA
# 2: This actually          is not
Another (blazingly fast) option is stri_split_fixed with simplify = TRUE, from the "stringi" package (which will most likely find its way into the "splitstackshape" code soon):
library(stringi)
stri_split_fixed(df$x, " ", simplify = TRUE)
#      [,1]   [,2]       [,3]          [,4]
# [1,] "This" "is"       "interesting" NA
# [2,] "This" "actually" "is"          "not"
Answer 2:
Two functions, transpose() and tstrsplit(), have been available in data.table since version 1.9.6 on CRAN.
With this we can do:
require(data.table)
setDT(tstrsplit(as.character(df$x), " ", fixed=TRUE))[]
#      V1       V2          V3  V4
# 1: This       is interesting  NA
# 2: This actually          is not
tstrsplit is a wrapper for transpose(strsplit(...)).
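To get the words back into the original table as new columns (which is what the question asks for), the list returned by tstrsplit() can be assigned by reference with `:=`. The following is only a minimal sketch: the two-row `Data` table and the `Word1`, `Word2`, ... column names are made up for illustration, and `complaint` is the column name taken from the question.

```r
library(data.table)

# Hypothetical sample data; `complaint` is the column name used in the question
Data <- data.table(complaint = c("This is interesting", "This actually is not"))

# Split every string once; shorter rows are padded with NA
pieces <- tstrsplit(Data$complaint, " ", fixed = TRUE)

# One illustrative column name per word position: Word1, Word2, ...
new_cols <- paste0("Word", seq_along(pieces))

# Add all word columns to the table by reference in a single assignment
Data[, (new_cols) := pieces]
```

Because the number of pieces is taken from the split result itself, the widest row determines how many columns are created, so nothing has to be counted by hand.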
Answer 3:
Here is a solution based on rbind.fill.matrix(...) in the plyr package. On a dataset with 20,000 rows it runs in about 3.6 sec.
# create a sample dataset - you have this already
library(data.table)
words <- LETTERS[1:10] # "words" are just letters in this example
set.seed(1) # for reproducible example
w <- sapply(1:2e4, function(i) paste(words[sample(1:10, sample(1:10, 1))], collapse = " "))
dt <- data.table(complaint = w) # column named as in the question
head(dt)
#          complaint
# 1:           D F H
# 2:           I J F
# 3:   A B I E C D H
# 4: J D G H B I A E
# 5:         A D G C
# 6:       F E B J I
# you start here...
library(plyr)
# turn each split row into a 1-row matrix, then row-bind them, padding with NA
result <- rbind.fill.matrix(lapply(strsplit(dt$complaint, split = " "), matrix, nrow = 1))
result <- as.data.table(result)
head(result)
#    1 2 3  4  5  6  7  8  9 10
# 1: D F H NA NA NA NA NA NA NA
# 2: I J F NA NA NA NA NA NA NA
# 3: A B I  E  C  D  H NA NA NA
# 4: J D G  H  B  I  A  E NA NA
# 5: A D G  C NA NA NA NA NA NA
# 6: F E B  J  I NA NA NA NA NA
EDIT: Added some benchmarking based on @Ananda's comment below.
f.rfm <- function() as.data.table(rbind.fill.matrix(lapply(strsplit(dt$complaint, split=" "),matrix,nr=1)))
library(splitstackshape)
f.csplit <- function() cSplit(dt, "complaint", " ",type.convert=FALSE)
library(stringi)
f.sl2m <- function() as.data.table(stri_list2matrix(strsplit(dt$complaint, split=" "), byrow = TRUE))
f.ssf <- function() as.data.table(stri_split_fixed(dt$complaint, " ", simplify = TRUE))
all.equal(f.rfm(),f.csplit(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.sl2m(),check.names=FALSE)
# [1] TRUE
all.equal(f.rfm(),f.ssf(),check.names=FALSE)
# [1] TRUE
library(microbenchmark)
microbenchmark(f.rfm(),f.csplit(),f.sl2m(),f.ssf(),times=10)
# Unit: milliseconds
#        expr        min         lq     median        uq        max neval
#     f.rfm() 3566.17724 3589.31203 3606.93303 3665.4087 3719.32299    10
#  f.csplit()   98.05709  102.46456  104.51046  107.9588  117.26945    10
#    f.sl2m()   55.45527   55.58852   56.75406   58.9347   67.44523    10
#     f.ssf()   17.77499   17.98879   18.30831   18.4537   21.62161    10
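So it looks like stri_split_fixed(...) is the winner: by median time it is roughly 200 times faster than rbind.fill.matrix(...) and about 6 times faster than cSplit(...).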
Answer 4:
Example data would be nice, but if I understand what you want, it cannot be done cleanly in a data frame. Since each row contains a different number of words, you will need a list. Even so, splitting the words for the whole object is very simple.
If you run strsplit(as.character(Data$complaint), " ") you will get a list in which each element corresponds to a row of your data frame. From there, there are several ways to rearrange this object, and the best approach will depend on your objective.
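As one possible way to rearrange that list, each element can be padded with NA to a common length and the elements stacked into a character matrix. This is just a base-R sketch on two made-up strings, not the only option; the names `words`, `n_max`, `padded` and `mat` are illustrative.

```r
# Made-up input; in the question this would be as.character(Data$complaint)
x <- c("This is interesting", "This actually is not")

# One list element per row, each holding that row's words
words <- strsplit(x, " ", fixed = TRUE)

# Length of the longest row
n_max <- max(lengths(words))

# Extending a vector with `length<-` pads it with NA
padded <- lapply(words, `length<-`, n_max)

# Stack the equal-length vectors into a character matrix (one row per string)
mat <- do.call(rbind, padded)
```

From here, as.data.frame(mat) gives a rectangular table. If the words are needed in long form instead, the unpadded `words` list is the better starting point.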
Answer 5:
This works for both data.table and data.frame:
# toy data
df <- structure(list(x = structure(c(2L, 1L), .Label = c("This actually is not",
"This is interesting"), class = "factor")), .Names = "x", row.names = c(NA,
-2L), class = "data.frame")
#                      x
# 1  This is interesting
# 2 This actually is not
# the code
split_result <- strsplit(as.character(df$x), " ")
length_n <- sapply(split_result, length)
length_max <- seq_len(max(length_n))
as.data.frame(t(sapply(split_result, "[", i = length_max))) # Or as.data.table(...)
#     V1       V2          V3   V4
# 1 This       is interesting <NA>
# 2 This actually          is  not