Question:
I have a vector consisting of full names, with the first and last name separated by a comma. This is what the first few elements look like:
> head(val.vec)
[1] "Aabye,ֲ Edgar" "Aaltonen,ֲ Arvo" "Aaltonen,ֲ Paavo"
[4] "Aalvik Grimsb,ֲ Kari" "Aamodt,ֲ Kjetil Andr" "Aamodt,ֲ Ragnhild"
I am looking for a way to split them into two separate columns of first and last name. My final intention is to have both of them as part of a bigger data frame.
I tried using the strsplit function like this:
names<-unlist(strsplit(val.vec,','))
but it gave me one long vector instead of two separate sets. I know it is possible to loop over all the elements and place the first and last names in two separate vectors, but that is a little time-consuming considering that there are about 25,000 records.
I saw a few similar questions, but the discussion was about how to do it in C+ and Java.
Answer 1:
We can use read.csv to convert the vector into a data.frame with 2 columns:
read.csv(text=val.vec, header=FALSE, stringsAsFactors=FALSE)
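As a hedged variation on the call above, the sketch below also names the two columns and trims the whitespace that follows the comma. The sample vector uses ordinary spaces, and the column names last_name and first_name are my own choices for illustration, not something from the question.

```r
# Sample data with ordinary spaces after the comma (an assumption; the
# vector in the question may contain a different whitespace character).
val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aalvik Grimsb, Kari")

# col.names labels the two columns (names chosen here for illustration);
# strip.white = TRUE trims whitespace around each unquoted field.
df <- read.csv(text = val.vec, header = FALSE, stringsAsFactors = FALSE,
               col.names = c("last_name", "first_name"), strip.white = TRUE)
df
```

If the separator is followed by something other than a plain space or tab, strip.white may not remove it, so check the result on the real data.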
Or, if we are using strsplit, instead of unlisting (which will convert the whole list to a single vector), we can extract the first and second elements of each list element separately to create two vectors ('v1' and 'v2').
lst <- strsplit(val.vec,',')
# sapply simplifies the result to a character vector (lapply would return a list)
v1 <- sapply(lst, `[`, 1)
v2 <- sapply(lst, `[`, 2)
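Since the end goal is a data frame, here is a minimal sketch of going from the strsplit result to two columns. It assumes every element contains exactly one comma. The sample data uses ordinary spaces, and the column names are hypothetical.

```r
# Sample data with ordinary spaces after the comma (an assumption).
val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aamodt, Kjetil Andr")

# fixed = TRUE treats "," as a literal string rather than a regex.
lst <- strsplit(val.vec, ",", fixed = TRUE)

# vapply guarantees a character vector with one value per input element.
last_name  <- vapply(lst, function(x) x[1], character(1))
first_name <- trimws(vapply(lst, function(x) x[2], character(1)))

# Column names are illustrative, not taken from the question.
df <- data.frame(last_name, first_name, stringsAsFactors = FALSE)
df
```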
Yet another option would be sub:
v1 <- sub(",.*", "", val.vec)
v2 <- sub("[^,]+,", "", val.vec)
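To spell out what the two patterns do: the first deletes everything from the first comma onward, and the second deletes everything up to and including the first comma. The sketch below is a small variation that also consumes whitespace after the comma, assuming ordinary spaces in the data.

```r
# Sample data with ordinary spaces after the comma (an assumption).
val.vec <- c("Aabye, Edgar", "Aalvik Grimsb, Kari", "Aamodt, Kjetil Andr")

# ",.*" matches the first comma and the rest of the string.
last_name <- sub(",.*", "", val.vec)

# "^[^,]+,\\s*" matches the leading non-comma characters, the comma,
# and any whitespace that follows it.
first_name <- sub("^[^,]+,\\s*", "", val.vec)

last_name
first_name
```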
data
val.vec <- c("Aabye,ֲ Edgar", "Aaltonen,ֲ Arvo", "Aaltonen,ֲ Paavo",
"Aalvik Grimsb,ֲ Kari", "Aamodt,ֲ Kjetil Andr", "Aamodt,ֲ Ragnhild")
Answer 2:
Another option:
library(stringi)
stri_split_fixed(val.vec, ",", simplify = TRUE)
Which gives:
#     [,1]            [,2]
#[1,] "Aabye"         "ֲ Edgar"
#[2,] "Aaltonen"      "ֲ Arvo"
#[3,] "Aaltonen"      "ֲ Paavo"
#[4,] "Aalvik Grimsb" "ֲ Kari"
#[5,] "Aamodt"        "ֲ Kjetil Andr"
#[6,] "Aamodt"        "ֲ Ragnhild"
Should you want the result in a data.frame, you could wrap it in as.data.frame().
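A rough sketch of that last step, assuming a two-column character matrix like the one above. To keep the example free of package dependencies, the matrix is built here with base R instead of stringi. The column names are hypothetical.

```r
# Sample data with ordinary spaces after the comma (an assumption).
val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aalvik Grimsb, Kari")

# Stand-in for the split matrix: one row per name, one column per part.
m <- do.call(rbind, strsplit(val.vec, ",", fixed = TRUE))

# as.data.frame() names the columns V1 and V2 by default; rename them.
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("last_name", "first_name")

# Drop the whitespace left over after the comma.
df$first_name <- trimws(df$first_name)
df
```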
Answer 3:
Just wrap your function call in an sapply call:
val.vec = c("Aabye,ֲ Edgar", "Aaltonen,ֲ Arvo", "Aaltonen,ֲ Paavo", "Aalvik Grimsb,ֲ Kari", "Aamodt,ֲ Kjetil Andr", "Aamodt,ֲ Ragnhild")
names = t(sapply(val.vec, function(x) unlist(strsplit(x,','))))
names
#> names
#                      [,1]            [,2]
#Aabye,? Edgar         "Aabye"         "? Edgar"
#Aaltonen,? Arvo       "Aaltonen"      "? Arvo"
#Aaltonen,? Paavo      "Aaltonen"      "? Paavo"
#Aalvik Grimsb,? Kari  "Aalvik Grimsb" "? Kari"
#Aamodt,? Kjetil Andr  "Aamodt"        "? Kjetil Andr"
#Aamodt,? Ragnhild     "Aamodt"        "? Ragnhild"
Using the solution you tried, we can coerce it to two columns:
val.vec = c("Aabye,ֲ Edgar", "Aaltonen,ֲ Arvo", "Aaltonen,ֲ Paavo", "Aalvik Grimsb,ֲ Kari", "Aamodt,ֲ Kjetil Andr", "Aamodt,ֲ Ragnhild")
names = matrix(unlist(strsplit(val.vec,',')), ncol = 2L, byrow = TRUE)
#> names
#     [,1]            [,2]
#[1,] "Aabye"         "? Edgar"
#[2,] "Aaltonen"      "? Arvo"
#[3,] "Aaltonen"      "? Paavo"
#[4,] "Aalvik Grimsb" "? Kari"
#[5,] "Aamodt"        "? Kjetil Andr"
#[6,] "Aamodt"        "? Ragnhild"
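One caveat with filling a matrix from the unlisted pieces: it only lines up if every element splits into exactly two parts. A hedged sketch of a sanity check before building the matrix, using sample data with ordinary spaces (an assumption):

```r
# Sample data with ordinary spaces after the comma (an assumption).
val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aamodt, Kjetil Andr")

parts <- strsplit(val.vec, ",", fixed = TRUE)

# lengths() gives the number of pieces per element; anything other
# than 2 would shift values across rows once the list is flattened.
n_parts <- lengths(parts)
stopifnot(all(n_parts == 2L))

# Safe to fill row by row now: two pieces per name.
names_mat <- matrix(unlist(parts), ncol = 2L, byrow = TRUE)
names_mat
```

With 25,000 records, a check like this costs very little and catches names with no comma or with more than one.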
Benchmarking it against the (very fast) do.call(rbind, ...) solution proposed by Richard Scriven, we can see that yours and his perform about equally:
#> library(microbenchmark)
#> microbenchmark(
#+ names_1 = do.call(rbind, strsplit(val.vec, ",")),
#+ names_2 = matrix(unlist(strsplit(val.vec,',')), ncol = 2L, byrow = TRUE),
#+ times = 10000L
#+ )
#Unit: microseconds
#    expr    min     lq     mean median     uq      max neval cld
# names_1 12.596 13.530 15.08867 13.996 14.463  513.185 10000   b
# names_2 11.663 12.131 14.03413 12.597 13.530 1436.917 10000  a
Answer 4:
If you are into the dplyr way of doing things, have a look at separate from the tidyr package:
library(dplyr)
library(tidyr)
dat = data.frame(val = c("Lee, John", "Lee, Spike", "Doe, John",
"Longstocking, Pippy", "Bond, James", "Jordan, Michael"))
#                   val
# 1           Lee, John
# 2          Lee, Spike
# 3           Doe, John
# 4 Longstocking, Pippy
# 5         Bond, James
# 6     Jordan, Michael
dat %>%
  separate(val, c('last_name', 'first_name'), sep = ',') %>%
  mutate(first_name = trimws(first_name))
#      last_name first_name
# 1          Lee       John
# 2          Lee      Spike
# 3          Doe       John
# 4 Longstocking      Pippy
# 5         Bond      James
# 6       Jordan    Michael
I added the call to trimws to get rid of the leading whitespace.