R Strsplit keep delimiter in second element

Posted 2019-07-25 11:20

Question:

I have been trying to solve this little issue for almost 2 hours, but without success. I simply want to split a string at a delimiter: one space followed by an uppercase letter. The second element should keep the letter, whereas the first element should not contain any part of the delimiter. Example:

 x <- "123123 123 A123"
 strsplit(x," [A-Z]")

results in:

"123123 123" "123"

However, this does not keep the letter A in the second element. I have tried using

strsplit(x,"(?<=[A-Z])",perl=T)

but this does not really work for my issue. It would also be okay if there is a space in the second element; it just needs the letter in it.

Answer 1:

If you want to follow your approach, you need to match one or more whitespace characters that are followed by a letter. Put the letter in a lookahead so that only the whitespace is consumed:

> strsplit(x,"\\s+(?=[A-Z])",perl=T)
[[1]]
[1] "123123 123" "A123"

Details:

  • \s+ - one or more whitespace characters (they are part of the match and are therefore removed during splitting)
  • (?=[A-Z]) - an uppercase ASCII letter must appear immediately to the right of the current location, otherwise the match fails (the letter is not part of the match, so it is kept in the result)
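
As a small illustration of the lookahead split, the sketch below applies the same pattern to a character vector. The second input string and the names `parts` and `m` are assumptions added for the example, and the matrix step presumes that every string splits into exactly two pieces.

```r
# Example inputs; the second string is made up for illustration
x <- c("123123 123 A123", "34512 321 B521")

# Split on whitespace that is immediately followed by an uppercase ASCII letter.
# The lookahead (?=[A-Z]) does not consume the letter, so it stays in the result.
parts <- strsplit(x, "\\s+(?=[A-Z])", perl = TRUE)

# Bind the pieces into a two-column character matrix (one row per input string)
m <- do.call(rbind, parts)
print(m)
```

Since strsplit returns a list with one element per input string, binding the elements into a matrix is only one possible way to collect the results.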

You may also match everything up to the last non-whitespace character that is followed by one or more whitespace characters, and use the \K match reset operator to discard the text matched before the whitespace:

> strsplit(x,"^.*\\S\\K\\s+",perl=T)
[[1]]
[1] "123123 123" "A123"  

If the string contains line breaks, add the DOTALL flag (?s), since a dot in a PCRE regex does not match line breaks by default: "(?s)^.*\\S\\K\\s+".

Details:

  • ^ - start of string
  • .* - any zero or more characters, up to the last occurrence of the subsequent subpatterns (that is, \S\s+)
  • \S - a non-whitespace character
  • \K - here, drop all the text matched so far
  • \s+ - one or more whitespace characters
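
To make the DOTALL remark concrete, here is a minimal sketch with an input that contains a line break. The input string and the name `res` are made up for the example and do not come from the question.

```r
# Hypothetical input with a line break inside the first part
x <- "123123\n123 A123"

# (?s) lets the dot match "\n", so ^.* can run across the line break.
# \K drops everything matched so far, so only the last whitespace run
# that follows a non-whitespace character acts as the split point.
res <- strsplit(x, "(?s)^.*\\S\\K\\s+", perl = TRUE)
print(res)
```

With the flag in place, the line break stays inside the first element and the split happens only before the final token.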



Answer 2:

I would go with the stringi package:

library(stringi)
x <- c("123123 123 A123", "34512 321 B521")  # some modified input data

l1 <- stri_split(x, fixed = " ")  # list with one character vector per input string
l1[[1]]
[1] "123123" "123"    "A123"  

Then:

lapply(seq_along(l1), function(i) c(paste0(l1[[i]][1], " ", l1[[i]][2]), l1[[i]][3]))

[[1]] 
[1] "123123 123" "A123"      

[[2]]
[1] "34512 321" "B521"    
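
The approach above hard-codes three space-separated tokens per string. A base R sketch that joins everything before the last token, whatever the token count, might look like the following. The third input string and the helper name `split_last` are hypothetical, and the sketch assumes single spaces as separators and at least two tokens per string.

```r
x <- c("123123 123 A123", "34512 321 B521", "1 2 3 C4")

# Hypothetical helper: keep the last token separate and re-join the rest
split_last <- function(s) {
  lapply(strsplit(s, " ", fixed = TRUE), function(tok) {
    n <- length(tok)
    # tok[-n] is every token except the last one
    c(paste(tok[-n], collapse = " "), tok[n])
  })
}

res <- split_last(x)
print(res)
```

This avoids the extra package at the cost of splitting and re-joining every string, which should be fine for short inputs like these.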


Tags: r regex strsplit