Split strings on third white space from the right

Posted 2019-08-06 04:02

Question:

I would like to split a series of strings on the third white space from the right. The number of white spaces varies among strings, but each string has at least three white spaces. Here are two example strings.

strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')

I would like:

[1] 'abca', 'eagh   ijkl mnop' 
[2] 'dd1 ss j,', 'll bb aa'

The closest I have been able to come is:

strsplit(strings, split = "(?<=\\S)(?=\\s(.*)\\s(.*)\\s(.*)$)", perl = TRUE)

which returns:

[[1]]
[1] "abca"         " eagh"        "   ijkl mnop"

[[2]]
[1] "dd1"       " ss"       " j,"       " ll bb aa"

I keep thinking the answer should be something like:

strsplit(strings, split = "(?<=\\S\\s(.*)\\s(.*)\\s(.*)$)(?=\\s(.*)\\s(.*)\\s(.*)$)", perl = TRUE)

However, that returns an error. Thank you for any advice. I prefer a solution in base R, ideally one that uses regular expressions.

Answer 1:

Try the expression:

(?=(?>\\s\\S*){3}$)\\s

Edit: Use this expression if you want consecutive whitespace characters to be treated as 'one' whitespace:

(?=(?>\\s+\\S*){3}$)\\s

It's worth noting that your expression most likely fails because the PCRE engine used with perl = TRUE, like most regex engines, does not permit variable-width lookbehinds. In your pattern, the * quantifiers inside the lookbehind break that rule.
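To see the failure without stopping a script, the call can be wrapped in tryCatch. This is only a minimal sketch: the names bad_pattern, result and is_error are made up for illustration, and the exact wording of the error message may differ between R and PCRE versions.

```r
strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')

# Pattern from the question: the lookbehind contains `.*`, so its width is not fixed.
bad_pattern <- "(?<=\\S\\s(.*)\\s(.*)\\s(.*)$)(?=\\s(.*)\\s(.*)\\s(.*)$)"

# Capture the error object instead of stopping; the accompanying PCRE
# compilation warning is muffled by suppressWarnings().
result <- tryCatch(
  suppressWarnings(strsplit(strings, split = bad_pattern, perl = TRUE)),
  error = function(e) e
)

is_error <- inherits(result, "error")
if (is_error) message("strsplit failed: ", conditionMessage(result))
```

A lookahead has no such restriction, which is why the expressions in this answer put all of the variable-width matching inside (?= ... ).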

Got it! Sorry I wasn't 100% on how the strsplit function worked. Try this:

strsplit(strings, split = "(?=(?>\\s+\\S*){3}$)\\s", perl = TRUE)

Here is an example output:

> strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')
> strsplit(strings, split = "(?=(?>\\s+\\S*){3}$)\\s", perl = TRUE)
[[1]]
[1] "abca"             "eagh   ijkl mnop"

[[2]]
[1] "dd1 ss j," "ll bb aa" 
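The difference between the two expressions in this answer shows up on the first example string, which contains a run of three spaces. The following is a small sketch for comparing them side by side; the variable names per_char and per_run are only illustrative.

```r
strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')

# Variant 1: every whitespace character counts on its own.
per_char <- strsplit(strings, split = "(?=(?>\\s\\S*){3}$)\\s", perl = TRUE)

# Variant 2: a run of whitespace counts once. The atomic group (?>...) keeps
# \\s+ from giving back part of a run just to make the count of three fit.
per_run <- strsplit(strings, split = "(?=(?>\\s+\\S*){3}$)\\s", perl = TRUE)

print(per_char[[1]])
print(per_run[[1]])
```

On the first string, the per-character variant splits inside the run of three spaces, so the two pieces keep a trailing and a leading space, while the run-based variant returns "abca" and "eagh   ijkl mnop". On the second string, which has only single spaces, both variants give the same result.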


Answer 2:

How about using the following regex, which captures the two parts in separate groups?

(\S*\s*\S*\s*\S*\s*)(.*)

See http://regex101.com/r/lI7aA9
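Note that the pattern in this answer counts tokens from the left, so on 'abca eagh   ijkl mnop' its second group holds only 'mnop'. A capture-group approach that counts from the right can be sketched in base R with regexec and regmatches. This is a different pattern from the one above, not the answerer's own. It assumes that runs of whitespace count as one separator and that the strings do not end in whitespace. The names pattern, matches and pieces are illustrative.

```r
strings <- c('abca eagh   ijkl mnop', 'dd1 ss j, ll bb aa')

# Group 1: everything up to the last non-whitespace character before the split.
# Group 2: exactly three whitespace-separated tokens at the end of the string.
pattern <- "^(.*\\S)\\s+(\\S+\\s+\\S+\\s+\\S+)$"

# regexec() returns the full match plus one entry per capture group.
matches <- regmatches(strings, regexec(pattern, strings, perl = TRUE))

# Drop the full match (first element) and keep the two captured pieces.
pieces <- lapply(matches, function(m) m[-1])
print(pieces)
```

Unlike strsplit, this drops the whole whitespace run at the split point, and a string that does not match the pattern yields an empty character vector.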