strsplit inconsistent with gregexpr

Posted 2020-07-03 04:13

Question:

A comment on my answer to this question suggested a strsplit call that should give the desired result but does not, even though the regex appears to correctly match only the first and last commas in a character vector. This can be shown using gregexpr and regmatches.

So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?

#  We would like to split on the first comma and
#  the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"

#  Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34"  "56"  "78"  "90" 


#  Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )

#  Matching positions are at
unlist(m)
#[1]  4 13

#  And extracting them...
regmatches( x , m )
#[[1]]
#[1] "," ","

Huh?! What is going on?

Answer 1:

@Aprillion's theory is correct. From the R documentation for strsplit:

The algorithm applied to each input string is

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}

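In other words, at each iteration ^ matches the beginning of a new, shorter string (the original string with the preceding items removed). After "123," has been consumed, the remaining string is "34,56,78,90", so ^\\w+\\K, matches again at its first comma, and so on for every comma.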
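The documented loop can be sketched directly in R. This is only an illustration of the pseudocode above, not R's actual implementation: split_sketch is a hypothetical helper name, and zero-length matches (which the real strsplit handles specially) are deliberately rejected.

```r
# Hypothetical helper: a literal transcription of the documented loop.
split_sketch <- function(s, pattern) {
  out <- character(0)
  repeat {
    # "if the string is empty, break"
    if (nchar(s) == 0) break
    # Search the *remaining* string only, so ^ anchors at its start.
    m <- regexpr(pattern, s, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start == -1L) {
      # No match: add the rest of the string to the output and stop.
      out <- c(out, s)
      break
    }
    # Assumption of this sketch: zero-length matches are out of scope.
    if (len == 0L) stop("zero-length matches are not handled in this sketch")
    # Add the text to the left of the match to the output...
    out <- c(out, substr(s, 1, start - 1))
    # ...then remove the match and everything to the left of it.
    s <- substr(s, start + len, nchar(s))
  }
  out
}

x <- "123,34,56,78,90"
pat <- '^\\w+\\K,|,(?=\\w+$)'
split_sketch(x, pat)
#[1] "123" "34"  "56"  "78"  "90"
```

Because regexpr is re-run on the shortened string at every pass, the first alternative of the regex succeeds on each pass, which reproduces the strsplit result from the question.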

A simple illustration of this behavior:

> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""

Here, you can see the consequence of this behavior with a lookahead assertion as the delimiter (thanks to @JoshO'Brien for the link).
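If the goal is still to split only at the two positions that gregexpr reports, one possible workaround (not part of the original answer) is to bypass strsplit and ask regmatches for the text between the matches with invert = TRUE. A minimal sketch, reusing x and the regex from the question; the variable name pieces is just for illustration:

```r
x <- "123,34,56,78,90"

# Match positions are computed once, against the whole string,
# so ^ and $ keep their intended meaning.
m <- gregexpr('^\\w+\\K,|,(?=\\w+$)', x, perl = TRUE)

# invert = TRUE returns the substrings between the matches
# instead of the matches themselves.
pieces <- regmatches(x, m, invert = TRUE)[[1]]
pieces
#[1] "123"      "34,56,78" "90"
```

This works because the matching is done in a single pass over the original string rather than being restarted on a progressively shortened one.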