difference between [0-9] n times and [0-9]{n} in R

Posted 2020-08-09 09:53

Question:

To the best of my knowledge the two forms should be equivalent, but I actually see a difference. Here is a minimal example taken from this question:

a<-c("/Cajon_Criolla_20141024","/Linon_20141115_20141130",
"/Cat/LIQUID",
"/c_puertas_20141206_20141107",
"/C_Puertas_3_20141017_20141018",
"/c_puertas_navidad_20141204_20141205")

sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
[1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107" "20141018"   
[6] "20141205"   

sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206" "20141017"   
[6] "20141204" 

What am I missing? Or is this a bug?

Answer 1:

This is a bug in the TRE library related to greedy modifiers and capture groups. See:

  • SO question with similar issue
  • Issue #11 on TRE repo
  • Issue #21.


Answer 2:

Setting perl=TRUE gives the same answer (as expected) for both expressions:

> sub("(.*?)_([0-9]{8})(.*)$","\\2",a,perl=TRUE)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"    "20141204"   
> sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a,perl=TRUE)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"    "20141204"
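
As a small sketch of the point above, the check below asserts that both spellings of "eight digits" give the same result under perl=TRUE. The variable names (p_counted, p_repeated, first_dates) are made up for illustration. The last lines show one possible alternative, assuming a PCRE lookbehind is acceptable: they extract the first 8-digit run that follows an underscore without relying on capture groups, and silently drop strings that contain no such run.

```r
a <- c("/Cajon_Criolla_20141024", "/Linon_20141115_20141130",
       "/Cat/LIQUID",
       "/c_puertas_20141206_20141107",
       "/C_Puertas_3_20141017_20141018",
       "/c_puertas_navidad_20141204_20141205")

# Two spellings of "exactly eight digits" (names are illustrative)
p_counted  <- "(.*?)_([0-9]{8})(.*)$"
p_repeated <- "(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$"

# With perl = TRUE the PCRE engine is used instead of TRE
res_counted  <- sub(p_counted,  "\\2", a, perl = TRUE)
res_repeated <- sub(p_repeated, "\\2", a, perl = TRUE)

# Stops with an error if the two spellings ever disagree
stopifnot(identical(res_counted, res_repeated))

# Alternative: first run of 8 digits preceded by "_" (PCRE lookbehind).
# regmatches() returns only the elements that matched, so "/Cat/LIQUID"
# does not appear in the result.
first_dates <- regmatches(a, regexpr("(?<=_)[0-9]{8}", a, perl = TRUE))
print(first_dates)
```

Note that sub() returns non-matching inputs unchanged, whereas the regmatches() variant returns a shorter vector, so the two are not drop-in replacements for each other.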


Answer 3:

Though I was initially convinced by BrodieG's answer, it seems that [0-9] repeated n times and [0-9]{n} are indeed different, at least for the "tre" regex engine. According to http://www.regular-expressions.info, the {} operator is greedy, whereas [0-9] is not.

Hence the right regular expression in my case should be:

    sub("(.*?)_([0-9]{8}?)(.*)$","\\2",a)

Making all the difference:

    sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
    [1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"   
    [6] "20141205"   
    sub("(.*?)_([0-9]{8}?)(.*)$","\\2",a)
    [1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"   
    [6] "20141204"

And even

    > sub("(.*)_([0-9]{8}?)(.*)$","\\2",a)
    [1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"   
    [6] "20141204"

Interpretation: 1) tre reads ? as "move on to the next atom the first time this atom can match". That is always the case for ".*?", since it matches anything, so the engine switches to _[0-9]{8}. When it reaches the first group of 8 digits and {} is not modified by ?, (.*) also matches those first 8 digits, so the search continues to see whether another occurrence of "_[0-9]{8}" can be found on the line. If it meets the second set of 8 digits, it records that as a match too; when it reaches the end of the line, the last recorded match is kept, and [0-9]{8} ends up matched to the second set of 8 digits.

2) When the {} operator is modified by ?, the search stops the first time it sees 8 digits and checks whether (.*) can match the rest. It can, so the first set of 8 digits is returned.

Note that the Perl regex engine works differently:

1) ? after {} doesn't change a thing:

     sub("(.*)_([0-9]{8})","\\2",a,perl=TRUE)
     [1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"   
     [6] "20141205"   
     sub("(.*)_([0-9]{8}?)","\\2",a,perl=TRUE)
     [1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"   
     [6] "20141205"

2) the ? applied to .* makes it stop at the first set of 8 digits:

     sub("(.*?)_([0-9]{8}).*","\\2",a,perl=TRUE)
     [1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"   
     [6] "20141204"   
     sub("(.*)_([0-9]{8}).*","\\2",a,perl=TRUE)
     [1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"   
     [6] "20141205" 

From these two observations, it seems that the two engines interpret greediness differently in these two cases. I have always found the concept of greediness a bit vague ...
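
To reproduce comparisons like the ones above, a small helper can run the same pattern through both engines and put the results side by side. This is only a sketch: compare_engines is a hypothetical name, not part of base R, and the TRE column may differ between R versions, so nothing is asserted about it here.

```r
a <- c("/Cajon_Criolla_20141024", "/Linon_20141115_20141130",
       "/Cat/LIQUID",
       "/c_puertas_20141206_20141107",
       "/C_Puertas_3_20141017_20141018",
       "/c_puertas_navidad_20141204_20141205")

# Hypothetical helper: apply sub() with the default (TRE) engine and with
# PCRE (perl = TRUE), keeping the second capture group in both cases.
compare_engines <- function(pattern, x) {
  data.frame(
    input = x,
    tre   = sub(pattern, "\\2", x),
    pcre  = sub(pattern, "\\2", x, perl = TRUE),
    stringsAsFactors = FALSE
  )
}

lazy_cmp   <- compare_engines("(.*?)_([0-9]{8}).*", a)  # lazy prefix
greedy_cmp <- compare_engines("(.*)_([0-9]{8}).*",  a)  # greedy prefix

print(lazy_cmp)
print(greedy_cmp)

# Rows where the two engines disagree for the lazy pattern, if any
print(lazy_cmp[lazy_cmp$tre != lazy_cmp$pcre, ])
```

Under PCRE the lazy prefix selects the first 8-digit group and the greedy prefix selects the last one, which matches the perl=TRUE outputs shown earlier in this answer.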



Tags: regex r