Regular expression in regards to question mark “la

2019-03-05 05:58发布

问题:

I understand the ? mark here means "lazy".

My question essentially is [0-9]{2}? vs [0-9]{2}

Are they same?
If so, why are we writing the former expression? Aren't lazy mode more expensive performance wise?
If not, can you tell the difference?

回答1:

There is not a difference between [0-9]{2} and [0-9]{2}?.

The difference between greedy matching and lazy matching (the addition of a ?) has to do with backtracking. Regular expression engines are built to match text (from left to right). Therefore it is logical that when you ask an expression to match a range of character(s), it matches as many as possible.


Assume we have the string acac123.

If we use a greedy match of [a-z]+c (+ standing for 1+ repetitions or {1,}):

  • [a-z]+ would match acac and fail at 1
  • then we would try to match the c, but fail at 1
  • now we start backtracking, and successfully match aca and c

If we make this lazy ([a-z]+?c), we will get both a different response (in this case) and be more efficient:

  • [a-z]+? would match a, but stop because it sees the next character matches the rest of the expression c
  • the c would then match, successfully matching a and c (with no backtracking)

Now you can see that there will be no difference between X{#} and X{#}?, because {#} is not a range and even a greedy match will not experience any backtracking. Lazily matches are often used with * (0+ repetitions or {0,}) or +, but can also be used with ranges {m,n} (where n is optional).

This is essential when you want to match the least amount of characters possible and you will often see .*? in an expression when you want to fill up some space (foo.*?bar on a string foo bar filler text bar). However, many times a lazy match is an example of bad/inefficient regex. Many people will do something like foo:"(.*?)" to match everything within double quotes, when you can avoid a lazy match by writing your expression like foo:"([^"]+)" and match anything but "s.


Final note, ? typically means "optional" or match {0,1} times. ? only will make a match lazy if you use it on a range ({m,n}, *, +, or another ?). This means X? will not make X lazy (since we already said {#}? is pointless), but instead it will be optional. However, you can do a lazy "optional" match: [0-9]?? will lazily match 0-1 times.



回答2:

What's "lazy" (reluctant) matching?

When matching with regex, the pointer is greedy by default:

Left | Right
\d+    12345
^      ^
\d+    12345
  ^    ^^^^^ Matched!

Lazy is the opposite of greedy:

Left | Right
\d+?   12345
^      ^
\d+?   12345
  ^^    ^
       12345
         ^
       12345
          ^
       12345
           ^ Matched!

Why does it matter?

In matches, quantifiers * + ? are greedy by default. This can lead to unwanted behaviour, especially when we would like certain characters only to match when necessary for match to complete, and omit otherwise.

A typical example is when we want to match a single XML tag: We will fail this with <.*>.

Left | Right
<.*>   <p>hi</p><br /><p>bye</p>
^      ^
<.*>   <p>hi</p><br /><p>bye</p>
 ^^     ^^^^^^^^^^^^^^^^^^^^^^^^
<.*>   <p>hi</p><br /><p>bye</p>
   ^                           < [backtrack!]
<.*>   <p>hi</p><br /><p>bye</p>
   ^                           ^ Matched "<p>hi</p><br /><p>bye</p>"!
Left* | Right
<.*?>   <p>hi</p><br /><p>bye</p>
^       ^
<.*?>   <p>hi</p><br /><p>bye</p>
 ^^^     ^ [can we stop? we're lazy [yes]]
<.*?>   <p>hi</p><br /><p>bye</p>
    ^     ^ Matched "<p>"!

What can I quantify as lazy?

You can add the ? construct behind quantifiers and ranges:

+ (one or more), * (zero or more), ? (optional);
{n,m} (between n and m where n < m), {n,} (n or more), {n} (exactly n times).
(n and m in the examples are real numbers and satisfies n, m ϵ N)

  1.     Reluctant quantifiers are unwilling to push on.
    The match is allowed to match as much as possible or as little as possible, under the consideration that the engine only attempts to match when absolutely necessary for the rest of the rest to succeed. See the following cases:

    Left | Right
    abc*   abccccd
    ^      ^
    abc*   abccccd
     ^      ^
    abc*   abccccd
      ^      ^
    abc*   abccccd
      ^^     ^^^^ Matched "abcccc"!
    
    Left* | Right
    abc*?   abccccd
    ^       ^
    abc*?   abccccd
     ^       ^
    abc*?   abccccd
      ^^^     ^ [must we do this? we're lazy [no]]
               Matched "ab"!
    

    As demonstrated, they match as few as possible.

  2.     Reluctant quantifiers give up to entertain other quantifiers.
    (Demonstration purposes; If anyone asked, I did not tell you it is OK to use RegExp like this.)

    Left | Right
    c+c+   abccccd
    ^        ^
    c+c+   abccccd
    ^^       ^^^^
    c+c+   abccccd
      ^         < [backtrack]
    c+c+   abccccd
      ^^        ^ Matched "cccc"!
                  (c+ -> @ccc; c+ -> @c)
    
    Left* | Right
    c+?c+   abccccd
    ^         ^
    c+?c+   abccccd
    ^^^       ^ [pass]
    c+?c+   abccccd
       ^^      ^^^ Matched "cccc"!
                   (c+? -> @c; c+ -> @c)
    
  3.     Exact range quantifiers are not affected.
    Between X{n} and X{n}?, there are virtually no differences; And most engines internally optimizes away the reluctant flag. This is because the lazy construct only applies when the match is dynamic, of which the engine can behave one way or the other for the quantifier (need or greed), but is not applicable for this case.

Check out regex101, a well-done regular expression engine which comes with explanation and a debugger log to show you the pointer steps. Also read The Stack Overflow Regex Reference!