Perl regex matching optional phrase in longer sent

2020-02-26 13:06发布

I'm trying to match an optional (possibly present) phrase in a sentence:

perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3\n" if m/(word1).*(word2)?.*(word3)/'

Output:

1:word1 2: 3:word3

I know the first '.*' is being greedy and matching everything up to 'word3'. Making it non-greedy doesn't help:

perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3\n" if m/(word1).*?(word2)?.*(word3)/'

Output:

1:word1 2: 3:word3 

There seems to be a conflict of interest here. I would have thought Perl would match (word2)? if possible and still satify the non-greedy .*?. At least that's my understanding of '?'. The Perl regex page says '?' makes 1 or zero times so shouldn't it prefer one match rather than zero?

Even more confusing is if I capture the .*?:

perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3 4:$4\n" if m/(word1)(.*?)(word2)?.*(word3)/'

Output:

1:word1 2: 3: 4:word3

All groups here are capturing groups so I don't know why they are empty.

Just to make sure the inter-word space isn't being captured:

perl -e '$_="word1_word2_word3"; print "1:$1 2:$2 3:$3 4:$4\n" if m/(word1)(.*?)(word2)?.*(word3)/'

Output:

1:word1 2: 3: 4:word3

Given the only match not capturing is the one between word2 and word3 I can only assume that it's the one doing the matching. Sure enough:

perl -e '$_="word1_word2_word3"; print "1:$1 2:$2 3:$3 4:$4 5:$5\n" if m/(word1)(.*?)(word2)?(.*)(word3)/'

Output:

1:word1 2: 3: 4:_word2_ 5:word3

So the greedy matching is working backwards, and Perl is happy to match zero (rather than one) instance of word2. Making it non-greedy doesn't help either.

So my question is: how can I write my regex to match and capture a possible phrase in a sentence? My examples given here are simplistic; the actual sentence I am parsing is much longer with many words between those I am matching, so I can't assume any length or composition of intervening text.

Many thanks, Scott

标签: regex perl
3条回答
疯言疯语
2楼-- · 2020-02-26 13:39

In order to solve your issue, you have to observe that the catch-all subexpression in your regex match material that you do not want them to:

 (word1).*(word2)?.*(word3)
        --
         ^--- this subexpression matches _all_ material between `word1` and `word3` in the test string, in particular `word2` if it is present

 (word1).*? (word2)? .*(word3)
        ---+--------+--
         ^       ^   ^-- this subexpression matches _all_ material between `word1` and `word3` in the test string, in particular `word2` if it is present
         |       |
         |       +------ this subexpression is empty, even if `word2` is present:
         |               - the preceding subexpression `.*?` matches minimally (ie. the empty string)
         |               - `(word2)?` cannot match for the preceding blank.
         |               - the following subexpression `.*` matches everything up to `word3`, including `word2`.
         |
         |               -> the pattern matches _as desired_ for test strings
         |                  where `word2` immediately follows `word1` without  
         |
         +-------------- this subexpression will always be empty

What you need is a construction that prevents the catch-all to match strings that contain word2. Luckily, perl's regex syntax sports the negative lookbehind that serves the purpose: for each character in the match of the catch-all subexpression, make sure that it is not preceded by word2.

In perl:

/(word1).*(word2).*(word3)|word1((?<!word2).)*word3/

Caveats

  1. This might be a performance hog.
  2. Note that word2 must be a literal, as the regex engine only supports patterns with match lengths known a priori.

Alternative solution

Given the Caveats you might try to alter the control logic:

$teststring = $_;
if ($teststring =~ m/(word1).*(word2).*(word3)/) {
    print \"1:$1 2:$2 3:$3\n\";
}
else {
    # You know by now that there is no word2 between any word1, word3 occurrences 
    if ($teststring =~ m/(word1).*(word3)/) {
        print \"1:$1 2:- 3:$2\n\";
    }
}
查看更多
够拽才男人
3楼-- · 2020-02-26 14:01

You can use the branch reset construct as a workaround:

 (word1)(?|.*?(word2).*?(word3)|().*?(word3))
#^            ^         ^       ^    ^---- group 3
#|            |         |       '--------- group 2
#|            |         '----------------- group 3
#|            '--------------------------- group 2
#'---------------------------------------- group 1

The main interest of a branch reset group (?|...()...()|...()...()) is that capture groups have the same numbers in each branch. Instead of making the group 2 optional, you can use a first branch where the group is mandatory, and a second where it is empty (or you can populate it with an always failing pattern and add a ? after it).

查看更多
该账号已被封号
4楼-- · 2020-02-26 14:02

BACKGROUND: HOW LAZY AND GREEDY QUANTIFIERS WORK

You need to understand how greedy and lazy quantifiers work. Greedy ones will grab the text their patterns can match at once, and then the engine will backtrack, i.e. it will try to go back to the place where the greedily quantified subpattern matched the substring, trying to check if the next subpattern can be matched.

Lazy matching patterns just match the minimum characters first, and then tries to match with the rest of the subpatterns. With *?, it matches zero characters, an empty space, and then goes on to check if the next pattern can be matched, and only if it cannot, the lazy subpattern will be "expanded" to include one more character, and so on.

So, (word1).*(word2)?.*(word3) will match the word2 with the first .* (and the second .* will match an empty space as the first .* is greedy. Although you can think that (word2)? is greedy and thus must be backtracked to, the answer is no, because the the first .* grabbed all the string, and then the engine went backwards looking for the match. Since (word2)? matches an empty string, it always matched, and word3 was matched first from the end of the string. See this demo and check the regex debugger section.

You thought, let's use lazy matching with the first .\*?. The problem with (word1).*?(word2)?.*(word3) (that matches word2 with the second .* that is greedy) is a bit different as it could match the optional group. How? The first .*? matches zero characters, then tries to match all subsequent subpatterns. Thus, it found word1, then an empty string, and did not find the word2 right after word1. If word2 were right after word1, there would be a match with the first .*?. See this demo.

SOLUTION

There are two solutions that I see at this moment, and they both consist in making the second optional group "exclusive" for the rest of the pattern, so that the regex engine could not skip it if found.

  • A branch reset solution provided by Casimir above. Its disadvantage is that it cannot be ported to many other regex flavors that do not support branch reset. See description in the original answer.
  • Use a tempered greedy token: (word1)(?:(?!word2).)*(word2)?.*?(word3). It is less efficient than the branch reset solution, but can be ported to JS, Python, and most other regex flavors supporting lookaheads. How does that work? (?:(?!word2).)* matches 0+ occurrences of any character other than a newline (with /s, even including a newline) that does not start a literal character sequence word2. If w is matched, it cannot be followed with ord2 for the construct to match. Thus, when it reaches word2, it stops and lets the subsequent subpattern - (word2)? - match and capture the following word2. *To make this approach more efficient**, use unroll the loop technique: (word1)[^w]*(?:w(?!ord2)[^w]*)*(word2)?.*?(word3).
查看更多
登录 后发表回答