Question:

I have a list of regular expressions (about 10 to 15) that I need to match against some text. Matching them one by one in a loop is too slow. Instead of writing my own state machine to match all the regexes at once, I am trying to join the individual regexes with | and let Perl do the work. The problem is: how do I know which of the alternatives matched?

This question addresses the case where there are no capturing groups inside each individual regex (which portion is matched by regex?). What if there are capturing groups inside each regex?

So with the following,

/^(A(\d+))|(B(\d+))|(C(\d+))$/

and the string "A123", how can I both know that A123 matched and extract "123"?

Answer 1:

You don't need to code up your own state machine to combine regexes. Look into Regexp::Assemble. It has methods that track which of your initial patterns matched.

Edit:

use strict;
use warnings;

use 5.012;

use Regexp::Assemble;

my $string = 'A123';

my $re = Regexp::Assemble->new(track => 1);
for my $pattern (qw/ A(\d+) B(\d+) C(\d+) /) {
  $re->add($pattern);
}

say $re->re; ### (?-xism:(?:A(\d+)(?{0})|B(\d+)(?{2})|C(\d+)(?{1})))
say for $re->match($string); ### A(\d+)
say for $re->capture; ### 123

Answer 2:

Why not use /^ (?<prefix> A|B|C) (?<digits> \d+) $/x? Note that the named capture groups are used for clarity and are not essential.
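
As a minimal sketch of how that pattern might be used (the helper name split_code is hypothetical, and named captures assume Perl 5.10 or later), the prefix and digits can be read back from the %+ hash:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prefix, digits), or an empty list on no match.
sub split_code {
    my ($text) = @_;
    # /x allows whitespace in the pattern; %+ holds the named captures.
    return unless $text =~ /^ (?<prefix> A|B|C ) (?<digits> \d+ ) $/x;
    return ($+{prefix}, $+{digits});
}

my ($prefix, $digits) = split_code('A123');
print "prefix=$prefix digits=$digits\n";
```

The prefix tells you which alternative matched, so no separate bookkeeping is needed as long as the alternatives differ only in that prefix.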

Answer 3:

A123 will be in capture group $1 and 123 will be in group $2

So you could say:

if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ && $1 eq 'A123' && $2 eq '123' ) {
    ...
}

This is redundant, but you get the idea...

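EDIT: No, you don't have to enumerate each submatch. You asked how to know whether A123 matched and how to extract 123:

For the input A123, the if block is entered because the first alternative matched, and you can extract 123 from the $2 capture variable. (If the B or C alternative matches instead, its digits land in $4 or $6, and $1 and $2 are undefined.)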

So maybe this example would have been more clear:

if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ ) {
    # do something with $2, which will be '123' assuming $_ matches /^A123/
}
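
One way to tell the alternatives apart with plain numbered groups is to check which outer group is defined after the match. The sketch below is an illustration rather than part of the original answer: the helper name which_alternative is hypothetical, and the pattern is wrapped in (?: ... ) so that ^ and $ apply to every alternative instead of only the first and last.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (label, digits) for the alternative that matched.
sub which_alternative {
    my ($text) = @_;
    # (?: ... ) makes the anchors cover all three alternatives.
    return unless $text =~ /^(?:(A(\d+))|(B(\d+))|(C(\d+)))$/;
    # Groups belonging to alternatives that did not match are undefined.
    return ('A', $2) if defined $1;
    return ('B', $4) if defined $3;
    return ('C', $6) if defined $5;
    return;
}

for my $text (qw(A123 B456 C769 D000)) {
    my ($label, $digits) = which_alternative($text);
    print defined $label
        ? "$text: alternative $label, digits $digits\n"
        : "$text: no match\n";
}
```

The group numbers shift whenever an alternative gains or loses a capturing group, which is the main drawback of this approach with a longer list of patterns.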

EDIT 2:

To capture matches in an AoA (which is a different question, but this should do it):

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Collect [prefix, digits] pairs; lines that do not match contribute nothing.
my @matches = map { /^(?:(A|B|C)(\d+))$/ ? [$1, $2] : () } <DATA>;
print Dumper \@matches;

__DATA__
A123
B456
C769

Result:

$VAR1 = [
          [
            'A',
            '123'
          ],
          [
            'B',
            '456'
          ],
          [
            'C',
            '769'
          ]
        ];

Note that I modified your regex, but it looks like that's what you're going for judging by your comment...

Answer 4:

With your example data, it is easy to write

'A123' =~ /^([ABC])(\d+)$/;

after which $1 will contain the prefix and $2 the suffix.

I cannot tell whether this is relevant to your real data, but to use an additional module seems like overkill.

Answer 5:

Another thing you can do in Perl is embed Perl code directly in your regex using "(?{...})", so you can set a variable that tells you which part of the regex matched. WARNING: the pattern should not interpolate plain string variables (outside of the embedded Perl code). If it does, Perl may refuse to run the code blocks and die with an "Eval-group not allowed at runtime, use re 'eval'" error. Interpolating a precompiled qr// object, as the sample does, is fine. Here is a sample parser that uses this feature:

my $kind;
my $REGEX  = qr/
          [A-Za-z][\w]*                        (?{$kind = 'IDENT';})
        | (?: ==? | != | <=? | >=? )           (?{$kind = 'OP';})
        | -?\d+                                (?{$kind = 'INT';})
        | \x27 ( (?:[^\x27] | \x27{2})* ) \x27 (?{$kind = 'STRING';})
        | \S                                   (?{$kind = 'OTHER';})
        /xs;

my $line = "if (x == 'that') then x = -23 and y = 'say ''hi'' for me';";
my @tokens;
while ($line =~ /( $REGEX )/xsg) {
    my($match, $str) = ($1,$2);
    if ($kind eq 'STRING') {
        $str =~ s/\x27\x27/\x27/g;
        push(@tokens, ['STRING', $str]);
        }
    else {
        push(@tokens, [$kind, $match]);
        }
    }
foreach my $lItems (@tokens) {
    print("$lItems->[0]: $lItems->[1]\n");
    }

which prints out the following:

IDENT: if
OTHER: (
IDENT: x
OP: ==
STRING: that
OTHER: )
IDENT: then
IDENT: x
OP: =
INT: -23
IDENT: and
IDENT: y
OP: =
STRING: say 'hi' for me
OTHER: ;

It's kind of contrived, but you'll notice that the quotes (actually, apostrophes) around strings are stripped off (also, consecutive quotes are collapsed to single quotes), so in general, only the $kind variable will tell you whether the parser saw an identifier or a quoted string.
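
For comparison, here is a sketch of the same idea without embedded code, assuming Perl 5.10 or later for named captures. Each alternative gets its own named group, and since %+ only lists groups that took part in the match, its single key names the token kind. The names $TOKEN and tokenize are hypothetical, and unlike the parser above this version leaves the quotes on strings.

```perl
use strict;
use warnings;

# One named group per token kind; only one of them can match at a time.
my $TOKEN = qr/
      (?<IDENT>  [A-Za-z]\w* )
    | (?<OP>     ==? | != | <=? | >=? )
    | (?<INT>    -?\d+ )
    | (?<STRING> \x27 (?: [^\x27] | \x27{2} )* \x27 )
    | (?<OTHER>  \S )
/x;

# Hypothetical helper: returns a list of [kind, text] pairs.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while ($line =~ /$TOKEN/g) {
        my ($kind) = keys %+;    # the one named group that matched
        push @tokens, [$kind, $+{$kind}];
    }
    return @tokens;
}

print "$_->[0]: $_->[1]\n" for tokenize("if (x == 'that') then x = -23;");
```

The trade-off is that every alternative needs a wrapping group, whereas the "(?{...})" version can record arbitrary state as it goes.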