How can I tell which of the alternatives matched i

I have a list of regular expressions (about 10 to 15) that I need to match against some text. Matching them one by one in a loop is too slow. Instead of writing my own state machine to match all the regexes at once, I am trying to join the individual regexes with | and let Perl do the work. The problem is: how do I know which of the alternatives matched?

The question "which portion is matched by regex?" addresses the case where there are no capturing groups inside each individual regex. What if there are capturing groups inside each regex?

So with the following,

/^(A(\d+))|(B(\d+))|(C(\d+))$/

and the string "A123", how can I both know that A123 matched and extract "123"?

Tags: regex perl capture regex-group

5 Answers

SAY GOODBYE

#2 · 2019-03-31 07:24

A123 will be in capture group $1 and 123 will be in capture group $2.

So you could say:

if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ && $1 eq 'A123' && $2 eq '123' ) {
    ...
}

This is redundant, but you get the idea...

EDIT: No, you don't have to enumerate each sub match, you asked how to know whether A123 matched and how to extract 123:

You won't enter the if block unless A123 matched
and you can extract 123 using the $2 backreference.

So maybe this example would have been more clear:

if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ ) {
    # do something with $2, which will be '123' assuming $_ matches /^A123/
}
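
One caveat with the pattern above: | binds more loosely than ^ and $, so the anchors apply only to the first and last alternatives. The following is a minimal sketch, not the answerer's code. It assumes the alternation is wrapped in (?: ... ), and the helper name which_alternative is hypothetical. It identifies the branch by testing which outer group is defined:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (label, digits) for the alternative that
# matched, or an empty list when nothing matched.
sub which_alternative {
    my ($text) = @_;
    # (?: ... ) makes ^ and $ apply to every alternative.
    if ($text =~ /^(?:(A(\d+))|(B(\d+))|(C(\d+)))$/) {
        return ('A', $2) if defined $1;   # groups 1-2 belong to the A branch
        return ('B', $4) if defined $3;   # groups 3-4 belong to the B branch
        return ('C', $6) if defined $5;   # groups 5-6 belong to the C branch
    }
    return;
}

for my $text (qw(A123 B456 X789)) {
    my ($label, $digits) = which_alternative($text);
    print defined $label ? "$text: $label -> $digits\n" : "$text: no match\n";
}
```

Groups that belong to a branch that did not participate in the match are left undefined, which is what makes the defined checks work.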

EDIT 2:

To capture matches in an array of arrays (AoA), which is a different question, this should do it:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Return an empty list for non-matching lines so map adds no stray false values.
my @matches = map { /^(?:(A|B|C)(\d+))$/ ? [$1, $2] : () } <DATA>;
print Dumper \@matches;

__DATA__
A123
B456
C769

Result: @matches holds three pairs, ['A', '123'], ['B', '456'], and ['C', '769'].

Note that I modified your regex, but it looks like that's what you're going for judging by your comment...


老娘就宠你

#3 · 2019-03-31 07:28

You don't need to code up your own state machine to combine regexes. Look into Regexp::Assemble. It has methods that track which of your initial patterns matched.

Edit:

use strict;
use warnings;

use 5.012;

use Regexp::Assemble;

my $string = 'A123';

my $re = Regexp::Assemble->new(track => 1);
for my $pattern (qw/ A(\d+) B(\d+) C(\d+) /) {
  $re->add($pattern);
}

say $re->re; ### (?-xism:(?:A(\d+)(?{0})|B(\d+)(?{2})|C(\d+)(?{1})))
say for $re->match($string); ### A(\d+)
say for $re->capture; ### 123


Viruses.

#4 · 2019-03-31 07:28

With your example data, it is easy to write

'A123' =~ /^([ABC])(\d+)$/;

after which $1 will contain the prefix and $2 the suffix.

I cannot tell whether this is relevant to your real data, but to use an additional module seems like overkill.
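
If each prefix needs different handling, one way to build on the single character-class regex is a dispatch table keyed on $1. This is only a sketch: the %handler_for table, its handler bodies, and the describe helper are hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical handlers, one per prefix; the bodies are illustrative only.
my %handler_for = (
    A => sub { "alpha record $_[0]" },
    B => sub { "beta record $_[0]" },
    C => sub { "gamma record $_[0]" },
);

# Hypothetical helper: splits the text into prefix and digits, then
# calls the handler registered for that prefix.
sub describe {
    my ($text) = @_;
    return unless $text =~ /^([ABC])(\d+)$/;
    my ($prefix, $digits) = ($1, $2);   # copy before any other regex runs
    return $handler_for{$prefix}->($digits);
}

print describe('A123'), "\n";   # alpha record 123
```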


Emotional °昔

#5 · 2019-03-31 07:35

Another thing you can do in Perl is to embed Perl code directly in your regex using "(?{...})". So, you can set a variable that tells you which part of the regex matched. WARNING: your regex should not contain any variables (outside of the embedded Perl code) that will be interpolated into the regex at run time, or you will get errors unless you enable use re 'eval'. Here is a sample parser that uses this feature:

my $kind;
my $REGEX  = qr/
          [A-Za-z][\w]*                        (?{$kind = 'IDENT';})
        | (?: ==? | != | <=? | >=? )           (?{$kind = 'OP';})
        | -?\d+                                (?{$kind = 'INT';})
        | \x27 ( (?:[^\x27] | \x27{2})* ) \x27 (?{$kind = 'STRING';})
        | \S                                   (?{$kind = 'OTHER';})
        /xs;

my $line = "if (x == 'that') then x = -23 and y = 'say ''hi'' for me';";
my @tokens;
while ($line =~ /( $REGEX )/xsg) {
    my($match, $str) = ($1,$2);
    if ($kind eq 'STRING') {
        $str =~ s/\x27\x27/\x27/g;
        push(@tokens, ['STRING', $str]);
        }
    else {
        push(@tokens, [$kind, $match]);
        }
    }
foreach my $lItems (@tokens) {
    print("$lItems->[0]: $lItems->[1]\n");
    }

which prints out the following:

IDENT: if
OTHER: (
IDENT: x
OP: ==
STRING: that
OTHER: )
IDENT: then
IDENT: x
OP: =
INT: -23
IDENT: and
IDENT: y
OP: =
STRING: say 'hi' for me
OTHER: ;

It's kind of contrived, but you'll notice that the quotes (actually, apostrophes) around strings are stripped off (also, consecutive quotes are collapsed to single quotes), so in general, only the $kind variable will tell you whether the parser saw an identifier or a quoted string.
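
Applied to the A/B/C example from the question, the same idea might look like the sketch below. It makes two assumptions that go beyond the answer: a package variable ($which) is used so the embedded code blocks see the same variable on any Perl version, and a branch reset group (?| ... ) keeps the digits in $1 for every alternative. The classify helper is a hypothetical name.

```perl
use strict;
use warnings;

our $which;   # package variable set by the embedded code blocks

# (?| ... ) is a branch reset group, so the digits land in $1
# no matter which alternative matched.
my $abc = qr/
    ^(?| A(\d+) (?{ $which = 'A' })
       | B(\d+) (?{ $which = 'B' })
       | C(\d+) (?{ $which = 'C' })
     )$
/x;

# Hypothetical helper: returns (label, digits), or an empty list on failure.
sub classify {
    my ($text) = @_;
    undef $which;                    # discard any value left by an earlier attempt
    return unless $text =~ $abc;
    return ($which, $1);
}

my ($label, $digits) = classify('B456');
print "$label -> $digits\n";         # B -> 456
```

A code block can run on a path that later fails and backtracks, so $which is only trusted after the whole match has succeeded.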


虎瘦雄心在

#6 · 2019-03-31 07:43

Why not use /^ (?<prefix> A|B|C) (?<digits> \d+) $/x? Note that the named capture groups are used for clarity and are not essential.
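
As a rough sketch of how that pattern could be used, the named captures can be read back from the %+ hash after a successful match (Perl 5.10 or later). The split_record helper is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prefix, digits), or an empty list on failure.
sub split_record {
    my ($text) = @_;
    return unless $text =~ /^ (?<prefix> A|B|C) (?<digits> \d+) $/x;
    # %+ holds the named captures from the last successful match.
    return ($+{prefix}, $+{digits});
}

my ($prefix, $digits) = split_record('A123');
print "prefix=$prefix digits=$digits\n";   # prefix=A digits=123
```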
