I have a list of regular expressions (about 10 to 15) that I need to match against some text. Matching them one by one in a loop is too slow. Instead of writing my own state machine to match all the regexes at once, I am trying to join the individual regexes with | and let Perl do the work. The problem is: how do I know which of the alternatives matched?
This question addresses the case where there are no capturing groups inside each individual regex (which portion is matched by regex?). What if there are capturing groups inside each regex?
So with the following,
/^(A(\d+))|(B(\d+))|(C(\d+))$/
and the string "A123", how can I both know that A123 matched and extract "123"?
You don't need to code up your own state machine to combine regexes. Look into Regexp::Assemble. It has methods that track which of your initial patterns matched.
Edit:
use strict;
use warnings;
use 5.012;
use Regexp::Assemble;
my $string = 'A123';
my $re = Regexp::Assemble->new(track => 1);
for my $pattern (qw/ A(\d+) B(\d+) C(\d+) /) {
    $re->add($pattern);
}
say $re->re; ### (?-xism:(?:A(\d+)(?{0})|B(\d+)(?{2})|C(\d+)(?{1})))
say for $re->match($string); ### A(\d+)
say for $re->capture; ### 123
Why not use /^ (?<prefix> A|B|C) (?<digits> \d+) $/x? Note that the named capture groups are used for clarity and are not essential.
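As a rough illustration of the named-group suggestion, the sketch below reads the captures back through the %+ hash (available from Perl 5.10). The helper name split_code and the sample inputs are made up for this example, and it assumes every pattern really has the prefix-plus-digits shape.

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

# Hypothetical helper: returns (prefix, digits), or an empty list on failure.
sub split_code {
    my ($text) = @_;
    return unless $text =~ /^ (?<prefix> A|B|C) (?<digits> \d+) $/x;
    return ($+{prefix}, $+{digits});
}

for my $text (qw(A123 B456 Z9)) {
    if (my ($prefix, $digits) = split_code($text)) {
        print "$text: prefix=$prefix digits=$digits\n";
    }
    else {
        print "$text: no match\n";
    }
}
```

The prefix tells you which alternative matched, so no per-alternative bookkeeping is needed when the patterns share this structure.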
A123 will be in capture group $1 and 123 will be in group $2.
So you could say:
if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ && $1 eq 'A123' && $2 eq '123' ) {
    ...
}
This is redundant, but you get the idea...
EDIT: No, you don't have to enumerate each sub-match. You asked how to know whether A123 matched and how to extract 123:
- You won't enter the if block unless A123 matched
- and you can extract 123 using the $2 backreference.
So maybe this example would have been more clear:
if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ ) {
    # do something with $2, which will be '123' assuming $_ matches /^A123/
}
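For the B and C alternatives the captures land in other groups ($3/$4 and $5/$6), and groups belonging to alternatives that did not take part in the match are left undefined. The sketch below uses that to report which alternative matched. It makes two assumptions that are not in the original: the anchors are wrapped around a non-capturing group so that ^ and $ apply to every alternative, and the names which_branch and @branches are invented for the example.

```perl
use strict;
use warnings;

# Assumption: ^ and $ are meant to apply to all three alternatives,
# so the alternation is wrapped in (?: ... ).
my $re = qr/^(?:(A(\d+))|(B(\d+))|(C(\d+)))$/;

# Hypothetical table: label => number of the group that wraps that alternative.
my @branches = ( [ A => 1 ], [ B => 3 ], [ C => 5 ] );

# Returns (label, whole match, digits), or an empty list if nothing matched.
sub which_branch {
    my ($text) = @_;
    my @groups = $text =~ $re or return;    # @groups holds $1 .. $6
    for my $branch (@branches) {
        my ( $label, $n ) = @$branch;
        next unless defined $groups[ $n - 1 ];    # this alternative did not match
        return ( $label, $groups[ $n - 1 ], $groups[$n] );
    }
    return;
}

for my $text (qw(A123 B456 X999)) {
    if ( my ( $label, $whole, $digits ) = which_branch($text) ) {
        print "$text: branch $label, whole '$whole', digits '$digits'\n";
    }
    else {
        print "$text: no match\n";
    }
}
```

Counting group numbers by hand gets fragile as the list of patterns grows, which is one reason to prefer a module or named groups for longer lists.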
EDIT 2:
To capture matches in an array of arrays (AoA), which is a different question, this should do it:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
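# Return an empty list for non-matching lines so they are skipped
# instead of leaving a false value in @matches.
my @matches = map { /^(?:(A|B|C)(\d+))$/ ? [$1, $2] : () } <DATA>;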
print Dumper \@matches;
__DATA__
A123
B456
C769
Result:
$VAR1 = [
          [
            'A',
            '123'
          ],
          [
            'B',
            '456'
          ],
          [
            'C',
            '769'
          ]
        ];
Note that I modified your regex, but it looks like that's what you're going for judging by your comment...
With your example data, it is easy to write
'A123' =~ /^([ABC])(\d+)$/;
after which $1 will contain the prefix and $2 the suffix.
I cannot tell whether this is relevant to your real data, but to use an additional module seems like overkill.
Another thing you can do in Perl is to embed Perl code directly in your regex using "(?{...})". That lets you set a variable that tells you which part of the regex matched. WARNING: your regex should not contain any variables (outside of the embedded Perl code) that will be interpolated into the regex, or you will get errors. Here is a sample parser that uses this feature:
my $kind;
my $REGEX = qr/
      [A-Za-z][\w]*                        (?{$kind = 'IDENT';})
    | (?: ==? | != | <=? | >=? )           (?{$kind = 'OP';})
    | -?\d+                                (?{$kind = 'INT';})
    | \x27 ( (?:[^\x27] | \x27{2})* ) \x27 (?{$kind = 'STRING';})
    | \S                                   (?{$kind = 'OTHER';})
/xs;
my $line = "if (x == 'that') then x = -23 and y = 'say ''hi'' for me';";
my @tokens;
while ($line =~ /( $REGEX )/xsg) {
    my($match, $str) = ($1,$2);
    if ($kind eq 'STRING') {
        $str =~ s/\x27\x27/\x27/g;
        push(@tokens, ['STRING', $str]);
    }
    else {
        push(@tokens, [$kind, $match]);
    }
}
foreach my $lItems (@tokens) {
    print("$lItems->[0]: $lItems->[1]\n");
}
which prints out the following:
IDENT: if
OTHER: (
IDENT: x
OP: ==
STRING: that
OTHER: )
IDENT: then
IDENT: x
OP: =
INT: -23
IDENT: and
IDENT: y
OP: =
STRING: say 'hi' for me
OTHER: ;
It's kind of contrived, but you'll notice that the quotes (actually, apostrophes) around strings are stripped off (also, consecutive quotes are collapsed to single quotes), so in general, only the $kind variable will tell you whether the parser saw an identifier or a quoted string.
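Applied to the A/B/C patterns from the question, the same embedded-code idea might look like the sketch below. The names $which and classify are invented for the example, and the anchors are assumed to apply to the whole alternation. Because a code block runs as soon as the engine reaches it, the variable can be set by an attempt that later fails, so the sketch resets it before each match and reads it only after the match succeeds.

```perl
use strict;
use warnings;

our $which;    # package variable set from inside the pattern

my $re = qr/^(?:
      A(\d+) (?{ $which = 'A' })
    | B(\d+) (?{ $which = 'B' })
    | C(\d+) (?{ $which = 'C' })
)$/x;

# Hypothetical helper: returns (alternative, digits), or an empty list.
sub classify {
    my ($text) = @_;
    local $which;                       # discard anything left by an earlier attempt
    return unless $text =~ $re;
    my ($digits) = grep { defined } ( $1, $2, $3 );    # only one group took part
    return ( $which, $digits );
}

for my $text (qw(A123 B456 C769 A12x)) {
    if ( my ( $alt, $digits ) = classify($text) ) {
        print "$text: alternative $alt, digits $digits\n";
    }
    else {
        print "$text: no match\n";
    }
}
```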