How do I tokenise a word given tokens that are sub

Posted 2019-06-28 03:39

I understand how to use regex in Perl in the following way:

$str =~ s/expression/replacement/g;

I understand that if any part of the expression is enclosed in parentheses, it is captured and can be used in the replacement part, like this:

$str =~ s/(a)/($1)dosomething/;

But is there a way to capture the ($1) above outside of the regex expression?

I have a full word which is a string of consonants, e.g. bEdmA, its vowelized version baEodamaA (where a and o are vowels), as well as its split-up form of two tokens, separated by a space: bEd mA. I want to pick out the vowelized form of each token from the full word, like so: baEoda, maA. I'm trying to capture the token within the full word expression, so I have:

my $unvowelizedword = "bEdmA";
my @tokens          = ("bEd", "mA");
my $vowelizedword   = "baEodamaA";

foreach my $t (@tokens) {
    # find the token within the full word, and capture its vowels
}

I'm trying to do something like this:

$vowelizedword =~ m/($t)/;

This is completely wrong for two reasons: the token $t is not present in exactly its own form, such as bEd, but something like m/b.E.d/ would be more relevant. Also, how do I capture it in a variable outside the regular expression?

The real question is: how can I capture the vowelized sequences baEoda and maA, given the tokens bEd and mA, from the full word baEodamaA?


Edit

I realized from all the answers that I missed out two important details.

  1. Vowels are optional. So if the tokens are : "Al" and "ywm", and the fully vowelized word is "Alyawmi", then the output tokens would be "Al" and "yawmi".
  2. I only mentioned two vowels, but there are more, including symbols made up of two characters, like '~a'. The full list (although I don't think I need to mention it here) is:

    @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');

Tags: regex perl token
5 answers
在下西门庆
#2 · 2019-06-28 04:18

Use the m// operator in so-called "list context", like this:

my @tokens = ($input =~ m/capturing_regex_here/modifiershere);
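
As a minimal sketch of what list context gives you (the input string and both patterns here are illustrative assumptions, not taken from the question's real data), the capture groups come back as a list that can be assigned to ordinary variables:

```perl
use strict;
use warnings;

# Hypothetical input and patterns, chosen only to illustrate list context.
my $input = "baEodamaA";

# In list context, m// returns the contents of the capture groups.
my ($first, $second) = $input =~ m/^(b.E.d.)(m.A)$/;
print "$first $second\n";    # prints "baEoda maA"

# With /g and no capture groups, list context returns every whole match.
my @vowels_found = $input =~ m/[ao]/g;
print "@vowels_found\n";     # prints "a o a a"
```

If the match fails, the list is empty, so the assigned variables stay undefined.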

chillily
#3 · 2019-06-28 04:27

ETA: From what I understand now, what you were trying to say is that you want to match an optional vowel after each character of the tokens.

With this, you can tweak the $vowels variable to only contain the letters you seek. Optionally, you may also just use . to capture any character.

use strict;
use warnings;
use Data::Dumper;

my @tokens = ("bEd", "mA");
my $full = "baEodamA";

my $vowels = "[aeiouy]";
my @matches;
for my $rx (@tokens) {
    $rx =~ s/.\K/$vowels?/g;
    if ($full =~ /$rx/) {
        push @matches, $full =~ /$rx/g;
    }
}

print Dumper \@matches;

Output:

$VAR1 = [
          'baEoda',
          'mA'
        ];

Note that

... $full =~ /$rx/g;

does not require capturing groups in the regex.
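
For the full vowel list from the question's edit, which includes two-character symbols such as '~a', a character class is not enough. A possible sketch, assuming an alternation sorted longest-first is acceptable, could look like this (the helper names `vowelized_pattern` and `vowelized_forms` are made up for illustration, and each token is searched for independently, as in the code above):

```perl
use strict;
use warnings;

# Vowel list copied from the question's edit.
my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');

# Longest alternatives first, so '~a' is tried before '~'.
my $vowel_alt = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } @vowels;

# Hypothetical helper: one optional vowel after each character of the token.
sub vowelized_pattern {
    my ($token) = @_;
    my $body = join '', map { quotemeta($_) . "(?:$vowel_alt)?" } split //, $token;
    return qr/$body/;
}

# Hypothetical helper: the vowelized form of each token found in $word.
sub vowelized_forms {
    my ($word, @tokens) = @_;
    my @found;
    for my $token (@tokens) {
        my $rx = vowelized_pattern($token);
        push @found, $1 if $word =~ /($rx)/;
    }
    return @found;
}

print join(' ', vowelized_forms('Alyawmi', 'Al', 'ywm')), "\n";    # prints "Al yawmi"
```

Because the tokens are matched independently, this does not check that they appear in order or that they cover the whole word.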

叼着烟拽天下
#4 · 2019-06-28 04:27

Assuming the tokens need to appear in order and without anything (other than a vowel) between them:

my @tokens = ( "bEd", "mA" );
my $vowelizedword = "baEodamaA";

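my $vowels = '[ao]';
# One capture group per token: a vowel between each pair of consonants,
# plus an optional trailing vowel.
my $pattern = '^'
    . join( '', map "(" . join( $vowels, split( //, $_ ) ) . "(?:$vowels)?)", @tokens )
    . '\\z';
my (@vowelized_sequences) = $vowelizedword =~ $pattern;
print "$_\n" for @vowelized_sequences;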
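
The question's edit says vowels are optional. A variant of the same single anchored pattern for that case might look like the sketch below; the character class `[aiuo]` is an assumption standing in for the full vowel list, and the data is the "Alyawmi" example from the question:

```perl
use strict;
use warnings;

my @tokens        = ('Al', 'ywm');
my $vowelizedword = 'Alyawmi';

# Assumption: a simple character class stands in for the full vowel list.
my $vowel = '[aiuo]';

# One capture group per token; every character may be followed by one vowel.
my $pattern = join '',
    map { '(' . join('', map { quotemeta($_) . "$vowel?" } split //, $_) . ')' } @tokens;

# The anchors force the tokens to cover the whole word, in order.
my @vowelized_sequences = $vowelizedword =~ /^$pattern\z/;
print "$_\n" for @vowelized_sequences;    # prints "Al", then "yawmi"
```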
Lonely孤独者°
#5 · 2019-06-28 04:29

The following seems to do what you want:

#!/usr/bin/env perl
use warnings;
use strict;

my @tokens = ('bEd', 'mA');
my $vowelizedword = "beEodamaA";

my @regex = map { join('.?', split //) . '.?' } @tokens;

my $regex = join('|', @regex);
$regex = qr/($regex)/;

while (my ($matched) = $vowelizedword =~ $regex) {
    $vowelizedword =~ s{$regex}{};
    print "matched $matched\n";
}

Update as per your updated question (vowels are optional). It works from the end of the string so you'll have to gather the tokens into an array and print them in reverse:

#!/usr/bin/env perl
use warnings;
use strict;

my @tokens = ('bEd', 'mA', 'Al', 'ywm');
my $vowelizedword = "beEodamaA Alyawmi"; # Caveat: Without the space it won't work.

my @regex = map { join('.?', split //) . '.?$' } @tokens;

my $regex = join('|', @regex);
$regex = qr/($regex)/;

while (my ($matched) = $vowelizedword =~ $regex) {
    $vowelizedword =~ s{$regex}{};
    print "matched $matched\n";
}
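
Since the matches come out last-to-first, one way to gather them in the original order is to `unshift` each one onto an array. This is only a sketch built on the same pattern construction, using the "Alyawmi" example from the question; `@found` is a name introduced here for illustration:

```perl
use strict;
use warnings;

my @tokens        = ('Al', 'ywm');
my $vowelizedword = 'Alyawmi';

# Same construction as above: each alternative is anchored at the end.
my $regex = join '|', map { join('.?', split //) . '.?$' } @tokens;
$regex = qr/($regex)/;

my @found;
while (my ($matched) = $vowelizedword =~ $regex) {
    $vowelizedword =~ s{$regex}{};
    unshift @found, $matched;    # matches arrive last-to-first
}

print "@found\n";    # prints "Al yawmi"
```

Note that `.?` accepts any character, not just a vowel, so this stays as permissive as the code it is based on.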
放荡不羁爱自由
#6 · 2019-06-28 04:40

I suspect that there is an easier way to do whatever you're trying to accomplish. The trick is not to make the regex generation code so tricky that you forget what it's actually doing.

I can only begin to guess at your task, but from your single example, it looks like you want to check that the two subtokens are in the larger token, ignoring certain characters. I'm going to guess that those sub tokens have to be in order and can't have anything else between them besides those vowel characters.

To match the tokens, I can use the \G anchor with the /g global flag in scalar context. This anchors the match to the character one after the end of the last match for the same scalar. This way allows me to have separate patterns for each sub token. This is much easier to manage since I only need to change the list of values in @subtokens.

Once I have gone through each candidate token and found which ones match all the patterns, I can report the original string together with the fragments it matched.

use v5.14;

my $vowels    = '[ao]*';
my @subtokens = qw(bEd mA);

# prepare the subtoken regular expressions
my @patterns = map {
    my $s = join "$vowels", map quotemeta, (split( // ), '');
    qr/$s/;
    } @subtokens;

my @tokens = qw( baEodamA mAabaEod baEoda mAbaEoda );

my @grand_matches;
TOKEN: foreach my $token ( @tokens ) {
    say "-------\nMatching $token..........";
    my @matches;
    PATTERN: foreach my $pattern ( @patterns ) {
        say "Position is ", pos($token) // 0;

        # scalar context /g and \G
        next TOKEN unless $token =~ /\G($pattern)/g; 
        push @matches, $1;
        say "Matched with $pattern";
        }
    push @grand_matches, [ $token, \@matches ];
    }

# Now report the original   
foreach my $tuple ( @grand_matches ) {
    say "$tuple->[0] has both fragments: @{$tuple->[1]}";
    }

Now, here's the nice thing about this structure. I've probably guessed wrong about your task. If I have, it's easy to fix without changing the setup. Let's say that the subtokens don't have to be in order. That's an easy change to the pattern I created. I just get rid of the \G anchor and the /g flag:

        next TOKEN unless $token =~ /($pattern)/; 

Or, suppose that the tokens have to be in order, but other things may be between them. I can insert a .*? to match that stuff, effectively skipping over it:

        next TOKEN unless $token =~ /\G.*?($pattern)/g; 

It would be much nicer if I could manage all of this from the map where I create the patterns, but the /g flag isn't a pattern flag. It has to go with the operator.

I find it much easier to manage changing requirements when I don't wrap everything in a single regular expression.
