I understand how to use regex in Perl in the following way:
$str =~ s/expression/replacement/g;
I understand that if any part of the expression is enclosed in parentheses, it can be used and captured in the replacement part, like this:
$str =~ s/(a)/($1)dosomething/;
But is there a way to capture the ($1) above outside of the regex expression?
I have a full word which is a string of consonants, e.g. bEdmA, its vowelized version baEodamaA (where a and o are vowels), as well as its split-up form of two tokens, separated by a space, bEd mA. I want to just pick up the vowelized form of the tokens from the full word, like so: baEoda, maA. I'm trying to capture the token within the full word expression, so I have:
$unvowelizedword = "bEdmA";
$tokens[0] = "bEd", $tokens[1] = "mA";
$vowelizedword = "baEodamaA";
foreach $t(@tokens) {
#find the token within the full word, and capture its vowels
}
I'm trying to do something like this:
$vowelizedword =~ m/($t)/;
This is completely wrong for two reasons: the token $t is not present in exactly its own form, such as bEd, but something like m/b.E.d/ would be more relevant. Also, how do I capture it in a variable outside the regular expression?
The real question is: how can I capture the vowelized sequences baEoda and maA, given the tokens bEd, mA from the full word baEodamaA?
Edit
I realized from all the answers that I missed out two important details.
- Vowels are optional. So if the tokens are "Al" and "ywm", and the fully vowelized word is "Alyawmi", then the output tokens would be "Al" and "yawmi".
- I only mentioned two vowels, but there are more, including symbols made up of two characters, like '~a'. The full list (although I don't think I need to mention it here) is:
@vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
Use the m// operator in so-called "list context", like this:
my @tokens = ($input =~ m/capturing_regex_here/modifiershere);
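As a minimal sketch of list-context capture, here is the question's example word matched with a hand-written pattern. The pattern is only an illustration: it assumes exactly one vowel at each `.` position, which the real data may not guarantee.

```perl
use strict;
use warnings;

my $vowelized = "baEodamaA";

# In list context, m// returns the captured groups ($1, $2, ...),
# so they can be assigned straight to variables outside the regex.
my ($first, $second) = $vowelized =~ m/^(b.E.d.)(m.A)$/;

print "$first $second\n";
```

If the match fails, the list is empty and both variables stay undefined, so it is worth checking them before use.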
ETA: From what I understand now, what you were trying to say is that you want to match an optional vowel after each character of the tokens.
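One way this could look, as a sketch rather than the original answer's code: build a pattern from each token in which every character may be followed by one optional vowel. The helper name `vowelized_tokens` is hypothetical, and the sketch assumes the tokens together cover the whole word.

```perl
use strict;
use warnings;

# Vowel symbols from the question, longest first so "~a" is tried before "~".
my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
my $vowels = join '|', map { quotemeta($_) } sort { length($b) <=> length($a) } @vowels;

# Hypothetical helper: returns the vowelized form of each token, or an
# empty list if the tokens do not account for the whole word.
sub vowelized_tokens {
    my ($word, @tokens) = @_;
    my $regex = '';
    for my $token (@tokens) {
        # Every token character may be followed by one optional vowel.
        my $piece = join '', map { quotemeta($_) . "(?:$vowels)?" } split //, $token;
        $regex .= "($piece)";
    }
    return $word =~ /\A$regex\z/;
}

my @first  = vowelized_tokens("baEodamaA", "bEd", "mA");
my @second = vowelized_tokens("Alyawmi",   "Al",  "ywm");
print "@first\n";
print "@second\n";
```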
With this, you can tweak the $vowels variable to only contain the letters you seek. Optionally, you may also just use . to capture any character.
Output:
Note that
does not require capturing groups in the regex.
Assuming the tokens need to appear in order and without anything (other than a vowel) between them:
The following seems to do what you want:
Update as per your updated question (vowels are optional). It works from the end of the string so you'll have to gather the tokens into an array and print them in reverse:
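The answer's own code is not shown here, so the following is only a sketch of the idea it describes: strip each vowelized token off the end of the word, collect the pieces, then reverse them. The variable names and the `die` on a failed match are assumptions.

```perl
use strict;
use warnings;

# Vowel symbols from the question, longest first so "~a" is tried before "~".
my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
my $vowels = join '|', map { quotemeta($_) } sort { length($b) <=> length($a) } @vowels;

my $rest   = "Alyawmi";    # consumed from the right as tokens are found
my @tokens = ("Al", "ywm");
my @found;

for my $token (reverse @tokens) {
    my $pattern = join '', map { quotemeta($_) . "(?:$vowels)?" } split //, $token;
    # Strip the vowelized token off the end of what is left of the word.
    if ($rest =~ s/($pattern)\z//) {
        push @found, $1;
    }
    else {
        die "token '$token' not found at the end of '$rest'\n";
    }
}

# Tokens were collected last-to-first, so reverse them for output.
my @in_order = reverse @found;
print "@in_order\n";
```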
I suspect that there is an easier way to do whatever you're trying to accomplish. The trick is not to make the regex generation code so tricky that you forget what it's actually doing.
I can only begin to guess at your task, but from your single example, it looks like you want to check that the two subtokens are in the larger token, ignoring certain characters. I'm going to guess that those sub tokens have to be in order and can't have anything else between them besides those vowel characters.
To match the tokens, I can use the \G anchor with the /g global flag in scalar context. This anchors the match to the character one after the end of the last match for the same scalar. This lets me have separate patterns for each sub token. This is much easier to manage since I only need to change the list of values in @subtokens.
Once I go through each of the pairs and find which ones match all the patterns, I can extract the original string from the pair.
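A minimal sketch of the \G technique, assuming a single word rather than the list of pairs mentioned above, and assuming each subtoken character may be followed by one vowel from the question's list:

```perl
use strict;
use warnings;

# Vowel symbols from the question, longest first so "~a" is tried before "~".
my @vowels = ('a', 'i', 'u', 'o', '~', '~a', '~i', '~u', 'N', 'F', 'K', '~N', '~K');
my $vowels = join '|', map { quotemeta($_) } sort { length($b) <=> length($a) } @vowels;

my @subtokens = ("bEd", "mA");

# One compiled pattern per subtoken; each character may take one vowel.
my @patterns = map {
    my $p = join '', map { quotemeta($_) . "(?:$vowels)?" } split //, $_;
    qr/$p/;
} @subtokens;

my $word = "baEodamaA";
my @found;
for my $pattern (@patterns) {
    # In scalar context, /g records pos($word); \G starts the next match there.
    if ($word =~ /\G($pattern)/g) {
        push @found, $1;
    }
    else {
        @found = ();    # one subtoken failed, so discard partial results
        last;
    }
}
print "@found\n";
```

Because each subtoken has its own pattern, changing the requirements only means changing how `@patterns` is built or how each one is applied.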
Now, here's the nice thing about this structure. I've probably guessed wrong about your task. If I have, it's easy to fix without changing the setup. Let's say that the subtokens don't have to be in order. That's an easy change to the pattern I created. I just get rid of the \G anchor and the /g flag.
Or, suppose that the tokens have to be in order, but other things may be between them. I can insert a .*? to match that stuff, effectively skipping over it.
It would be much nicer if I could manage all of this from the map where I create the patterns, but the /g flag isn't a pattern flag. It has to go with the operator.
I find it much easier to manage changing requirements when I don't wrap everything in a single regular expression.
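The two variations can be sketched as follows. To keep it short, this assumes only the two vowels from the original example (a and o), and the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Assumed simplification: only the vowels "a" and "o", each optional.
my $v        = qr/[ao]?/;
my @patterns = (qr/b${v}E${v}d${v}/, qr/m${v}A${v}/);

# Variation 1: any order. Without \G and /g, every pattern searches
# the whole string from the start.
my $swapped = "maAbaEoda";
my @any_order;
for my $pattern (@patterns) {
    push @any_order, $1 if $swapped =~ /($pattern)/;
}

# Variation 2: in order, with other text allowed in between. The lazy
# .*? after \G skips whatever precedes the next subtoken.
my $gappy = "baEoda--maA";
my @with_gaps;
for my $pattern (@patterns) {
    push @with_gaps, $1 if $gappy =~ /\G.*?($pattern)/g;
}

print "@any_order\n";
print "@with_gaps\n";
```

Note that in both loops the /g flag (or its absence) sits on the match operator, not on the qr// patterns.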