I'm looking for a way to make the finditer function of the python re module or the newer regex module to match all possible variations of a particular pattern, overlapping or otherwise. I am aware of using lookaheads to get matches without consuming the search string, but I still only get one regex per index, where I could get more than one.
The regex I am using is something like this:
(?=A{2}[BA]{1,6}A{2})
so in the string:
AABAABBAA
it should be able to match:
AABAA
AABAABBAA
AABBAA
but currently it will only match the last two of these. I realise it is to do with the greediness of the [BA]{1,6}
. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?
I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?
The problem is twofold.
Skipping problem 1. for the moment..,
Problem 2:
There could be a case where there is
{1,6}
1,2,3,4,5 or 6 matchesof a construct (character) at a given position.
To solve that problem, you'd have to specify independent {1},{2},{3},{4},{5},{6}
as optional alternations at that position.
Clearly a range
{1,6}
is not going to work.As far as a Range is concerned, it can be specified to find the
minimum amount by adding the lazy modifier as such
{1,6}?
But this will only find the smallest amount it can, no more, no less.
Finally,
Problem 1:
When a regex engine matches, it always advances the current position forward
an amount equal to the length of the last match.
In the case of a matched zero-length assertion, it artificially increases
the position one character forward.
So, given these two problems, one can use these strengths/weaknesses to come
up with a workaround, and have to live with some side affects.
Workarounds:
Put all the possible alternatives at a position as assertions to be analyzed. Each match at a position, will contain a list of groups that hold a variation.
So, if you've matched 3 variations out of 6 possible variant groups, the groups with values will be the variants.
If none of the groups have values, no variants were found at that position.
No variants can happen because all of the assertions are optional.
To avoid unnecessarily matching at these specific positions, a final
conditional can be used to not report these. (i.e.,
(?(1)|(?(2)|(?!)))
etc..).Lets use your range example as an example.
We will use the conditional at the end to verify a group matched,
but it could be done without it.
_Note that using this range example caused an overlap with identical
values in the final match. This does not insure unique matches at
a position (the example following this shows how to avoid this).
Output:
Same, but without the range problem.
Here, we explicitly define unique constructs.
Note the unique values at each position.
Output:
Finally, all you need to do is on each match, grab the capture groups
with values, and put them into an array.
This regular expression matches correctly 3 strings :
If you want search these substrings in string "AABAABBAA" with python you can use method "search" :
You won't be able to use
finditer()
for this, or any of the normal means of navigating matches. The only way I've managed it (in Java, not Python) is by manually iterating over every unique substring of the source string (determined by start position and length), applying the regex to each one in the normal way (not with lookarounds), and anchored at both ends.Java's Matcher class provides a convenient way to do that with its
region()
method. It lets you pretend that any substring of the source is really the whole thing, without having to generate gobs of new String objects. And thematches()
method automatically anchors the match at both ends, so you don't have to tinker with the regex at all.I'm pretty sure Python's
re
module doesn't support such a feature, but there might be something in another module that you could use to similar effect (I'm not fluent in Python, so that's just optimism). I'm even less familiar with theregex
module, but it seems to support all the sweetest features from every other flavor, so it's worth a look.Here's my Java solution, in case you're interested:
The output: