python regex match more than once per index of sea

2019-05-23 08:26发布

问题:

I'm looking for a way to make the finditer function of the python re module or the newer regex module to match all possible variations of a particular pattern, overlapping or otherwise. I am aware of using lookaheads to get matches without consuming the search string, but I still only get one regex per index, where I could get more than one.

The regex I am using is something like this:

(?=A{2}[BA]{1,6}A{2})

so in the string:

AABAABBAA

it should be able to match:

AABAA AABAABBAA AABBAA

but currently it will only match the last two of these. I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?

回答1:

I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?

The problem is twofold.

1. Regex engines will only match once at a character position.
2. There is not a regex construct of between lazy and greedy  
   it's either one or the other.  

Skipping problem 1. for the moment..,

Problem 2:
There could be a case where there is {1,6} 1,2,3,4,5 or 6 matches
of a construct (character) at a given position.

To solve that problem, you'd have to specify independent {1},{2},{3},{4},{5},{6}
as optional alternations at that position.
Clearly a range {1,6} is not going to work.

As far as a Range is concerned, it can be specified to find the
minimum amount by adding the lazy modifier as such {1,6}?
But this will only find the smallest amount it can, no more, no less.

Finally,

Problem 1:
When a regex engine matches, it always advances the current position forward
an amount equal to the length of the last match.
In the case of a matched zero-length assertion, it artificially increases
the position one character forward.


So, given these two problems, one can use these strengths/weaknesses to come
up with a workaround, and have to live with some side affects.

Workarounds:
Put all the possible alternatives at a position as assertions to be analyzed. Each match at a position, will contain a list of groups that hold a variation.
So, if you've matched 3 variations out of 6 possible variant groups, the groups with values will be the variants.

If none of the groups have values, no variants were found at that position.
No variants can happen because all of the assertions are optional.
To avoid unnecessarily matching at these specific positions, a final
conditional can be used to not report these. (i.e., (?(1)|(?(2)|(?!))) etc..).


Lets use your range example as an example.
We will use the conditional at the end to verify a group matched,
but it could be done without it.
_Note that using this range example caused an overlap with identical
values in the final match. This does not insure unique matches at
a position (the example following this shows how to avoid this).

 # (?=(A{2}[BA]{1,6}?A{2}))?(?=(A{2}[BA]{1,6}A{2}))?(?(1)|(?(2)|(?!)))

 (?=
      (                             # (1 start)
           A{2}
           [BA]{1,6}? 
           A{2} 
      )                             # (1 end)
 )?
 (?=
      (                             # (2 start)
           A{2}
           [BA]{1,6} 
           A{2} 
      )                             # (2 end)
 )?
 (?(1)
   |  (?(2)
        |  (?!)
      )
 )

Output:

 **  Grp 1 -  ( pos 0 , len 5 ) 
AABAA  
 **  Grp 2 -  ( pos 0 , len 9 ) 
AABAABBAA  

-------------

 **  Grp 1 -  ( pos 3 , len 6 ) 
AABBAA  
 **  Grp 2 -  ( pos 3 , len 6 ) 
AABBAA  

Same, but without the range problem.
Here, we explicitly define unique constructs.
Note the unique values at each position.

 # (?=(A{2}[BA]{1}A{2}))?(?=(A{2}[BA]{2}A{2}))?(?=(A{2}[BA]{3}A{2}))?(?=(A{2}[BA]{4}A{2}))?(?=(A{2}[BA]{5}A{2}))?(?=(A{2}[BA]{6}A{2}))?(?(1)|(?(2)|(?(3)|(?(4)|(?(5)|(?(6)|(?!)))))))

 (?=
      (                             # (1 start)
           A{2}
           [BA]{1} 
           A{2} 
      )                             # (1 end)
 )?
 (?=
      (                             # (2 start)
           A{2}
           [BA]{2} 
           A{2} 
      )                             # (2 end)
 )?
 (?=
      (                             # (3 start)
           A{2}
           [BA]{3} 
           A{2} 
      )                             # (3 end)
 )?
 (?=
      (                             # (4 start)
           A{2}
           [BA]{4} 
           A{2} 
      )                             # (4 end)
 )?
 (?=
      (                             # (5 start)
           A{2}
           [BA]{5} 
           A{2} 
      )                             # (5 end)
 )?
 (?=
      (                             # (6 start)
           A{2}
           [BA]{6} 
           A{2} 
      )                             # (6 end)
 )?

 (?(1)|(?(2)|(?(3)|(?(4)|(?(5)|(?(6)|(?!)))))))

Output:

 **  Grp 1 -  ( pos 0 , len 5 ) 
AABAA  
 **  Grp 2 -  NULL 
 **  Grp 3 -  NULL 
 **  Grp 4 -  NULL 
 **  Grp 5 -  ( pos 0 , len 9 ) 
AABAABBAA  
 **  Grp 6 -  NULL 

------------------

 **  Grp 1 -  NULL 
 **  Grp 2 -  ( pos 3 , len 6 ) 
AABBAA  
 **  Grp 3 -  NULL 
 **  Grp 4 -  NULL 
 **  Grp 5 -  NULL 
 **  Grp 6 -  NULL 

Finally, all you need to do is on each match, grab the capture groups
with values, and put them into an array.



回答2:

You won't be able to use finditer() for this, or any of the normal means of navigating matches. The only way I've managed it (in Java, not Python) is by manually iterating over every unique substring of the source string (determined by start position and length), applying the regex to each one in the normal way (not with lookarounds), and anchored at both ends.

Java's Matcher class provides a convenient way to do that with its region() method. It lets you pretend that any substring of the source is really the whole thing, without having to generate gobs of new String objects. And the matches() method automatically anchors the match at both ends, so you don't have to tinker with the regex at all.

I'm pretty sure Python's re module doesn't support such a feature, but there might be something in another module that you could use to similar effect (I'm not fluent in Python, so that's just optimism). I'm even less familiar with the regex module, but it seems to support all the sweetest features from every other flavor, so it's worth a look.

Here's my Java solution, in case you're interested:

public static void printAllMatches(String text, String regex)
{
  System.out.printf("%s results:%n", text);
  Matcher m = Pattern.compile(regex).matcher(text);
  int end = text.length();
  for (int i = 0; i < end; ++i)
  {
    for (int j = i + 1; j <= end; ++j) 
    {
      m.region(i, j);

      if (m.matches()) 
      {
        for (int k = 0; k < i; k++)
        {
          System.out.print(" ");
        }
        System.out.println(m.group());
      }   
    }   
  }   
}

The output:

AABAABBAA results:
AABAA
AABAABBAA
   AABBAA


回答3:

This regular expression matches correctly 3 strings :

AABAA AABAABBAA AABBAA

((^(AA){1}(BAA)$){1})|(((AA){1}(BB){1}(AA){1}$){1})

If you want search these substrings in string "AABAABBAA" with python you can use method "search" :

re.search('AABAABB','AABAABBAA')
re.search('AABAABBAA','AABAABBAA')
re.search('AABBAA','AABAABBAA')