I'm trying to use regular expressions to find a substring in a string of DNA. The substring can contain ambiguous bases, like ATCGR, where R could be A or G. Also, the script must allow x number of mismatches. This is my code:
import regex
s = 'ACTGCTGAGTCGT'
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)
With one mismatch, I would expect matches such as AC**TGC**TGAGTCGT, ACTGC**TGA**GTCGT and ACTGCTGAGT**CGT**. The expected result should look like this:
['TGC', 'TGA', 'AGT', 'CGT']
But the output is:
['TGC', 'TGA']
Even using re.findall, the code doesn't recognize the last substring. On the other hand, if the code is set to allow 2 mismatches with {e<=2}, the output is:
['TGC', 'TGA']
Is there another way to get all the substrings?
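For reference, the result I expect can be reproduced without regular expressions by a brute-force sliding window. This is only a sketch of the intended behavior, assuming a "mismatch" means a substitution (no insertions or deletions). The IUPAC table and the function name find_with_mismatches are illustrative and not part of the regex module; TRT stands for the same pattern as T[AG]T.

```python
# Illustrative lookup table (an assumption of this sketch):
# each IUPAC nucleotide code mapped to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_with_mismatches(pattern, seq, max_mismatches):
    """Return every window of seq that differs from pattern in at most
    max_mismatches positions (substitutions only, overlaps included)."""
    hits = []
    width = len(pattern)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Count positions where the base is not allowed by the pattern code.
        mismatches = sum(
            base not in IUPAC[code] for code, base in zip(pattern, window)
        )
        if mismatches <= max_mismatches:
            hits.append(window)
    return hits


s = 'ACTGCTGAGTCGT'
print(find_with_mismatches("TRT", s, 1))  # ['TGC', 'TGA', 'AGT', 'CGT']
```

This checks every start position independently, so overlapping hits are all reported, at a cost proportional to len(seq) * len(pattern).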