I'm trying to use regular expressions to find a substring in a string of DNA. The substring contains ambiguous bases, as in `ATCGR`, where `R` could be `A` or `G`. The script must also allow `x` mismatches. This is my code:
import regex
s = 'ACTGCTGAGTCGT'
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)
With one mismatch I would expect 3 substrings: AC**TGC**TGAGTCGT, ACTGC**TGA**GTCGT and ACTGCTGAGT**CGT**. The expected result should look like this:
['TGC', 'TGA', 'AGT', 'CGT']
But the output is:
['TGC', 'TGA']
Even using `re.findall`, the code doesn't recognize the last substring. On the other hand, if the code is set to allow 2 mismatches with `{e<=2}`, the output is:
['TGC', 'TGA']
Is there another way to get all the substrings?
If I understand correctly, you are looking for all three-letter substrings that match the pattern `T[GA]T`, allowing at most one error. I think the only kind of error you want is a character substitution, since you never mention two-letter results (`e` counts insertions and deletions as well as substitutions, while `s` counts substitutions only). To obtain the expected result, change `{e<=1}` to `{s<=1}` (or `{s<2}`) and apply it to the whole pattern, not only to the last letter, by enclosing the pattern in a group (capturing or non-capturing, as you prefer). Otherwise the constraint `{s<=1}` applies only to the last letter:
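To make the substitution-only idea concrete, here is a minimal sketch in plain Python that slides a three-letter window over the sequence and counts mismatching positions, followed by the grouped `regex` call for comparison. The helper name `find_with_substitutions` and the small `IUPAC` table are assumptions made for illustration and are not part of the `regex` module. The `regex` call is guarded so the sketch still runs when that third-party package is not installed.

```python
# Sketch only: `find_with_substitutions` and `IUPAC` are illustrative names,
# not part of the `regex` module.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG"}  # R = A or G


def find_with_substitutions(motif, seq, max_subs):
    """Return every window of len(motif) in seq that differs from the
    motif by at most max_subs substitutions (no insertions or deletions)."""
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        # A position is a mismatch when the base is not allowed by the motif symbol.
        subs = sum(base not in IUPAC[sym] for sym, base in zip(motif, window))
        if subs <= max_subs:
            hits.append(window)
    return hits


s = "ACTGCTGAGTCGT"
print(find_with_substitutions("TRT", s, 1))  # ['TGC', 'TGA', 'AGT', 'CGT']

# The same search with the third-party `regex` module: the fuzzy constraint
# follows a non-capturing group, so it covers the whole pattern.
try:
    import regex
    print(regex.findall(r"(?:T[AG]T){s<=1}", s, overlapped=True))
except ImportError:
    pass  # `regex` is optional for this sketch
```

The plain-Python version is slower than a compiled pattern on long sequences, but it is a simple way to check what a fuzzy pattern is expected to return.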