Ambiguous substring with mismatches

Posted 2019-07-31 17:59

I'm trying to use regular expressions to find a substring in a string of DNA. The substring can contain ambiguous bases, as in ATCGR, where R could be A or G. The script must also allow up to x mismatches. This is my code:

import regex

s = 'ACTGCTGAGTCGT'    
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)

So, with one mismatch I would expect 4 substrings: AC**TGC**TGAGTCGT, ACTGC**TGA**GTCGT, ACTGCTG**AGT**CGT and ACTGCTGAGT**CGT**. The expected result should look like this:

['TGC', 'TGA', 'AGT', 'CGT']

But the output is

['TGC', 'TGA']

Even using re.findall, the code doesn't recognize the last substring. On the other hand, if the code is set to allow 2 mismatches with {e<=2}, the output is

['TGC', 'TGA']

Is there another way to get all the substrings?

1 answer
啃猪蹄的小仙女
Reply #2 · 2019-07-31 18:41

If I understand correctly, you are looking for all three-letter substrings that match the pattern T[GA]T with at most one error. I think the only kind of error you want is a character substitution, since you never mention two-letter results, which a deletion would produce.

To obtain the expected result, change {e<=1} to {s<=1} (or {s<2}) and apply it to the whole pattern rather than only the last letter, by enclosing the pattern in a group (capturing or non-capturing, as you prefer). Otherwise the constraint {s<=1} applies only to the last letter:

regex.findall(r'(T[AG]T){s<=1}', s, overlapped=True)
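
If you want to cross-check the result without the regex module, here is a minimal sketch using only the standard library. It slides a window over the sequence and counts substitutions against the pattern. The function name fuzzy_windows and the IUPAC table are my own illustration, not part of any library, and it assumes substitutions are the only kind of error you allow (no insertions or deletions).

```python
# Standard IUPAC nucleotide ambiguity codes mapped to the bases they allow.
# The table and the helper name below are illustrative, not a library API.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def fuzzy_windows(pattern, sequence, max_subs=1):
    """Return (start, window) for each window with at most max_subs substitutions.

    Windows overlap, and only substitutions are counted as errors.
    """
    k = len(pattern)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        # A position mismatches when the base is not allowed by the pattern code.
        mismatches = sum(
            base not in IUPAC[code] for code, base in zip(pattern, window)
        )
        if mismatches <= max_subs:
            hits.append((start, window))
    return hits


s = "ACTGCTGAGTCGT"
# "TRT" is the IUPAC spelling of T[AG]T.
print([window for _, window in fuzzy_windows("TRT", s, max_subs=1)])
# ['TGC', 'TGA', 'AGT', 'CGT']
```

Writing the pattern as TRT instead of T[AG]T keeps the ambiguity codes in one table, and returning the start positions as well makes it easy to see where each window sits in the sequence.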