How can I extend the code below to allow me to explore all instances where I have 2 mismatches or less between my substring and the parent string?
Substring: SSQP
String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ
Here is an example where only one possible mismatch is incorporated:
>>> import re
>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
>>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']
Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations.
How can I extend this code (or refactor this code) to explore the possibility of two mismatches?
Furthermore, I want to modify my output so that I get the numeric index returned (not SSQQ or SSQP) of the exact position where the substring matched the string.
You don't have to use re here; you can use the itertools module instead and save a lot of memory. You can first extract all substrings of length 4, then compare each of them with your substring and keep only those that differ from it in at most 2 positions.
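A minimal sketch of that idea, assuming Python 3 (where the built-in zip is lazy and takes the place of izip). The function name fuzzy_substrings and the max_mismatches parameter are made up for illustration:

```python
def fuzzy_substrings(s, substring, max_mismatches=2):
    """Return every window of s that differs from substring
    in at most max_mismatches positions."""
    n = len(substring)
    # Lazily generate all overlapping windows of length n.
    windows = (s[i:i + n] for i in range(len(s) - n + 1))
    # Count position-wise differences (Hamming distance) per window.
    return [w for w in windows
            if sum(a != b for a, b in zip(w, substring)) <= max_mismatches]


print(fuzzy_substrings('SSPQSSQP', 'SSQP'))  # ['SSPQ', 'SPQS', 'SSQP']
```

The windows are produced by a generator, so only one candidate is held in memory at a time.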
If you want to return the indices, you also need to put the indices in izip; you can use itertools.repeat() to repeat each index for the length of substring.
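Here is one way the index version could look. Note that this sketch swaps in a simpler technique: it iterates over the start positions directly with range() rather than pairing each character with a repeated index, and it assumes Python 3. The name fuzzy_positions is hypothetical:

```python
def fuzzy_positions(s, substring, max_mismatches=2):
    """Return the start index of every window of s that differs
    from substring in at most max_mismatches positions."""
    n = len(substring)
    return [i for i in range(len(s) - n + 1)
            # Compare the window starting at i with substring, char by char.
            if sum(a != b for a, b in zip(s[i:i + n], substring)) <= max_mismatches]


print(fuzzy_positions('SSPQSSQP', 'SSQP'))  # [0, 1, 4]
```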
The combinatorial explosion is not that bad for two mismatches out of four.
First, observe that you can omit SSQP itself, since it's covered by all of the more lenient cases. So, for one mismatch, the number of cases is C(4, 1) = 4.
For up to two mismatches, the number of cases is C(4, 2) = 6.
Namely, SS.., S.Q., S..P, .SQ., .S.P and ..QP.
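If you'd rather not type the alternatives by hand, they can be generated. This is only a sketch: the helper name mismatch_pattern and its parameters are invented here, and it assumes the number of mismatches does not exceed the length of the substring. It uses itertools.combinations to pick which positions become wildcards:

```python
import re
from itertools import combinations


def mismatch_pattern(substring, mismatches=2, wildcard='.'):
    """Build a lookahead regex matching substring with exactly
    `mismatches` positions replaced by a wildcard (which also
    covers every case with fewer mismatches)."""
    alternatives = []
    for positions in combinations(range(len(substring)), mismatches):
        # Replace the chosen positions with the wildcard, keep the rest literal.
        chars = [wildcard if i in positions else re.escape(c)
                 for i, c in enumerate(substring)]
        alternatives.append(''.join(chars))
    # The lookahead lets overlapping matches be reported.
    return '(?=(' + '|'.join(alternatives) + '))'


pattern = mismatch_pattern('SSQP')
print(pattern)                          # (?=(..QP|.S.P|.SQ.|S..P|S.Q.|SS..))
print(re.findall(pattern, 'SSPQSSQP'))  # ['SSPQ', 'SPQS', 'SSQP']
```

Passing wildcard='[A-Z]' should give the stricter character class used in the question.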
(To simplify the illustration, I've taken the liberty of writing . instead of [A-Z].)
To get the positions of the matches instead of the text of the matches, one option is re.finditer(), reading start() from each match object.
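As a rough sketch of the position lookup, assuming the six-alternative pattern for up to two mismatches (the function name match_positions is made up for this example):

```python
import re

# Lookahead with the six two-wildcard variants of SSQP.
PATTERN = re.compile(r'(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))')


def match_positions(s):
    """Return the start index of every (possibly overlapping) match."""
    # Each lookahead match is zero-width, so m.start() is the index
    # at which the four-character candidate begins.
    return [m.start() for m in PATTERN.finditer(s)]


print(match_positions('SSPQSSQP'))  # [0, 1, 4]
```

If the matched text is wanted as well, m.group(1) holds it for each match.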