String regex two mismatches Python

2019-04-30 00:46发布

How can I extend the code below to allow me to explore all instances where I have 2 mismatches or less between my substring and the parent string?

Substring: SSQP

String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ

Here is an example where only one possible mismatch is incorporated:

>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
>>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']

Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations.

How can I extend this code (or refactor this code) to explore the possibility of two mismatches?

Furthermore, I want to modify my output so that I get the numeric index returned (not SSQQ or SSQP) of the exact position the substring matched the string.

2条回答
做自己的国王
2楼-- · 2019-04-30 01:38

You don't have to use re here you can use itertools module instead and save a lot of memory.

You can first extract all sub-strings with length 4 then compare them with your substring and just select those that have less that 2 difference with your substring :

from itertools import izip,islice,tee

def sub_findre(s,substring,diffnumber):
    sublen=len(substring)
    zip_gen=(izip(substring,islice(s,i,i+sublen)) for i in xrange(len(s)))
    for z in zip_gen:
        l,z=tee(z)
        if sum(1 for i,j in l if i==j)>=sublen-diffnumber:
            new=izip(*z)
            next(new)
            yield ''.join(next(new))

Demo:

s='SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'

substring='SSQP'
print list(sub_findre(s,substring,2))

['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQ']

If you want to return the indices you need to put the indices in izip which you can use itertools.repeat() to repeat the index with the length of substring :

from itertools import izip,islice,tee,repeat

def sub_findre(s,substring,diffnumber):
    sublen=len(substring)
    zip_gen=(izip(substring,islice(s,i,i+sublen),repeat(i,sublen)) for i in xrange(len(s)))
    for z in zip_gen:
        l,z=tee(z)
        if sum(1 for i,j,_ in l if i==j)>=sublen-diffnumber:
            new=izip(*z)
            next(new)
            next(new)
            yield next(new)[0]

Demo:

print list(sub_findre(s,substring,2))
[0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62, 67]
查看更多
一夜七次
3楼-- · 2019-04-30 01:46

The combinatorial explosion is not that bad for two mismatches out of four.

First, observe that you can omit SSQP itself, since it's covered by all of the more lenient cases.

re.findall(r'(?=([A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)

So, the number of cases is

               4!
C(4, 1) = ––––––––––––– = 4
          1! * (4 - 1)!

For up to two mismatches, the number of cases is

               4!
C(4, 2) = ––––––––––––– = 6
          2! * (4 - 2)!

Namely,

re.findall('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)

(To simplify the illustration, I've taken the liberty of writing . instead of [A-Z].)


To get the positions of the matches instead of the text of the matches:

[match.start(0) for match in re.finditer('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)]
查看更多
登录 后发表回答