String regex two mismatches Python

How can I extend the code below to allow me to explore all instances where I have 2 mismatches or less between my substring and the parent string?

Substring: SSQP

String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ

Here is an example where only one possible mismatch is incorporated:

>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
>>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']

Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations.

How can I extend this code (or refactor this code) to explore the possibility of two mismatches?

Furthermore, I want to modify my output so that I get the numeric index returned (not SSQQ or SSQP) of the exact position the substring matched the string.

标签： python regex string-matching

2条回答

做自己的国王

2楼-- · 2019-04-30 01:38

You don't have to use re here you can use itertools module instead and save a lot of memory.

You can first extract all sub-strings with length 4 then compare them with your substring and just select those that have less that 2 difference with your substring :

from itertools import izip,islice,tee

def sub_findre(s,substring,diffnumber):
    sublen=len(substring)
    zip_gen=(izip(substring,islice(s,i,i+sublen)) for i in xrange(len(s)))
    for z in zip_gen:
        l,z=tee(z)
        if sum(1 for i,j in l if i==j)>=sublen-diffnumber:
            new=izip(*z)
            next(new)
            yield ''.join(next(new))

Demo:

s='SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'

substring='SSQP'
print list(sub_findre(s,substring,2))

['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQ']

If you want to return the indices you need to put the indices in izip which you can use itertools.repeat() to repeat the index with the length of substring :

from itertools import izip,islice,tee,repeat

def sub_findre(s,substring,diffnumber):
    sublen=len(substring)
    zip_gen=(izip(substring,islice(s,i,i+sublen),repeat(i,sublen)) for i in xrange(len(s)))
    for z in zip_gen:
        l,z=tee(z)
        if sum(1 for i,j,_ in l if i==j)>=sublen-diffnumber:
            new=izip(*z)
            next(new)
            next(new)
            yield next(new)[0]

Demo:

print list(sub_findre(s,substring,2))
[0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62, 67]

0人赞添加讨论(0) 举报

一夜七次

3楼-- · 2019-04-30 01:46

The combinatorial explosion is not that bad for two mismatches out of four.

First, observe that you can omit SSQP itself, since it's covered by all of the more lenient cases.

re.findall(r'(?=([A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)

So, the number of cases is

               4!
C(4, 1) = ––––––––––––– = 4
          1! * (4 - 1)!

For up to two mismatches, the number of cases is

               4!
C(4, 2) = ––––––––––––– = 6
          2! * (4 - 2)!

Namely,

re.findall('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)

(To simplify the illustration, I've taken the liberty of writing . instead of [A-Z].)

To get the positions of the matches instead of the text of the matches:

[match.start(0) for match in re.finditer('(?=(SS..|S.Q.|S..P|.SQ.|.S.P|..QP))', s)]

0人赞添加讨论(0) 举报

String regex two mismatches Python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间