Python regex module fuzzy match: substitution coun

Posted 2019-07-23 23:23

Question:

The Python module regex allows fuzzy matching.

You can specify the allowed number of substitutions (s), insertions (i), deletions (d), and total errors (e).

The fuzzy_counts attribute of a match object returns a tuple such as (0,0,0), where:
match.fuzzy_counts[0] = the count for 's',
match.fuzzy_counts[1] = the count for 'i', and
match.fuzzy_counts[2] = the count for 'd'.
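To make the index positions easier to read, the tuple can be wrapped in a named tuple. This is only a minimal sketch: FuzzyCounts and label_counts are hypothetical names introduced here, not part of the regex module, and the field order simply follows the description above.

```python
from collections import namedtuple

# Hypothetical helper type (not part of the regex module).
# Field order follows the description above: s, i, d.
FuzzyCounts = namedtuple("FuzzyCounts", ["substitutions", "insertions", "deletions"])

def label_counts(counts):
    """Wrap a 3-tuple such as match.fuzzy_counts so each count has a name."""
    return FuzzyCounts(*counts)

# A plain tuple stands in for match.fuzzy_counts here.
labelled = label_counts((6, 0, 1))
print(labelled.substitutions, labelled.insertions, labelled.deletions)
```

With a real match object, the call would be label_counts(match.fuzzy_counts).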

The deletions and insertions are counted as expected, but the substitutions are not.

In the example below, the only difference is a single character deleted from the query, yet the substitution count is 6 (7 if the BESTMATCH flag is removed).

How are the substitutions counted?

I would be grateful if someone could explain how this works.

`import regex
reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<7,i<3,d<3,e<8}"
query = "TATGGACCAAAGTCTCAAGCCATGTG" 
match = regex.search(reference, query, regex.BESTMATCH)
print(match.fuzzy_counts)
# output: (6,0,1)`
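As a cross-check on what the "expected" counts would be, the sketch below computes one minimum-error alignment with a standard edit-distance table. It is not how the regex module searches, and it rests on assumptions: the pattern contains only literal characters and [..] classes, the whole pattern is aligned against the whole query, and ties are broken in favour of fewer substitutions. The function names parse_pattern and min_edit_counts are made up for this illustration.

```python
import re

def parse_pattern(pattern):
    """Split a pattern of literals and [..] classes into a list of character sets."""
    positions = []
    for cls, literal in re.findall(r"\[([^\]]+)\]|(.)", pattern):
        positions.append(set(cls) if cls else {literal})
    return positions

def min_edit_counts(pattern, text):
    """Return (substitutions, insertions, deletions) for one minimum-error
    alignment of the whole pattern against the whole text."""
    positions = parse_pattern(pattern)
    n, m = len(positions), len(text)
    # best[i][j] = (total, subs, ins, dels) for positions[:i] versus text[:j]
    best = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0, 0, 0, 0)
    for j in range(1, m + 1):
        best[0][j] = (j, 0, j, 0)   # extra text characters count as insertions
    for i in range(1, n + 1):
        best[i][0] = (i, 0, 0, i)   # unmatched pattern positions count as deletions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, ins, d = best[i - 1][j - 1]
            if text[j - 1] in positions[i - 1]:
                diagonal = (t, s, ins, d)            # exact match, no error
            else:
                diagonal = (t + 1, s + 1, ins, d)    # substitution
            t, s, ins, d = best[i][j - 1]
            insertion = (t + 1, s, ins + 1, d)
            t, s, ins, d = best[i - 1][j]
            deletion = (t + 1, s, ins, d + 1)
            # Tuples compare by total errors first, then by substitutions.
            best[i][j] = min(diagonal, insertion, deletion)
    return best[n][m][1:]

core = "TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG"
query = "TATGGACCAAAGTCTCAAGCCATGTG"
print(min_edit_counts(core, query))  # (0, 0, 1)
```

Under these assumptions the cheapest alignment needs a single deletion and no substitutions, which is the result the question expected from fuzzy_counts.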

Incidentally, match.fuzzy_counts may have been what this post was after: Python "regex" module: Fuzziness value

Answer 1:

The issue seems to be related to the error limits set in the fuzzy constraint.

Reducing the substitution limit to s<3 (and the total error limit to e<4) lowers the counts reported in the fuzzy_counts tuple:

reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<3,i<3,d<3,e<4}" 
query = "TATGGACCAAAGTCTCAAGCCATGTG"  
match = regex.search(reference, query, regex.BESTMATCH)
print(match.fuzzy_counts) 
(1,0,1)

Reducing the allowed number of substitutions even further, to s<2, returns the expected 's' count for this match:

reference = "(TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG){s<2,i<3,d<3,e<4}"
query = "TATGGACCAAAGTCTCAAGCCATGTG" 
match = regex.search(reference, query, regex.BESTMATCH)
print(match.fuzzy_counts)
(0,0,1)

Why it behaves this way is still a mystery to me.
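To repeat this comparison with other limits, the constraint string can be generated rather than edited by hand. The helper below is a small sketch; fuzzy_pattern is a hypothetical name, and it only builds the pattern text in the same {s<..,i<..,d<..,e<..} form used above.

```python
def fuzzy_pattern(core, s, i, d, e):
    """Wrap core in a group followed by strict upper bounds on each error type."""
    return f"({core}){{s<{s},i<{i},d<{d},e<{e}}}"

core = "TATGGGA[CT][GC]AAAG[CT]CT[AC]AA[GA]CCATGTG"

# Rebuild the three patterns tried above (question, then the two in this answer).
for s_limit, e_limit in [(7, 8), (3, 4), (2, 4)]:
    print(fuzzy_pattern(core, s_limit, 3, 3, e_limit))
```

Each generated string can then be passed to regex.search(pattern, query, regex.BESTMATCH) as in the snippets above, and the resulting fuzzy_counts compared side by side.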