Is there an alternative to `difflib.get_close_matc

2019-05-13 18:19发布

问题:

I want to use something like difflib.get_close_matches but instead of the most similar strings, I would like to obtain the indexes (i.e. position in the list).

The indexes of the list are more flexible because one can relate the index to other data structures (related to the matched string).

For example, instead of:

>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'hallo', 'Hallo']

I would like:

>>> difflib.get_close_matches('Hello', words)
[0, 1, 6] 

There doesn't seem to exist a parameter to obtain this result, is there an alternative to difflib.get_close_matches() that returns the indexes?


My research towards an alternative

I know I could use difflib.SequenceMatcher, and then compare the strings one-to-one with ratio (or quick_ratio). However, I am afraid that this would be very inefficient, because:

  1. I would have to create thousands of SequenceMatcher objects and compare them (I am expecting that get_close_matches avoid the use of the class):

    EDIT: False. I checked the source code of get_close_matches, it actually uses SequenceMatcher.

  2. there is no cutoff (I am guessing that there is an optimization that avoids the calculation of the ratio for all the string)

    EDIT: Partially False. The code is get_close_matches does not have any major optimizations, except it uses real_quick_ratio, quick_ratio and ratio alltogether. In any case I can easily copy the optimization into my own function. Also I didn't consider that the SequenceMatcher has methods to set the sequences: set_seq1, set_seq2, so at least I won't have to create an object each time.

  3. as far as I understand, all python libraries are C compiled and this would increase performance.

    EDIT: I am quite sure this is the case. The function is in the folder called cpython.

    EDIT: There is a small difference (p-value is 0.030198) between executing directly from difflib and copy the function in a file mydifflib.py.

    ipdb> timeit.repeat("gcm('hello', _vals)", setup="from difflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
    [13.230449825001415, 13.126462900007027, 12.965455356999882, 12.955717618009658, 13.066136312991148, 12.935014379996574, 13.082025538009475, 12.943519036009093, 13.149949093989562, 12.970130036002956]
    
    ipdb> timeit.repeat("gcm('hello', _vals)", setup="from mydifflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
    [13.363269686000422, 13.087718107010005, 13.112324478992377, 13.358293497993145, 13.283965317998081, 13.056695280989516, 13.021098569995956, 13.04310674899898, 13.024205000008806, 13.152750282009947]
    

Nevertheless it is not nearly as bad as I expected, I think I will proceed unless anybody know another library or alternative.

回答1:

I took the source code for get_close_matches, and modify it in order to return the indexes instead of the string values.

# mydifflib.py
from difflib import SequenceMatcher
from heapq import nlargest as _nlargest

def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return a list of the indexes of the best 
    "good enough" matches. word is a sequence for which close matches 
    are desired (typically a string).
    possibilities is a list of sequences against which to match word
    (typically a list of strings).
    Optional arg n (default 3) is the maximum number of close matches to
    return.  n must be > 0.
    Optional arg cutoff (default 0.6) is a float in [0, 1].  Possibilities
    that don't score at least that similar to word are ignored.
    """

    if not n >  0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for idx, x in enumerate(possibilities):
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), idx))

    # Move the best scorers to head of list
    result = _nlargest(n, result)

    # Strip scores for the best n matches
    return [x for score, x in result]

Usage

>>> from mydifflib import get_close_matches_indexes
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> get_close_matches_indexes('hello', words)
[0, 1, 6] 

Now, I can relate this indexes to associated data of the string without having to search back the strings.