I want to use something like difflib.get_close_matches
but instead of the most similar strings, I would like to obtain the indexes (i.e. position in the list).
The indexes of the list are more flexible because one can relate the index to other data structures (related to the matched string).
For example, instead of:
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'hallo', 'Hallo']
I would like:
>>> difflib.get_close_matches('Hello', words)
[0, 1, 6]
There doesn't seem to exist a parameter to obtain this result, is there an alternative to difflib.get_close_matches()
that returns the indexes?
My research towards an alternative
I know I could use difflib.SequenceMatcher
, and then compare the strings one-to-one with ratio
(or quick_ratio
). However, I am afraid that this would be very inefficient, because:
I would have to create thousands of SequenceMatcher objects and compare them (I am expecting that
get_close_matches
avoid the use of the class):EDIT: False. I checked the source code of
get_close_matches
, it actually usesSequenceMatcher
.there is no cutoff (I am guessing that there is an optimization that avoids the calculation of the ratio for all the string)
EDIT: Partially False. The code is
get_close_matches
does not have any major optimizations, except it usesreal_quick_ratio
,quick_ratio
andratio
alltogether. In any case I can easily copy the optimization into my own function. Also I didn't consider that the SequenceMatcher has methods to set the sequences:set_seq1
,set_seq2
, so at least I won't have to create an object each time.as far as I understand, all python libraries are C compiled and this would increase performance.
EDIT: I am quite sure this is the case. The function is in the folder called cpython.
EDIT: There is a small difference (p-value is 0.030198) between executing directly from difflib and copy the function in a file mydifflib.py.
ipdb> timeit.repeat("gcm('hello', _vals)", setup="from difflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10) [13.230449825001415, 13.126462900007027, 12.965455356999882, 12.955717618009658, 13.066136312991148, 12.935014379996574, 13.082025538009475, 12.943519036009093, 13.149949093989562, 12.970130036002956] ipdb> timeit.repeat("gcm('hello', _vals)", setup="from mydifflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10) [13.363269686000422, 13.087718107010005, 13.112324478992377, 13.358293497993145, 13.283965317998081, 13.056695280989516, 13.021098569995956, 13.04310674899898, 13.024205000008806, 13.152750282009947]
Nevertheless it is not nearly as bad as I expected, I think I will proceed unless anybody know another library or alternative.