I would like to approximately match strings using locality-sensitive hashing. I have many strings (>10M) that may contain typos. For every string I would like to compare it with all the other strings and select those whose edit distance is below some threshold.
That is, the naive solution requires O(n^2) comparisons. To avoid that, I was thinking of using locality-sensitive hashing. Then near-similar strings would end up in the same buckets and I would only need to search inside each bucket, which is O(n*C) where C is the bucket size.
However, I do not understand how to represent the strings. If it were text, I would represent it in vector space. My main question is whether this is tractable using LSH, and if so, what an appropriate vector representation of a string would be.
Can I use an already implemented library for this task, or does it depend on my problem so that I must implement it myself? Is there any Python package that does this?
The best academic resource I've found on the subject is Chapter 3 of Mining of Massive Datasets, which gives an awesome overview of locality sensitive hashing and minhashing.
Very briefly, the idea is to take your strings, break each one into a set of shingles (overlapping character n-grams), and minhash those sets into short signature vectors. Then pass a sliding window over the resulting signatures; if two signatures have the same values in the same window position, mark the pair as candidates for more fine-grained similarity analysis.
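For intuition, here is a minimal sketch of the shingling and minhashing steps (the shingles and minhash_signature helpers are hypothetical names, and XOR-ing Python's built-in hash with random masks is a toy stand-in for proper independent hash functions):

```python
import random

def shingles(s, k=3):
    """Overlapping character k-grams: 'typo' -> {'typ', 'ypo'} for k=3."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100, seed=1):
    """One minimum per simulated hash function. The fraction of
    positions where two signatures agree estimates the Jaccard
    similarity of the underlying shingle sets."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(sh) ^ m for sh in shingle_set) for m in masks]

sig_a = minhash_signature(shingles("locality sensitive hashing"))
sig_b = minhash_signature(shingles("locality sensitve hashing"))  # typo
agree = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
print(f"estimated Jaccard similarity: {agree:.2f}")  # high for near-duplicate strings
```

Two strings that differ by a typo share most of their shingles, so their signatures agree in most positions; LSH then turns "agree in many positions" into "collide in at least one bucket" so you never compare all pairs.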
There's a great implementation in the Python datasketch library (pip install datasketch). Here's an example that shows how you can catch fuzzy string similarity:
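A minimal sketch using datasketch's MinHash and MinHashLSH classes (the sample strings, the shingles and make_minhash helpers, and the 0.5 threshold are illustrative assumptions, not part of the library):

```python
from datasketch import MinHash, MinHashLSH

def shingles(s, k=3):
    """Overlapping character k-grams, e.g. 'typos' -> {'typ', 'ypo', 'pos'}."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def make_minhash(s, num_perm=128):
    """MinHash signature of the string's shingle set."""
    m = MinHash(num_perm=num_perm)
    for sh in shingles(s):
        m.update(sh.encode("utf8"))
    return m

strings = [
    "locality sensitive hashing",
    "locality sensitve hashing",    # one-character typo
    "a completely different string",
]

# The threshold is on estimated Jaccard similarity of the shingle sets,
# not edit distance; you will need to tune it against your own data.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
sigs = {}
for i, s in enumerate(strings):
    sigs[i] = make_minhash(s)
    lsh.insert(i, sigs[i])

# All indexed strings whose estimated similarity to strings[0] exceeds
# the threshold -- the typo'd variant should be among the candidates.
print(lsh.query(sigs[0]))
```

On this toy data the query should return keys 0 and 1: the string and its typo'd variant land in the same bucket, while the unrelated string does not. You then only need to run the exact edit-distance check on those candidate pairs rather than on all n^2 pairs.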