Here's a puzzle...
I have two databases of the same 50000+ electronic products and I want to match products in one database to those in the other. However, the product names are not always identical. I've tried using the Levenshtein distance for measuring the string similarity however this hasn't worked. For example,
-LG 42CS560 42-Inch 1080p 60Hz LCD HDTV
-LG 42 Inch 1080p LCD HDTV
These items are the same, yet their product names vary quite a lot.
On the other hand...
-LG 42 Inch 1080p LCD HDTV
-LG 50 Inch 1080p LCD HDTV
These are different products with very similar product names.
How should I tackle this problem?
My first thought is to try to parse the names into a description of features (company LG
, size 42 Inch
, resolution 1080p
, type LCD HDTV
). Then you can match these descriptions against each other for compatibility; it's okay to omit a product number but bad to have different sizes. Simple are-the-common-attributes-compatible might be enough, or you might have to write / learn rules about how much different attributes are allowed to differ and so on.
Depending on how many different kinds of products you have and how different the listed names are, I might actually start by manually defining a set of attributes and possibly even just adding specific words / regexes to match them, iteratively seeing what isn't been parsed so far and adding rules for that. I'd imagine there's not a lot of ambiguity in terms of one vocabulary item possibly belonging to multiple attributes, though without seeing your database I guess I don't know.
If that's not going to be feasible, this extraction is kind of analogous to semi-supervised part-of-speech tagging. It's somewhat different, though, in that I imagine the vocabulary is much more limited than typical parsing, and in that the space of product names is more heirarchical: the resolution
tag only applies to certain kinds of products. I'm not very familiar with that literature; there might be some ideas you could use.
Use a large set of training examples. For each possible pair in this example set:
- Parse the string for its components, viz. company, size_desc, display_type, make and so on.
- Find the distance between the same components between the two strings of a pair.
- Create a tuple of numbers representing the distance between the components.
- Label the tuple as identical/non-identical based on the strings in the pair as part of the training set.
- Feed the tuples and train a binary classifier (SVM).
Now, when you get a pair of strings for which you want to decide if they are same or not, extract the features like you did in the training set and create the tuple of numbers for the distance between the various components of the string. Feed the tuple to the trained SVM and classify if they are same or not.
The advantage of using a learning approach like this is that you don't have to keep modifying the rules over and over again, and also the system learns the differences between a large pair of products that are same and different.
You could use LibSVM package in WEKA to do this.
I don't know that much about machine learning, but I do know Levenshtein distance is not the best approach for this type of problem.
I am working on an extremely similar problem currently, and have found much more accurate matches using Largest Consecutive Sub Sequence (https://www.geeksforgeeks.org/longest-consecutive-subsequence).
You may also find Longest Common Substring helpful too (https://www.geeksforgeeks.org/longest-common-substring-dp-29/).
... Or maybe even a combination of both!
Levenshtein is not great because it allows for substitutions, which can easily discount similar strings which have extra characters.
For example, "Hello AAAAAA", "Hello", and "BBBBB".
"Hello" and "BBBBB" are closer by Levenshtein distance, even though you would probably like "Hello" to match with "Hello AAAAAA".
LCS and LSS do not allow substitutions, so with both of these methods, "Hello" would match with "Hello AAAAAA".