What is the best algorithm for matching two strings?

Published 2019-03-13 09:36

Question:

I'm comparing song titles, usually in Latin script (although not always). My aim is an algorithm that gives a high score if the two song titles seem to be the same title, and a very low score if they have nothing in common.

I already had Java code for this using Lucene and a RAMDirectory; however, using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to https://github.com/nickmancol/simmetrics, which has many nice algorithms for comparing two strings:

https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics

BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex

but I'm not well versed in these algorithms. What would be a good choice?

I think Lucene uses CosineSimilarity in some form, so that is my starting point but I think there might be something better.

Specifically, the algorithm should work on short strings and should understand the concept of words, i.e. spaces should be treated specially. Good matching of Latin script is most important, but good matching of other scripts such as Korean and Chinese is relevant as well, though I expect those would need a different algorithm because of the way they treat spaces.

Answer 1:

They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need.

I'm using JaccardSimilarity to match names. I chose it because it is reasonably fast and, for short strings, excels at matching names with common typos while quickly degrading the score for anything else. It gives extra weight to spaces and is insensitive to word order. I needed this behavior because the impact of a false positive was much higher than that of a false negative, spaces could be typos but not often, and word order was not that important.
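To make the behavior concrete, here is a minimal sketch of Jaccard similarity computed over word tokens (an assumed, simplified version; the simmetrics implementation may tokenize differently):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDemo {
    // Jaccard similarity over word tokens: |A ∩ B| / |A ∪ B|.
    // Word order is irrelevant because the tokens go into sets.
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("hey jude", "jude hey"));  // same words, any order -> 1.0
        System.out.println(jaccard("hey jude", "let it be")); // nothing shared -> 0.0
    }
}
```

Because each shared word counts fully and each unshared word counts fully against, the score drops quickly once the titles diverge, which matches the behavior described above.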

Note that this was done in combination with a simplifier that removes diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizer that standardizes all word-separator symbols to a single space. Finally, the names are parsed to pick out initials, prefixes, infixes, and suffixes, because names have a structure and format to them that is rather resistant to plain string comparison.
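A sketch of such a preprocessing pipeline in plain Java (my own assumed version, not the answerer's actual code), using `java.text.Normalizer` to decompose accented characters so the combining marks can be stripped:

```java
import java.text.Normalizer;

public class NameNormalizer {
    // Pipeline: 1) decompose accented characters and drop the combining marks,
    // 2) lowercase and map anything outside a-z to a separator,
    // 3) collapse runs of separators to a single space.
    static String normalize(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String ascii = decomposed.replaceAll("\\p{M}+", "");            // drop diacritic marks
        String lettersOnly = ascii.toLowerCase().replaceAll("[^a-z]+", " ");
        return lettersOnly.trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Beyoncé - Déjà Vu"));  // "beyonce deja vu"
    }
}
```

Running the similarity metric on normalized strings means accents, punctuation, and separator choices no longer count as differences.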

To make your choice, you need to make a list of the criteria you want and then look for an algorithm that satisfies those criteria. You can also build a reasonably large test set and run all the algorithms on it to see the trade-offs with respect to time, true positives, false positives, false negatives, true negatives, the classes of errors your system should handle, etc.

If you are still unsure of your choice, you can also set up your system to switch the exact comparison algorithm at run time. This allows you to do an A/B test and see which algorithm works best in practice.
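One hypothetical way to make the metric swappable at run time is to hide it behind a functional interface, so the calling code never names a concrete algorithm:

```java
import java.util.function.ToDoubleBiFunction;

public class SwitchableMatcher {
    // The metric is just a (String, String) -> double function;
    // any of the simmetrics algorithms could be adapted to this shape.
    private ToDoubleBiFunction<String, String> metric;

    SwitchableMatcher(ToDoubleBiFunction<String, String> metric) {
        this.metric = metric;
    }

    // Swap the algorithm at run time, e.g. per A/B-test group.
    void setMetric(ToDoubleBiFunction<String, String> metric) {
        this.metric = metric;
    }

    double score(String a, String b) {
        return metric.applyAsDouble(a, b);
    }

    public static void main(String[] args) {
        // Trivial placeholder metric: exact match scores 1, anything else 0.
        SwitchableMatcher m = new SwitchableMatcher((a, b) -> a.equals(b) ? 1.0 : 0.0);
        System.out.println(m.score("hey jude", "hey jude"));
        // Replace the metric without touching the calling code.
        m.setMetric((a, b) -> a.length() == b.length() ? 1.0 : 0.0);
        System.out.println(m.score("abc", "xyz"));
    }
}
```

With this shape, the A/B test is just a matter of which lambda (or adapted library metric) each request group gets.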

TL;DR: which algorithm you want depends on what you need. If you don't know what you need, make sure you can change it later on and run tests on the fly.



Answer 2:

You likely need to solve a string-to-string correction problem. The Levenshtein distance algorithm is implemented in many languages. Before running it, I'd remove all spaces from the strings, because they don't carry much information but may influence the difference between the two strings. For string search, prefix trees are also useful; you can have a look in this direction as well. For example here or here. This was already discussed on SO. If spaces are significant in your case, just assign a greater weight to them.



Answer 3:

Each algorithm is going to focus on a similar, but slightly different aspect of the two strings. Honestly, it depends entirely on what you are trying to accomplish. You say that the algorithm needs to understand words, but should it also understand interactions between those words? If not, you can just break up each string according to spaces, and compare each word in the first string to each word in the second. If they share a word, the commonality factor of the two strings would need to increase.

In this way, you could create your own algorithm that focused only on what you were concerned with. If you want to test another algorithm that someone else made, you can find examples online and run your data through to see how accurate the estimated commonality is with each.
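A minimal sketch of the roll-your-own approach described above: split on spaces and count shared words. Scaling by the longer title's word count is my own assumption, to keep the score in [0, 1]:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordOverlap {
    // Commonality = number of distinct shared words,
    // divided by the larger of the two distinct word counts.
    static double commonality(String a, String b) {
        List<String> wa = Arrays.asList(a.toLowerCase().split("\\s+"));
        Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        long shared = wa.stream().distinct().filter(wb::contains).count();
        int longest = Math.max((int) wa.stream().distinct().count(), wb.size());
        return longest == 0 ? 0.0 : (double) shared / longest;
    }

    public static void main(String[] args) {
        System.out.println(commonality("let it be", "let it bleed")); // 2 of 3 words shared
    }
}
```

From here you can bias the score however your use case demands, for example weighting rare words more heavily than common ones.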

I think http://jtmt.sourceforge.net/ would be a good place to start.



Answer 4:

Interesting. Have you thought about a radix sort?

http://en.wikipedia.org/wiki/Radix_sort

The concept behind radix sort is that it is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping the keys by their individual digits. If you convert each string into an array of characters, each character code will be a number of no more than 3 digits, so your k = 3 (maximum number of digits) and your n = number of strings to compare. This will sort on the first digits of all your strings. Then you have another factor, s = the length of the longest string. Your worst-case scenario for sorting would be 3*n*s and the best case would be (3 + n) * s. Check out some radix sort examples for strings here:

http://algs4.cs.princeton.edu/51radix/LSD.java.html

http://users.cis.fiu.edu/~weiss/dsaajava3/code/RadixSort.java



Answer 5:

Did you take a look at the Levenshtein distance?

int org.apache.commons.lang.StringUtils.getLevenshteinDistance(String s, String t)

Find the Levenshtein distance between two Strings.

This is the number of changes needed to change one String into another, where each change is a single character modification (deletion, insertion or substitution).

The previous implementation of the Levenshtein distance algorithm was from http://www.merriampark.com/ld.htm

Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError which can occur when my Java implementation is used with very large strings. This implementation of the Levenshtein distance algorithm is from http://www.merriampark.com/ldjava.htm
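For reference, the quantity described above can be computed with the textbook dynamic-programming recurrence. This is a self-contained sketch of what getLevenshteinDistance computes, not Commons Lang's actual code:

```java
public class LevenshteinDemo {
    // Two-row dynamic-programming Levenshtein distance:
    // the minimum number of single-character insertions,
    // deletions, and substitutions turning s into t.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;    // inserting j characters
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;                                      // deleting i characters
            for (int j = 1; j <= t.length(); j++) {
                int substCost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + substCost);  // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

Keeping only two rows instead of the full matrix is the same memory-saving idea that the Chas Emerick implementation uses to avoid the OutOfMemoryError on very large strings.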

Anyway, I'm curious to know what you choose in this case!



Answer 6:

I wonder if you are not trying to reinvent the wheel. Some database management systems offer services/features that are designed for this kind of task.

E.g. Oracle Text

EDIT:

Hold on! You are going to compare the titles of songs and you are NOT using a database? That's interesting. Why didn't you mention that in your original question? Over 90% of industrial applications use some kind of database, and it really does not matter which branch you are in: manufacturing, medical, distribution, entertainment, financial, ...

And even if you don't use a database yet, you should consider using one. These days there are all kinds of DBMSs. They come in all flavors: relational or object-oriented; binary or XML; embedded or stand-alone; multi-file or single-file. To be honest, if you don't use a database, it's hard to take you seriously.

But if you are using an Oracle database to store your songs, then Oracle Text is the best answer to your problem.

And if you use a database, then it makes perfect sense to let the dbms perform the calculations for you. This will almost always be faster than extracting the data first.

Why Oracle Text (for example) is superior to self-implemented algorithms: Oracle has a notion of "themes"; for example, it knows that the word "politics" is related to "elections". To do this, it uses a knowledge base. (Just read the documentation and you will be surprised.) You would spend years developing that from scratch.