Given two strings, text1 and text2:
public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
{
// DO SOMETHING HERE TO COMPARE
}
Examples:
First String: StackOverflow
Second String: StaqOverflow
Return: Similarity is 91%
The return value can be a percentage or something similar.
First String: The simple text test
Second String: The complex text test
Return: The values can be considered equal
Any ideas? What is the best way to do this?
Here is some code I wrote for a project I am working on. I need both the similarity ratio of the two strings and a similarity ratio based on their words. For the word-based measure, I want two values: the words similarity ratio relative to the smaller string (so if all of its words exist and match in the larger string, the result is 100%) and the words similarity ratio relative to the larger string (which I call RealWordsRatio). I use the Levenshtein algorithm to find the distance. The code is unoptimised so far, but it works as expected. I hope you find it useful.
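The idea described above can be sketched roughly as follows. This is not the original code from this answer, only a minimal illustration under some assumptions: the class name StringSimilarity, the method names, the 0.8 word-match threshold, and the choice to treat the "smaller" string as the one with fewer words are all hypothetical.

```csharp
using System;
using System.Linq;

public static class StringSimilarity
{
    // Levenshtein distance via dynamic programming, keeping only two rows.
    public static int LevenshteinDistance(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1), // insert, delete
                    previous[j - 1] + cost);                       // substitute
            }
            int[] swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Similarity in [0, 1]: 1 - distance / length of the longer string.
    public static double Ratio(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)LevenshteinDistance(a, b) / longest;
    }

    // Word-based ratios. A word counts as matched when some word of the other
    // string reaches the (assumed) threshold. wordsRatio is relative to the
    // string with fewer words, realWordsRatio to the one with more words.
    public static void WordRatios(string a, string b,
        out double wordsRatio, out double realWordsRatio, double threshold = 0.8)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        string[] wa = a.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] wb = b.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] shorter = wa.Length <= wb.Length ? wa : wb;
        string[] longer = wa.Length <= wb.Length ? wb : wa;

        if (longer.Length == 0) { wordsRatio = 1.0; realWordsRatio = 1.0; return; }
        if (shorter.Length == 0) { wordsRatio = 0.0; realWordsRatio = 0.0; return; }

        int matched = shorter.Count(w => longer.Any(x => Ratio(w, x) >= threshold));
        wordsRatio = (double)matched / shorter.Length;
        realWordsRatio = (double)matched / longer.Length;
    }
}
```

With this sketch, "StackOverflow" and "StaqOverflow" are at distance 2 (one substitution and one deletion). Note that there are several ways to normalise a distance into a percentage, so the exact figure depends on the formula you choose.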
The Perl module Text::Phonetic has implementations of various phonetic algorithms.
There are various ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.
I don't think any of those algorithms take sounds into consideration, however - so "staq overflow" would be as similar to "stack overflow" as "staw overflow" despite the first being more similar in terms of pronunciation.
I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.
If you want to compare phonetically, check out the Soundex and Metaphone algorithms: http://www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex
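To make the phonetic approach concrete, here is a minimal sketch of American Soundex in C#. It is an illustration only, not the code from the linked article; the class name Phonetic and the handling of non-letter characters (they are simply skipped) are assumptions.

```csharp
using System;
using System.Text;

public static class Phonetic
{
    // Soundex digit for each letter A..Z. '0' marks letters that are not
    // coded (vowels, H, W and Y).
    private const string Codes = "01230120022455012623010202";

    public static string Soundex(string word)
    {
        var result = new StringBuilder();
        char lastCode = '0';

        foreach (char raw in word.ToUpperInvariant())
        {
            if (raw < 'A' || raw > 'Z') continue; // assumption: skip non-letters

            char code = Codes[raw - 'A'];
            if (result.Length == 0)
            {
                // The first letter is kept as-is, but its code is remembered
                // so that an immediately repeated code is not emitted.
                result.Append(raw);
                lastCode = code;
            }
            else
            {
                if (code != '0' && code != lastCode) result.Append(code);
                // H and W do not separate two letters with the same code;
                // vowels do, because they reset lastCode to '0'.
                if (raw != 'H' && raw != 'W') lastCode = code;
            }

            if (result.Length == 4) break;
        }

        // Pad with zeros to the usual letter-plus-three-digits form.
        return result.Length == 0 ? "" : result.ToString().PadRight(4, '0');
    }
}
```

Two strings can then be treated as phonetically similar when their codes are equal. In this sketch "Stack" and "Staq" get the same code, while "Staw" gets a different one.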
Levenshtein distance is probably what you're looking for.
I wrote a Double Metaphone implementation in C# a while back. You'll find it vastly superior to Soundex and the like.
Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are also usually spelled similarly. I did an analysis of various fuzzy matching algorithms which you might also find useful.