How do I query for records ordered by similarity?
Eg. searching for "Stock Overflow" would return
- Stack Overflow
- SharePoint Overflow
- Math Overflow
- Politic Overflow
- VFX Overflow
Eg. searching for "LO" would return:
- pabLO picasso
- michelangeLO
- jackson polLOck
What I need help with:
Using a search engine to index & search a MySQL table, for better results
Using full-text indexing, to find similar/containing strings
What does not work well
- Levenshtein distance is very erratic. (UDF, Query)
Searching for "dog" gives me:- dog
- bog
- ago
- big
- echo
LIKE
returns better results, but returns nothing for long queries although similar strings do exist- dog
- dogid
- dogaral
- dogma
1. Similarity
For Levenshtein in MySQL I found this, from
www.codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function2. Containing, case insensitive
Use the
LIKE
statement of MySQL, which is case insensitive by default. The%
is a wildcard, so there may be any string before and aftersearch_string
.3. Containing, case sensitive
The MySQL Manual helps:
My MySQL setup does not support
latin1_general_cs
orlatin1_bin
, but it worked fine for me to use the collationutf8_bin
as binary utf8 is case sensitive:2. / 3. sorted by Levenshtein Distance
I have found out that the Levenshtein distance may be good when you are searching a full string against another full string, but when you are looking for keywords within a string, this method does not return (sometimes) the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than english, so it is quite limited. You could get away with LIKE, but it's really for basic searches. You may want to look into other search methods for what you want to achieve. For example:
You may use Lucene as search base for your projects. It's implemented in most major programming languages and it'd quite fast and versatile. This method is probably the best, as it not only search for substrings, but also letter transposition, prefixes and suffixes (all combined). However, you need to keep a separate index (using CRON to update it from a independent script once in a while works though).
Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM, you can create a temporary table, then perform your fulltext search :
Use a data generator to generate some random data if you don't want to bother creating it yourself...
** NOTE ** : the column type should be
latin1_bin
to perform a case sensitive search instead of case insensitive withlatin1
. For unicode strings, I would recommendutf8_bin
for case sensitive andutf8_general_ci
for case insensitive searches.Read more about it from the MySQL API reference page
The downside to this is that it will not look for letter transposition or "similar, sounds like" words.
** UPDATE **
Using Lucene for your search, you will simply need to create a cron job (all web hosts have this "feature") where this job will simply execute a PHP script (i.g. "cd /path/to/script; php searchindexer.php") that will update the indexes. The reason being that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, but this is to ensure that all searches are performed as fast as possible. Therefore, you may want to create a delay job to be run by the server. It may be overnight, or in the next hour, this is up to you. The PHP script should look something like this:
Then, this is basically how you search (basic search) :
Here are great sites about Lucene in Java, PHP, and .Net.
In conclusion each search methods have their own pros and cons :
Please feel free to comment if I have forgotten/missed anything.
It seems that your definition of similarity is semantic similarity. So in order to build such a similarity function, you should use semantic similarity measures. Note that the scope of work on the issue might vary from few hours to years so it is recommended to decide on the scope before getting into work. I didn’t figure out which data do you have in order to build the similarity relation. I assume the you have access the a dataset of documents and a dataset of queries. You can start with co-occurrence of the words (e.g., conditional probability). You will discover quickly that you get the list of stop words as related the most of the words simply because they are very popular. Using the lift of conditional probability will take care of the stop words but will make the relation prone to error in small number (most of your cases). You might try Jacard but since it is symmetric there will be many relations it won't find. Then you might consider relations that appear only in short distance from the base word. You can (and should) consider relations base on general corpus's (e.g., Wikipedia) and user specific (e.g., his emails).
Very shortly you will have plenty of similarity measures, when all the measures are good and have some advantage over the others.
In order to combine such measures, I like to reduce the problem into a classification problem.
You should build a data set of paris of words and label them as "is related". In order to build a large labeled dataset you can:
Then use all the measures you have as features of the pairs. Now you are in the domain of supervised classification problem. Build a classifier on the data set, evaluated according to your needs and get a similarity measure that fits your needs.