We have a simple SQL problem here. In a varchar column, we wanted to search for a string anywhere in the field. What is the best way to implement this for performance? Obviously an index is not going to help here, any other tricks?
We are using MySQL and have about 3 million records. We need to execute many of these queries per second so really trying to implement these with the best performance.
The most simple way to do this is so far is:
Select * from table where column like '%search%'
I should further specify that the column is actually a long string like "sadfasdfwerwe" and I have to search for "asdf" in this column. So they are not sentences and trying to match a word in them. Would full text search still help here?
Check out my presentation Practical Fulltext Search in MySQL.
I compared:
LIKE
predicatesLIKE
)Today what I would use is Apache Solr, which puts Lucene into a service with a bunch of extra features and tools.
Re your comment: Aha, okay, no. None of the fulltext search capabilities I mentioned are going to help, since they all assume some kind of word boundaries
The other way to efficiently find arbitrary substrings is the N-gram approach. Basically, create an index of all possible sequences of N letters and point to the strings where each respective sequence occurs. Typically this is done with N=3, or a trigram, because it's a point of compromise between matching longer substrings and keeping the index to a manageable size.
I don't know of any SQL database that supports N-gram indexing transparently, but you could set it up yourself using an inverted index:
Now populate it the hard way:
Of course this will take quite a while! But once it's done, you can search much more quickly:
Of course you could be searching for patterns longer than three characters, but the inverted index still helps to narrow your search a lot:
mysql fulltext search's quality (for this purpose) is poor, if your language is not English
trigram search gives very good results, for this task
postgreSQL has trigram index, it's easy to use :)
but if you need to do it in mysql, try this, improved version of Bill Karwin's answer:
-each trigram is stored only once
-a simple php class uses the data
and usage:
I you want to match whole words, look at a
FULLTEXT
index &MATCH() AGAINST()
. And of course, take a load of your database server: cache results for a appropriate amount of time for you specific needs.First, maybe this is an issue with a badly designed table that stores a delimited string in one field instead of correctly designing to make a related table. If this is the case, you should fix your design.
If you have a field with long descriptive text (saya a notes field) and the search is always by whole word, you can do a full-text search.
Consider if you can require your users to at least give you the first character of what they are searching for if it is an ordinary field like Last_name.
Consider doing an exact match search first and only performing the wildcard match if no results are returned. This will work if you have users who can provide exact matches. We did this once with airport name searches, it came back really fast if they put inthe exact name and slower if they did not.
If you want to search just for strings that are not words that may be somewhere in the text, you are pretty much stuck with bad performance.