We have a simple SQL problem here. In a varchar column, we wanted to search for a string anywhere in the field. What is the best way to implement this for performance? Obviously an index is not going to help here, any other tricks?
We are using MySQL and have about 3 million records. We need to execute many of these queries per second so really trying to implement these with the best performance.
The most simple way to do this is so far is:
Select * from table where column like '%search%'
I should further specify that the column is actually a long string like "sadfasdfwerwe" and I have to search for "asdf" in this column. So they are not sentences and trying to match a word in them. Would full text search still help here?
Check out my presentation Practical Fulltext Search in MySQL.
I compared:
LIKE
predicates
- Regular expression predicates (no better than
LIKE
)
- MyISAM FULLTEXT indexing
- Sphinx Search
- Apache Lucene
- Inverted indexing
- Google Custom Search Engine
Today what I would use is Apache Solr, which puts Lucene into a service with a bunch of extra features and tools.
Re your comment: Aha, okay, no. None of the fulltext search capabilities I mentioned are going to help, since they all assume some kind of word boundaries
The other way to efficiently find arbitrary substrings is the N-gram approach. Basically, create an index of all possible sequences of N letters and point to the strings where each respective sequence occurs. Typically this is done with N=3, or a trigram, because it's a point of compromise between matching longer substrings and keeping the index to a manageable size.
I don't know of any SQL database that supports N-gram indexing transparently, but you could set it up yourself using an inverted index:
create table trigrams (
trigram char(3) primary key
);
create table trigram_matches (
trigram char(3),
document_id int,
primary key (trigram, document_id),
foreign key (trigram) references trigrams(trigram),
foreign key (document_id) references mytable(document_id)
);
Now populate it the hard way:
insert into trigram_matches
select t.trigram, d.document_id
from trigrams t join mytable d
on d.textcolumn like concat('%', t.trigram, '%');
Of course this will take quite a while! But once it's done, you can search much more quickly:
select d.*
from mytable d join trigram_matches t
on t.document_id = d.document_id
where t.trigram = 'abc'
Of course you could be searching for patterns longer than three characters, but the inverted index still helps to narrow your search a lot:
select d.*
from mytable d join trigram_matches t
on t.document_id = d.document_id
where t.trigram = 'abc'
and d.textcolumn like '%abcdef%';
I you want to match whole words, look at a FULLTEXT
index & MATCH() AGAINST()
. And of course, take a load of your database server: cache results for a appropriate amount of time for you specific needs.
First, maybe this is an issue with a badly designed table that stores a delimited string in one field instead of correctly designing to make a related table. If this is the case, you should fix your design.
If you have a field with long descriptive text (saya a notes field) and the search is always by whole word, you can do a full-text search.
Consider if you can require your users to at least give you the first character of what they are searching for if it is an ordinary field like Last_name.
Consider doing an exact match search first and only performing the wildcard match if no results are returned. This will work if you have users who can provide exact matches. We did this once with airport name searches, it came back really fast if they put inthe exact name and slower if they did not.
If you want to search just for strings that are not words that may be somewhere in the text, you are pretty much stuck with bad performance.
mysql fulltext search's quality (for this purpose) is poor, if your language is not English
trigram search gives very good results, for this task
postgreSQL has trigram index, it's easy to use :)
but if you need to do it in mysql, try this, improved version of Bill Karwin's answer:
-each trigram is stored only once
-a simple php class uses the data
<?php
/*
# mysql table structure
CREATE TABLE `trigram2content` (
`trigram_id` int NOT NULL REFERENCES trigrams(id),
`content_type_id` int(11) NOT NULL,
`record_id` int(11) NOT NULL,
PRIMARY KEY (`content_type_id`,`trigram_id`,`record_id`)
);
#each trigram is stored only once
CREATE TABLE `trigrams` (
`id` int not null auto_increment,
`token` varchar(3) NOT NULL,
PRIMARY KEY (id),
UNIQUE token(token)
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
SELECT count(*), record_id FROM trigrams t
inner join trigram2content c ON t.id=c.trigram_id
WHERE (
t.token IN ('loc','ock','ck ','blo',' bl', ' bu', 'bur', 'urn')
AND c.content_type_id = 0
)
GROUP by record_id
ORDER BY count(*) DESC
limit 20;
*/
class trigram
{
private $dbLink;
var $types = array(
array(0, 'name'),
array(1, 'city'));
function trigram()
{
//connect to db
$this->dbLink = mysql_connect("localhost", "username", "password");
if ($this->dbLink) mysql_select_db("dbname");
else mysql_error();
mysql_query("SET NAMES utf8;", $this->dbLink);
}
function get_type_value($type_name){
for($i=0; $i<count($this->types); $i++){
if($this->types[$i][1] == $type_name)
return $this->types[$i][0];
}
return "";
}
function getNgrams($word, $n = 3) {
$ngrams = array();
$len = mb_strlen($word, 'utf-8');
for($i = 0; $i < $len-($n-1); $i++) {
$ngrams[] = mysql_real_escape_string(mb_substr($word, $i, $n, 'utf-8'), $this->dbLink);
}
return $ngrams;
}
/**
input: array('hel', 'ell', 'llo', 'lo ', 'o B', ' Be', 'Bel', 'ell', 'llo', 'lo ', 'o ')
output: array(1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 8)
*/
private function getTrigramIds(&$t){
$u = array_unique($t);
$q = "SELECT * FROM trigrams WHERE token IN ('" . implode("', '", $u) . "')";
$query = mysql_query($q, $this->dbLink);
$n = mysql_num_rows($query);
$ids = array(); //these trigrams are already in db, they have id
$ok = array();
for ($i=0; $i<$n; $i++)
{
$row = mysql_fetch_array($query, MYSQL_ASSOC);
$ok []= $row['token'];
$ids[ $row['token'] ] = $row['id'];
}
$diff = array_diff($u, $ok); //these trigrams are not yet in the db
foreach($diff as $n){
mysql_query("INSERT INTO trigrams (token) VALUES('$n')", $this->dbLink);
$ids[$n]= mysql_insert_id();
}
//so many ids than items (if a trigram occurs more times in input, then it will occur more times in output as well)
$result = array();
foreach($t as $n){
$result[]= $ids[$n];
}
return $result;
}
function insertData($id, $data, $type){
$t = $this->getNgrams($data);
$id = intval($id);
$type = $this->get_type_value($type);
$tIds = $this->getTrigramIds($t);
$q = "INSERT INTO trigram2content (trigram_id, content_type_id, record_id) VALUES ";
$rows = array();
foreach($tIds as $n => $tid){
$rows[]= "($tid, $type, $id)";
}
$q .= implode(", ", $rows);
mysql_query($q, $this->dbLink);
}
function updateData($id, $data, $type){
mysql_query("DELETE FROM trigram2content WHERE record_id=".intval($id)." AND content_type_id=".$this->get_type_value($type), $this->dbLink);
$this->insertData($id, $data, $type);
}
function search($str, $type){
$tri = $this->getNgrams($str);
$max = count($tri);
$q = "SELECT count(*), count(*)/$max as score, record_id FROM trigrams t inner join trigram2content c ON t.id=c.trigram_id
WHERE (
t.token IN ('" . implode("', '", $tri) . "')
AND c.content_type_id = ".$this->get_type_value($type)."
)
GROUP by record_id
HAVING score >= 0.6
ORDER BY count(*) DESC
limit 20;";
$query = mysql_query($q, $this->dbLink);
$n = mysql_num_rows($query);
$result = array();
for ($i=0; $i<$n; $i++)
{
$row = mysql_fetch_array($query, MYSQL_ASSOC);
$result[] = $row;
}
return $result;
}
};
and usage:
$t = new trigram();
$t->insertData(1, "hello bello", "name");
$t->insertData(2, "hellllo Mammmma mia", "name");
print_r($t->search("helo", "name"));