I have a table of images with sentence captions. Given a new sentence I want to find the images that best match it based on how close the new sentence is to the stored old sentences.
I know that I can use the @@
operator with a to_tsquery
but tsquery
accepts specific words as queries.
One problem is I don't know how to convert the given sentence into a meaningful query. The sentence may have punctuation and numbers.
However, I also feel that some kind of cosine similarity thing is what I need but I don't know how to get that out of PostgresQL. I am using the latest GA version and am happy to use the development version if that would solve my problem.
Full Text Search (FTS)
You could use plainto_tsquery()
to (per documentation) ...
produce tsquery
ignoring punctuation
SELECT plainto_tsquery('english', 'Sentence: with irrelevant words (and punctuation) in it.')
plainto_tsquery
------------------
'sentenc' & 'irrelev' & 'word' & 'punctuat'
Use it like:
SELECT *
FROM tbl
WHERE to_tsvector('english', sentence) @@ plainto_tsquery('english', 'My new sentence');
But that is still rather strict and only provides very limited tolerance for similarity.
Trigram similarity
Might be better suited to search for similarity, even overcome typos to some degree.
Install the additional module pg_trgm
, create a GiST index and use the similarity operator %
in a nearest neighbour search:
Basically, with a trigram GiST index on sentence
:
-- SELECT set_limit(0.3); -- adjust tolerance if needed
SELECT *
FROM tbl
WHERE sentence % 'My new sentence'
ORDER BY sentence <-> 'My new sentence'
LIMIT 10;
More:
- Finding similar strings with PostgreSQL quickly
- Finding similar posts with PostgreSQL
- Slow fulltext search for terms with high occurence
Combine both
You can even combine FTS and trigram similarity:
- PostgreSQL FTS and Trigram-similarity Query Optimization
it's a pretty late answer, but I'm adding in case anyone encounters. If you add ": *" to the end of the words, it will bring up similar ones.
Sample:
JS autocomlete -> Codeigniter:
barcode = $ this-> input-> get ("term"). ":*";
Query:
$ query = 'select * from tablaneme where xx @@? LIMIT 15 ';
$ barcodequery = $ this-> db-> query ($ query, array (explode ("", $ barcode)))) -> result_array ();