PostgreSQL: Find sentences closest to a given sent

2019-07-20 06:45发布

问题:

I have a table of images with sentence captions. Given a new sentence I want to find the images that best match it based on how close the new sentence is to the stored old sentences.

I know that I can use the @@ operator with a to_tsquery but tsquery accepts specific words as queries.

One problem is I don't know how to convert the given sentence into a meaningful query. The sentence may have punctuation and numbers.

However, I also feel that some kind of cosine similarity thing is what I need but I don't know how to get that out of PostgresQL. I am using the latest GA version and am happy to use the development version if that would solve my problem.

回答1:

Full Text Search (FTS)

You could use plainto_tsquery() to (per documentation) ...

produce tsquery ignoring punctuation

SELECT plainto_tsquery('english', 'Sentence: with irrelevant words (and punctuation) in it.')

 plainto_tsquery
------------------
 'sentenc' & 'irrelev' & 'word' & 'punctuat'

Use it like:

SELECT *
FROM   tbl
WHERE  to_tsvector('english', sentence) @@ plainto_tsquery('english', 'My new sentence');

But that is still rather strict and only provides very limited tolerance for similarity.

Trigram similarity

Might be better suited to search for similarity, even overcome typos to some degree.

Install the additional module pg_trgm, create a GiST index and use the similarity operator % in a nearest neighbour search:

Basically, with a trigram GiST index on sentence:

-- SELECT set_limit(0.3);  -- adjust tolerance if needed

SELECT *
FROM   tbl
WHERE  sentence % 'My new sentence'
ORDER  BY sentence <-> 'My new sentence'
LIMIT  10;

More:

  • Finding similar strings with PostgreSQL quickly
  • Finding similar posts with PostgreSQL
  • Slow fulltext search for terms with high occurence

Combine both

You can even combine FTS and trigram similarity:

  • PostgreSQL FTS and Trigram-similarity Query Optimization


回答2:

it's a pretty late answer, but I'm adding in case anyone encounters. If you add ": *" to the end of the words, it will bring up similar ones. Sample: JS autocomlete -> Codeigniter:

barcode = $ this-> input-> get ("term"). ":*";

Query: $ query = 'select * from tablaneme where xx @@? LIMIT 15 '; $ barcodequery = $ this-> db-> query ($ query, array (explode ("", $ barcode)))) -> result_array ();