SQL query to find rows with the most matching keyw

2019-07-14 01:49发布

I'm really bad at SQL and I would like to know what SQL I can run to solve the problem below which I suspect to be a NP-Complete problem but I'm ok with the query taking a long time to run over large datasets as this will be done as a background task. A standard sql statement is preferred but if a stored procedure is required then so be it. The SQL is required to run on Postgres 9.3.

Problem: Given a set of articles that contain a set of keywords, find the top n articles for each article that contains the most number of matching keywords.

A trimmed down version of the article table looks like this:

CREATE TABLE article (
  id character varying(36) NOT NULL,  -- primary key of article
  keywords character varying,         -- comma separated set of keywords

  CONSTRAINT pk_article PRIMARY KEY (id)
);

-- Test Data
INSERT INTO article(id, keywords) VALUES(0, 'red,green,blue');
INSERT INTO article(id, keywords) VALUES(1, 'red,green,yellow');
INSERT INTO article(id, keywords) VALUES(2, 'purple,orange,blue');
INSERT INTO article(id, keywords) VALUES(3, 'lime,violet,ruby,teal');
INSERT INTO article(id, keywords) VALUES(4, 'red,green,blue,yellow');
INSERT INTO article(id, keywords) VALUES(5, 'yellow,brown,black');
INSERT INTO article(id, keywords) VALUES(6, 'black,white,blue');

Which would result in this for a SELECT * FROM article; query:

Table: article
------------------------
id  keywords            
------------------------
0   red,green,blue      
1   red,green,yellow    
2   purple,orange,blue  
3   lime,violet,ruby,teal
4   red,green,blue,yellow
5   yellow,brown,black
6   black,white,blue

Assuming I want to find the top 3 articles for each article that contains the most number of matching keywords then the output should be this:

------------------------
id  related
------------------------
0   4,1,6
1   4,0,5
2   0,4,6
3   null
4   0,1,6
5   1,6
6   5,0,4

2条回答
时光不老,我们不散
2楼-- · 2019-07-14 02:24

You can store lists in comma-separated strings. No problem, as long as this is just a string for you and you are not interested in its separate values. As soon as you are interested in the separate values, as in your example, store them separately.

This said, correct your database design and only then think about the query.

The following query selects all ID pairs first and counts common keywords. It then ranks the pairs by giving the other ID with the most keywords in common rank #1, etc. Then you keep only the three best matching IDs. STRING_AGG lists the best matching IDs in a string ordered by the number of keywords in common.

select 
  this_article as id,
  string_agg(other_article, ',' order by rn) as related
from
(
  select 
    this_article, 
    other_article, 
    row_number() over (partition by this_article order by cnt_common desc) as rn
  from
  (
    select 
      this.id as this_article, 
      other.id as other_article, 
      count(other.id) as cnt_common
    from keywords this
    left join keywords other on other.keyword = this.keyword and other.id <> this.id
    group by this.id, other.id
  ) pairs
) ranked
where rn <= 3
group by this_article
order by this_article;

Here is the SQL fiddle: http://sqlfiddle.com/#!15/1d20c/9.

查看更多
三岁会撩人
3楼-- · 2019-07-14 02:29

Like @a_horse commented: This would be simpler with a normalized design (besides making other tasks simpler/ cleaner), but still not trivial.

Also, a PK column of data type character varying(36) is highly suspicious (and inefficient) and should most probably be an integer type or at least a uuid instead.

Here is one possible solution based on your design as is:

WITH cte AS (
   SELECT id, string_to_array(a.keywords, ',') AS keys
   FROM   article a
   )
SELECT id, string_agg(b_id, ',') AS best_matches
FROM  (
   SELECT a.id, b.id AS b_id
        , row_number() OVER (PARTITION BY a.id ORDER BY ct.ct DESC, b.id) AS rn
   FROM   cte a
   LEFT   JOIN cte b ON a.id <> b.id AND a.keys && b.keys
   LEFT   JOIN LATERAL (
      SELECT count(*) AS ct
      FROM  (
         SELECT * FROM unnest(a.keys)
         INTERSECT ALL
         SELECT * FROM unnest(b.keys)
         ) i
      ) ct ON TRUE
   ORDER  BY a.id, ct.ct DESC, b.id  -- b.id as tiebreaker
   ) sub
WHERE  rn < 4
GROUP  BY 1;

SQL Fiddle (using an integer id instead).

The CTE cte converts the string into an array. You could even have a functional GIN index like that ...

If multiple rows tie for the top 3 picks, you need to define a tiebreaker. In my example, rows with smaller id come first.

Detailed explanation in this recent related answer:

To make this fast, you should at least have a GIN index on the array column (instead of the comma-separated string) and the query wouldn't need the CTE step. A completely normalized design has other advantages, but won't necessarily be faster than an array with GIN index.

查看更多
登录 后发表回答