Effective query merging more than 2 subqueries

I have a database with

books          (primary key: bookID)
characterNames (foreign key: books.bookID) 
locations      (foreign key: books.bookID)

The in-text-position of character names and locations are saved in the corresponding tables.
I'm writing a Pythonscript using psycopg2, finding all occurences of given character names and locations in books. I only want the occurences in books, where both the character name AND the location are found.
Here I already got a solution for searching one location and one character:

WITH b AS (  
    SELECT bookid  
    FROM   characternames  
    WHERE  name = 'XXX'  
    GROUP  BY 1  
    INTERSECT  
    SELECT bookid  
    FROM   locations  
    WHERE  l.locname = 'YYY'  
    GROUP  BY 1  
    )  
SELECT bookid, position, 'char' AS what  
FROM   b  
JOIN   characternames USING (bookid)  
WHERE  name = 'XXX'  
UNION  ALL  
SELECT bookid, position, 'loc' AS what  
FROM   b  
JOIN   locations USING (bookid)  
WHERE  locname = 'YYY'  
ORDER  BY bookid, position;

The CTE 'b' contains all bookid's, where the character name 'XXX' and the location 'YYY' appear.

Now I'm additionally wondering about searching for 2 places and a name (or 2 names and a place respectively). It's simple if all searched entities must occur in one book, but what about this:
Searching for: Tim, Al, Toolshop Results: books including
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)

The problem could be repeated for 4, 5, 6...conditions.
I thougt about INTERSECTing more subqueries, but that wouldn't work.
Instead I would UNION the found bookIDs, GROUP them and select bookid's occurring more then once:

WITH b AS (  
    SELECT bookid, count(bookid) AS occurrences  
    FROM  
        (SELECT DISTINCT bookid  
        FROM characterNames  
        WHERE name='XXX'  
        UNION  
        SELECT DISTINCT bookid  
        FROM characterNames  
        WHERE name='YYY'  
        UNION  
        SELECT DISTINCT bookid  
        FROM locations  
        WHERE locname='ZZZ'  
        GROUP BY bookid)  
    WHERE occurrences>1)

I think this works, can't test it at the moment, but is it the best way to do this?

标签： python postgresql psycopg2

1条回答

Root（大扎）

2楼-- · 2019-06-26 00:12

The idea to use a count for the generalized case is sound. A couple of adjustments to the syntax, though:

WITH b AS (  
   SELECT bookid
   FROM  (
      SELECT DISTINCT bookid  
      FROM   characterNames  
      WHERE  name='XXX'  

      UNION ALL  
      SELECT DISTINCT bookid  
      FROM   characterNames  
      WHERE  name='YYY'  

      UNION ALL
      SELECT DISTINCT bookid  
      FROM   locations  
      WHERE  locname='ZZZ'  
      ) x
   GROUP  BY bookid
   HAVING count(*) > 1
   )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name = 'XXX'

UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'YYY'
ORDER  BY bookid, position;

Notes

Use UNION ALL (not UNION) to preserve duplicates between the subqueries. You want them in this case to be able to count them.
The subqueries are supposed to produces distinct values. It works with DISTINCT the way you have it. You may want to try GROUP BY 1 instead and see if that performs better (I don't expect it to.)
The GROUP BY hast to go outside the subquery. It would only be applied to the last subquery and makes no sense there as you have DISTINCT bookid already.
The check whether there are more than one hits on a book has to go into a HAVING clause:
```
 HAVING count(*) > 1
```
You can not use aggregated values in a WHERE clause.

Combining conditions on one table

You cannot simply combine multiple conditions on one table. How will you count the number of findings? But there is a somewhat more sophisticated way. May or may not improve performance, You'll have to test (with EXPLAIN ANALYZE). Both queries require at least two index scans for the table characterNames. At least it shortens the syntax.

Consider how I compute the number of hits for characterNames and how I changed to sum(hits) in the outer SELECT:

WITH b AS (  
   SELECT bookid
   FROM  (
      SELECT bookid
           , max((name='XXX')::int)
           + max((name='YYY')::int) AS hits
      FROM   characterNames  
      WHERE  (name='XXX'
           OR name='YYY')
      GROUP  BY bookid

      UNION ALL
      SELECT DISTINCT bookid, 1 AS hits  
      FROM   locations  
      WHERE  locname='ZZZ'  
      ) x
   GROUP  BY bookid
   HAVING sum(hits) > 1
   )
...

Converting a boolean to integer gives 0 for FALSE and 1 for TRUE. That helps.

Faster with EXISTS

While riding my bike to my company this thing kept kicking at the back of my head. I have reason to believe this query might be even faster. Please give it a try:

WITH b AS (  
   SELECT bookid

        , (EXISTS (
            SELECT *
            FROM   characterNames c
            WHERE  c.bookid = b.bookid
            AND    c.name = 'XXX'))::int
        + (EXISTS (
            SELECT *
            FROM   characterNames c
            WHERE  c.bookid = b.bookid
            AND    c.name = 'YYY'))::int AS c_hits

        , (EXISTS (
            SELECT *
            FROM   locations l
            WHERE  l.bookid = b.bookid
            AND    l.locname='ZZZ'))::int AS l_hits
   FROM   books b  
   WHERE  (c_hits + l_hits) > 1
   )
SELECT c.bookid, c.position, 'char' AS what
FROM   b
JOIN   characternames c USING (bookid)
WHERE  b.c_hits > 0
AND    c.name IN ('XXX', 'YYY')

UNION  ALL
SELECT l.bookid, l.position, 'loc' AS what
FROM   b
JOIN   locations l USING (bookid)
WHERE  b.l_hits > 0
AND    l.locname = 'YYY'
ORDER  BY 1,2,3;

The EXISTS semi-join can stop executing at the first match. As we are only interested in an all-or-nothing answer in the CTE, this could possibly do the job much faster.
This way we also don't need to aggregate (no GROUP BY necessary).
I also remember whether any characters or locations were found and only revisit tables with actual matches.

0人赞添加讨论(0) 举报