I have a database with
books (primary key: bookID)
characterNames (foreign key: books.bookID)
locations (foreign key: books.bookID)
The in-text-position of character names and locations are saved in the corresponding tables.
I'm writing a Pythonscript using psycopg2, finding all occurences of given character names and locations in books. I only want the occurences in books, where both the character name AND the location are found.
Here I already got a solution for searching one location and one character:
WITH b AS (
SELECT bookid
FROM characternames
WHERE name = 'XXX'
GROUP BY 1
INTERSECT
SELECT bookid
FROM locations
WHERE l.locname = 'YYY'
GROUP BY 1
)
SELECT bookid, position, 'char' AS what
FROM b
JOIN characternames USING (bookid)
WHERE name = 'XXX'
UNION ALL
SELECT bookid, position, 'loc' AS what
FROM b
JOIN locations USING (bookid)
WHERE locname = 'YYY'
ORDER BY bookid, position;
The CTE 'b' contains all bookid's, where the character name 'XXX' and the location 'YYY' appear.
Now I'm additionally wondering about searching for 2 places and a name (or 2 names and a place respectively). It's simple if all searched entities must occur in one book, but what about this:
Searching for: Tim, Al, Toolshop
Results: books including
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)
The problem could be repeated for 4, 5, 6...conditions.
I thougt about INTERSECTing more subqueries, but that wouldn't work.
Instead I would UNION the found bookIDs, GROUP them and select bookid's occurring more then once:
WITH b AS (
SELECT bookid, count(bookid) AS occurrences
FROM
(SELECT DISTINCT bookid
FROM characterNames
WHERE name='XXX'
UNION
SELECT DISTINCT bookid
FROM characterNames
WHERE name='YYY'
UNION
SELECT DISTINCT bookid
FROM locations
WHERE locname='ZZZ'
GROUP BY bookid)
WHERE occurrences>1)
I think this works, can't test it at the moment, but is it the best way to do this?
The idea to use a count for the generalized case is sound. A couple of adjustments to the syntax, though:
Notes
Use
UNION ALL
(notUNION
) to preserve duplicates between the subqueries. You want them in this case to be able to count them.The subqueries are supposed to produces distinct values. It works with
DISTINCT
the way you have it. You may want to tryGROUP BY 1
instead and see if that performs better (I don't expect it to.)The
GROUP BY
hast to go outside the subquery. It would only be applied to the last subquery and makes no sense there as you haveDISTINCT bookid
already.The check whether there are more than one hits on a book has to go into a
HAVING
clause:You can not use aggregated values in a
WHERE
clause.Combining conditions on one table
You cannot simply combine multiple conditions on one table. How will you count the number of findings? But there is a somewhat more sophisticated way. May or may not improve performance, You'll have to test (with
EXPLAIN ANALYZE
). Both queries require at least two index scans for the tablecharacterNames
. At least it shortens the syntax.Consider how I compute the number of hits for
characterNames
and how I changed tosum(hits)
in the outerSELECT
:Converting a
boolean
tointeger
gives0
forFALSE
and1
forTRUE
. That helps.Faster with EXISTS
While riding my bike to my company this thing kept kicking at the back of my head. I have reason to believe this query might be even faster. Please give it a try:
The
EXISTS
semi-join can stop executing at the first match. As we are only interested in an all-or-nothing answer in the CTE, this could possibly do the job much faster.This way we also don't need to aggregate (no
GROUP BY
necessary).I also remember whether any characters or locations were found and only revisit tables with actual matches.