Effective query with multiple conditions

Posted 2019-06-14 07:54

I have a database with

books          (primary key: bookID)
characterNames (foreign key: books.bookID) 
locations      (foreign key: books.bookID)

The in-text positions of character names and locations are saved in the corresponding tables.
Now I want to write a Python script using psycopg2 to find all occurrences of a given character name and a given location in books, where both occur.
At the moment, I execute 4 queries:

SELECT bookID, position FROM characterNames WHERE name='XXX';
--> result is saved in list 'charnames'

SELECT DISTINCT bookID FROM characterNames WHERE name='XXX';
--> result is saved in list 'charnamesIDs'

SELECT bookID, position FROM locations WHERE locName='YYY';
--> result is saved in list 'locs'

SELECT bookID FROM locations WHERE locName='YYY';
--> result is saved in list 'locsIDs'

Both queries could give me bookIDs where only the name OR only the location appears. So my goal is to eliminate all elements of 'charnames' with bookIDs not occurring in 'locs', and the other way round. My approach was:

for cnameTuple in charnames[:]:      # iterate over a copy
    if cnameTuple[0] in locsIDs:     # linear scan of locsIDs on every iteration
        continue
    charnames.remove(cnameTuple)

I made a corresponding loop for the tuples in locs.
This algorithm unfortunately needs a lot of time. Is there a way to perform this task quicker?

2 Answers
三岁会撩人
#2 · 2019-06-14 08:37

You can use a set to see if it speeds up the operation:

>>> xxx = set([(1,'a'), (2,'b')])
>>> xxx
set([(1, 'a'), (2, 'b')])
>>> yyy = set([(1,'a'), (3,'c')])
>>> yyy
set([(1, 'a'), (3, 'c')])
>>> c = xxx.intersection(yyy)
>>> c
set([(1, 'a')])   # common between xxx and yyy
>>> xxx - c
set([(2, 'b')])
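Applied to the lists from the question, this collapses the whole filtering step to one set intersection over bookIDs plus two list comprehensions. The sample data below is made up to match the shape of the query results:

```python
# Hypothetical sample data shaped like the question's query results:
charnames = [(1, 10), (1, 42), (2, 7)]   # (bookID, position) for name 'XXX'
locs      = [(1, 13), (3, 99)]           # (bookID, position) for location 'YYY'

# bookIDs in which BOTH the name and the location occur
common = {bid for bid, _ in charnames} & {bid for bid, _ in locs}

# keep only tuples whose bookID is in the intersection;
# set membership is O(1), unlike scanning a list
charnames = [t for t in charnames if t[0] in common]
locs      = [t for t in locs if t[0] in common]

print(common)     # {1}
print(charnames)  # [(1, 10), (1, 42)]
print(locs)       # [(1, 13)]
```

This also makes the separate 'charnamesIDs' and 'locsIDs' queries unnecessary, since the bookID sets are derived from the tuples already fetched.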
我只想做你的唯一
#3 · 2019-06-14 08:52

This could be much faster and simpler with a query using JOINs.
Something like this:

SELECT b.*, c.position, l.position
FROM   books b
JOIN   characternames c USING (bookid)
JOIN   locations l USING (bookid)
WHERE  c.name = 'XXX'
AND    l.locname = 'YYY';
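From the Python side, this single query replaces all four round trips. A minimal sketch with psycopg2, assuming an already-open connection `conn` (the function name is made up); the search terms are passed as query parameters rather than formatted into the SQL string, which avoids quoting bugs and SQL injection:

```python
# Combined JOIN query; %s placeholders are filled in by psycopg2.
QUERY = """
SELECT b.bookid, c.position AS name_pos, l.position AS loc_pos
FROM   books b
JOIN   characternames c USING (bookid)
JOIN   locations l USING (bookid)
WHERE  c.name = %s
AND    l.locname = %s;
"""

def find_cooccurrences(conn, name, locname):
    """Return (bookid, name_pos, loc_pos) rows for books containing both terms."""
    with conn.cursor() as cur:
        cur.execute(QUERY, (name, locname))
        return cur.fetchall()
```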

More info after comment

"Thousands of books" are no problem at all for an RDBMS like PostgreSQL, which is designed to handle millions. The key to performance with large tables is proper indexes. For the queries here, the following indexes will potentially help:

CREATE INDEX books_bookid_idx ON books(bookid); -- a primary key will do, too

CREATE INDEX cn_bookid_idx ON characternames (bookid);
CREATE INDEX cn_name_idx ON characternames (name);

CREATE INDEX locations_bookid_idx ON locations (bookid);
CREATE INDEX locations_locname_idx ON locations (locname);

Multicolumn indexes may perform even better. Test with EXPLAIN ANALYZE, it will show you which indexes get used and how fast the query is. Creating indexes is very fast, experimenting with them is easy. Just don't keep indexes you don't need. They carry a maintenance cost, too.
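To check which indexes are actually used, prepend EXPLAIN ANALYZE to the query in question, for example the JOIN query above ('XXX' and 'YYY' are the placeholders from the question):

```sql
EXPLAIN ANALYZE
SELECT b.*, c.position, l.position
FROM   books b
JOIN   characternames c USING (bookid)
JOIN   locations l USING (bookid)
WHERE  c.name = 'XXX'
AND    l.locname = 'YYY';
```

Look for Index Scan versus Seq Scan nodes in the plan and at the reported actual times.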


Optimized query

I think I understand now what you are looking for. This query should be optimized to get all positions of locations or names per bookid, but only where name and location show up in the same book, and no further details per book:

WITH b AS (
    SELECT bookid
    FROM   characternames
    WHERE  name = 'XXX'
    GROUP  BY 1
    INTERSECT
    SELECT bookid
    FROM   locations
    WHERE  locname = 'YYY'
    GROUP  BY 1
    )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name = 'XXX'
UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'YYY'
ORDER  BY bookid, position;

Major points

  • The CTE (WITH query) makes sure the base query is only executed once.
  • INTERSECT picks only bookids that feature both location and name.
  • The UNION ALL in the final SELECT returns all found positions. Use UNION instead if you want to trim duplicates with the same position.
  • I order by bookid, position - guessing that is what's needed.
  • Added a column what to tag the source (location or name) of a position.

Further optimization

If search terms appear many times per book, you could considerably speed up the search by creating auxiliary tables with distinct entries for (bookid, term). Create a multicolumn primary key on the two columns and an additional index on just term. Create one such table for locations and another one for names. Keep them up to date with triggers if need be, but I assume the content of books is not changing much. That would simplify and speed up the CTE.
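A minimal sketch of one such auxiliary table, for names (all table and index names here are made up; the locations table would be built the same way):

```sql
-- One row per distinct (bookid, name) pair
CREATE TABLE characternames_distinct AS
SELECT DISTINCT bookid, name
FROM   characternames;

-- Multicolumn primary key plus an index on just the search term
ALTER TABLE characternames_distinct ADD PRIMARY KEY (bookid, name);
CREATE INDEX cnd_name_idx ON characternames_distinct (name);

ANALYZE characternames_distinct;  -- refresh planner statistics
```

The CTE would then intersect these small distinct tables instead of the full position tables.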

If that still isn't fast enough, look into Full Text Search.
