Optimising MySQL queries with heavy joins

2019-02-18 23:46发布

问题:

I currently run a site which tracks up-to-the-minute scores and ratings in a list. The list has thousands of entries that are updated frequently, and the list should be sortable by these score and ratings columns.

My SQL for getting this data currently looks like (roughly):

SELECT e.*, SUM(sa.amount) AS score, AVG(ra.rating) AS rating
FROM entries e 
LEFT JOIN score_adjustments sa ON sa.entry_id = e.id
    HAVING sa.created BETWEEN ... AND ... 
LEFT JOIN rating_adjustments ra ON ra.entry_id = e.id
    HAVING ra.rating > 0 
ORDER BY score 
LIMIT 0, 10

Where the tables are (simplified):

entries:
    id: INT(11) PRIMARY
    ...other data...

score_adjustments:
    id: INT(11), PRIMARY
    entry_id: INT(11), INDEX, FOREIGN KEY (entries.id)
    created: DATETIME
    amount: INT(4)

rating_adjustments:
    id: INT(11), PRIMARY
    entry_id: INT(11), INDEX, FOREIGN KEY (entries.id)
    rating: DOUBLE

There are approx 300,000 score_adjustments entries and they grow at about 5,000 a day. The rating_adjustments is about 1/4 that.

Now, I'm no DBA expert but I'm guessing calling SUM() and AVG() all the time isn't a good thing - especially when sa and ra contain hundreds of thousands of records - right?

I already do caching on the query, but I want the query itself to be fast - yet still as up to date as possible. I was wondering if anyone could share any solutions to optimise heavy join/aggregation queries like this? I'm willing to make structural changes if necessary.

EDIT 1

Added more info about the query.

回答1:

Your data is badly clustered.

InnoDB will store rows with "close" PKs physically close together. Since your child tables use surrogate PKs, their rows will be stored in effect randomly. When the time comes to make calculations for the given row in the "master" table, DBMS must jump all over the place to gather the related rows from the child tables.

Instead of surrogate keys, try using more "natural" keys, with the parent's PK in the leading edge, similar to this:

score_adjustments:
    entry_id: INT(11), FOREIGN KEY (entries.id)
    created: DATETIME
    amount: INT(4)
    PRIMARY KEY (entry_id, created)

rating_adjustments:
    entry_id: INT(11), FOREIGN KEY (entries.id)
    rating_no: INT(11)
    rating: DOUBLE
    PRIMARY KEY (entry_id, rating_no)

NOTE: This assumes created's resolution is fine enough and the rating_no was added to allow multiple ratings per entry_id. This is just an example - you may vary the PKs according to your needs.

This will "force" rows belonging to the same entry_id to be stored physically close together, so a SUM or AVG can be calculated by just a range scan on the PK/clustering key and with very few I/Os.

Alternatively (e.g. if you are using MyISAM that doesn't support clustering), cover the query with indexes so the child tables are not touched during querying at all.


On top of that, you could denormalize your design, and cache the current results in the parent table:

  • Store SUM(score_adjustments.amount) as a physical field and adjust it via triggers every time a row is inserted, updated or deleted from score_adjustments.
  • Store SUM(rating_adjustments.rating) as "S" and COUNT(rating_adjustments.rating) as "C". When a row is added to rating_adjustments, add it to S and increment C. Calculate S/C at run-time to get the average. Handle updates and deletes similarly.


回答2:

If you're worried about performance you could add the score and rating columns to the corresponding tables and update them on insert or update to the referenced tables using a trigger. This would cache the new results every time they are updated and you won't have to recalculate them every time, significantly reducing the amount of joining needed to get the results... just guessing but in most cases the results of your query are probably much more often fetched than updated.

Check out this sql fiddle http://sqlfiddle.com/#!2/b7101/1 to see how to make the triggers and their effect, I only added triggers on insert, you can add update triggers just as easily, if you ever delete data add triggers for delete as well.

Didn't add the datetime field, if the between ... and ... parameters change often you might have to still do that manually every time, otherwise you can just add the between clause to the score_update trigger.