Optimizing ORDER BY

2019-02-25 04:59发布

问题:

I am trying to optimize this query that sorts posts by reputation field (1st) and then id field (2nd). Without 1st field query takes ~0.250sec, but with it takes up to ~2.500sec (means 10x times slower, terrible). Any suggestion?

SELECT -- everything is ok here
FROM posts AS p
ORDER BY 
    -- 1st: sort by reputation if exists (1 reputation = 1 day)
    (CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
        THEN +p.reputation ELSE NULL END) DESC, -- also used 0 instead of NULL
    -- 2nd: sort by id dec
    p.id DESC
WHERE p.status = 'published' -- the only thing for filter
LIMIT 0,10 -- limit provided as well

Notes:
- Using InnoDB (MySQL 5.7.19)
- Primary is id on posts table
- Fields are indexed both created_at and reputation

Explain result:

# id,  select_type, table, partitions, type,  possible_keys, key,  key_len, ref,  rows,    filtered, Extra
# '1', 'SIMPLE',    'p',   NULL,       'ALL', NULL,          NULL, NULL,    NULL, '31968', '100.00', 'Using filesort'

UPDATE^^

Reputation provides that: A post, how many (n=reputation) day could be shown on the top of list.

Actually, I was trying to give reputations to some posts that could be fetched on the top of list, and find that solution: Order posts by "rep" but only for "one" day limit. But after some time (about 2 years) that solution became a problem now due of increased volume of table data. If I can not resolve this, then I should remove that feature from the service.

UPDATE^^

-- all date's are unix timestamp (bigint)
SELECT p.*
    , u.name user_name, u.status user_status
    , c.name city_name, t.name town_name, d.name dist_name
    , pm.meta_name, pm.meta_email, pm.meta_phone
    -- gets last comment as json
    , (SELECT concat("{", 
        '"id":"', pc.id, '",', 
        '"content":"', replace(pc.content, '"', '\\"'), '",', 
        '"date":"', pc.date, '",', 
        '"user_id":"', pcu.id, '",', 
        '"user_name":"', pcu.name, '"}"') last_comment_json 
        FROM post_comments pc 
        LEFT JOIN users pcu ON (pcu.id = pc.user_id) 
        WHERE pc.post_id = p.id
        ORDER BY pc.id DESC LIMIT 1) AS last_comment
FROM posts p
    -- no issues with these
    LEFT JOIN users u ON (u.id = p.user_id)
    LEFT JOIN citys c ON (c.id = p.city_id)
    LEFT JOIN towns t ON (t.id = p.town_id)
    LEFT JOIN dists d ON (d.id = p.dist_id)
    LEFT JOIN post_metas pm ON (pm.post_id = p.id)
WHERE p.status = 'published'
GROUP BY p.id
ORDER BY 
    -- everything okay until here
    -- any other indexed fields makes query slow, not just "case" part
    (CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
        THEN +p.reputation ELSE NULL END) DESC, 
    -- only id field (primary) is effective, no other indexes 
    p.id DESC
LIMIT 0,10;

Explain;

# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1, PRIMARY, p, , ref, PRIMARY,user_id,status,reputation,created_at,city_id-town_id-dist_id,title-content, status, 1, const, 15283, 100.00, Using index condition; Using temporary; Using filesort
# dunno, these join's are not using, but if i remove returning fields from select part show "Using index condition"
1, PRIMARY, u, , eq_ref, PRIMARY, PRIMARY, 2, p.user_id, 1, 100.00, 
1, PRIMARY, c, , eq_ref, PRIMARY, PRIMARY, 1, p.city_id, 1, 100.00, 
1, PRIMARY, t, , eq_ref, PRIMARY, PRIMARY, 2, p.town_id, 1, 100.00, 
1, PRIMARY, d, , eq_ref, PRIMARY, PRIMARY, 2, p.dist_id, 1, 100.00, 
1, PRIMARY, pp, , eq_ref, PRIMARY, PRIMARY, 2, p.id, 1, 100.00, 
2, DEPENDENT SUBQUERY, pc, , ref, post_id,visibility,status, post_id, 2, func, 2, 67.11, Using index condition; Using where; Using filesort
2, DEPENDENT SUBQUERY, pcu, , eq_ref, PRIMARY, PRIMARY, 2, pc.user_id, 1, 100.00, 

回答1:

This is a very interesting query. During its optimisation you may discover and understand a lot of new information about how MySQL works. I am not sure that I will have time to write everything in details at once, but I can gradually update.

Why it is slow

There are basically two scenarios: a quick and a slow.

In a quick scenario you are walking in some predefined order over a table and probably at the same time quickly fetch some data by id for each row from other tables. It this case you stop walking as soon as you have enough rows specified by your LIMIT clause. Where does the order come from? From a b-tree index that you have on the table or the order of a result set in a subquery.

In a slow scenario you do not have that predefined order, and MySQL has to implicitly put all data into a temporary table, sort the table on some field and return the n rows from your LIMIT clause. If any of the fields that you put into that temporary table is of type TEXT (not VARCHAR), MySQL does not even try to keep that table in RAM and flushes and sorts it on disk (hence additional IO processing).

First thing to fix

There are many situations when you can not build an index that will allow you to follow its order (when you ORDER BY columns from different tables, for example), so the rule of thumb in such situations is to minimise the data that MySQL will put in the temporary table. How can you do it? You select only identifiers of the rows in a subquery and after you have the ids, you join the ids to the table itself and other tables to fetch the content. That is you make a small table with an order and then use the quick scenario. (This slightly contradicts to SQL in general, but each flavor of SQL has its own means to optimise queries that way).

Coincidentally, your SELECT -- everything is ok here looks funny, since it is the first place where it is not ok.

SELECT p.*
    , u.name user_name, u.status user_status
    , c.name city_name, t.name town_name, d.name dist_name
    , pm.meta_name, pm.meta_email, pm.meta_phone
    , (SELECT concat("{", 
        '"id":"', pc.id, '",', 
        '"content":"', replace(pc.content, '"', '\\"'), '",', 
        '"date":"', pc.date, '",', 
        '"user_id":"', pcu.id, '",', 
        '"user_name":"', pcu.name, '"}"') last_comment_json 
        FROM post_comments pc 
        LEFT JOIN users pcu ON (pcu.id = pc.user_id) 
        WHERE pc.post_id = p.id
        ORDER BY pc.id DESC LIMIT 1) AS last_comment
FROM (
    SELECT id
    FROM posts p
    WHERE p.status = 'published'
    ORDER BY 
        (CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
            THEN +p.reputation ELSE NULL END) DESC, 
        p.id DESC
    LIMIT 0,10
) ids
JOIN posts p ON ids.id = p.id  -- mind the join for the p data
LEFT JOIN users u ON (u.id = p.user_id)
LEFT JOIN citys c ON (c.id = p.city_id)
LEFT JOIN towns t ON (t.id = p.town_id)
LEFT JOIN dists d ON (d.id = p.dist_id)
LEFT JOIN post_metas pm ON (pm.post_id = p.id)
;

That is the first step, but even now you can see that you do not need to make these useless LEFT JOINS and json serialisations for the rows you do not need. (I skipped GROUP BY p.id, because I do not see which LEFT JOIN might result in several rows, you do not do any aggregation).

yet to write about:

  • indexes
  • reformulate CASE clause (use UNION ALL)
  • probably forcing an index


回答2:

Here's your problem:

  • "ORDER BY expression": the expression has to be computed for each row in the table, then the sort done on the whole table, then the results go through the LIMIT.
  • No index use: "ORDER BY col" when "col" is part of an index can eliminate the sort by going through the index in-order. This is very efficient when using LIMIT. However, it will not work here.

There are ways out of this mess, but you will need to tell how many different levels of "reputation" you have (like 3, or like "a lot") and how they are statistically distributed (like, 1 user with reputation 100 and the rest all have zero, or evenly distributed).

EDIT

Hmm, no information on the statistical distribution of "reputation" or its possible range of values. In this case, let's go with the blunt approach:

Let's add a column "repdate" which contains:

repdate = p.created_at + INTERVAL p.reputation DAY

This corresponds to shifting posts one day into the future for each reputation point they have. They will then sort accordingly. Adjust to taste if p.created_at is not a DATETIME.

Now, we can simply "ORDER BY repdate DESC" and with an index on it, it will be fast.



回答3:

Perhaps an index with columns: id, reputation, created_at could help to speed up a bit, That would be the easiest solution, if you did not try that yet. The DBMS would not have to read so much data, to calculate the offset, limit - affected records.



回答4:

select * 
from (
  SELECT -- everything is ok here
  , CASE 
      WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY) 
        THEN + p.reputation ELSE NULL END order_col
  FROM posts AS p
  WHERE p.status = 'published' -- the only thing for filter
  LIMIT 0,10 -- limit provided as well
) a
ORDER BY 
    a.order_col desc
    ,a.id DESC


回答5:

  • Inflate-deflate -- LEFT JOIN inflates the number of rows, GROUP BY then deflates. The inflated number of rows is costly. Instead, focus on getting the ids for the desired rows before doing any JOINing. With luck, you can get rid of the GROUP BY.

  • WP schema -- This is an EAV schema which sucks when it comes to performance and scaling.

  • What indexes do you have? See this for how to improve the meta table.

  • Complex ORDER BY. This leads to gathering all the rows (after filtering) before sorting and doing the LIMIT. Rethink the ORDER BY clause, if possible.

After you have done what you can with my suggestions, start another Question to continue the refinement. Be sure to include EXPLAIN SELECT ... and SHOW CREATE TABLE.