I am trying to optimize this query that sorts posts
by reputation
field (1st) and then id
field (2nd). Without 1st field query takes ~0.250sec, but with it takes up to ~2.500sec (means 10x times slower, terrible). Any suggestion?
SELECT -- everything is ok here
FROM posts AS p
ORDER BY
-- 1st: sort by reputation if exists (1 reputation = 1 day)
(CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY)
THEN +p.reputation ELSE NULL END) DESC, -- also used 0 instead of NULL
-- 2nd: sort by id dec
p.id DESC
WHERE p.status = 'published' -- the only thing for filter
LIMIT 0,10 -- limit provided as well
Notes:
- Using InnoDB (MySQL 5.7.19)
- Primary is id
on posts
table
- Fields are indexed both created_at
and reputation
Explain result:
# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
# '1', 'SIMPLE', 'p', NULL, 'ALL', NULL, NULL, NULL, NULL, '31968', '100.00', 'Using filesort'
UPDATE^^
Reputation provides that: A post, how many (n=reputation) day could be shown on the top of list.
Actually, I was trying to give reputations to some posts that could be fetched on the top of list, and find that solution: Order posts by "rep" but only for "one" day limit. But after some time (about 2 years) that solution became a problem now due of increased volume of table data. If I can not resolve this, then I should remove that feature from the service.
UPDATE^^
-- all date's are unix timestamp (bigint)
SELECT p.*
, u.name user_name, u.status user_status
, c.name city_name, t.name town_name, d.name dist_name
, pm.meta_name, pm.meta_email, pm.meta_phone
-- gets last comment as json
, (SELECT concat("{",
'"id":"', pc.id, '",',
'"content":"', replace(pc.content, '"', '\\"'), '",',
'"date":"', pc.date, '",',
'"user_id":"', pcu.id, '",',
'"user_name":"', pcu.name, '"}"') last_comment_json
FROM post_comments pc
LEFT JOIN users pcu ON (pcu.id = pc.user_id)
WHERE pc.post_id = p.id
ORDER BY pc.id DESC LIMIT 1) AS last_comment
FROM posts p
-- no issues with these
LEFT JOIN users u ON (u.id = p.user_id)
LEFT JOIN citys c ON (c.id = p.city_id)
LEFT JOIN towns t ON (t.id = p.town_id)
LEFT JOIN dists d ON (d.id = p.dist_id)
LEFT JOIN post_metas pm ON (pm.post_id = p.id)
WHERE p.status = 'published'
GROUP BY p.id
ORDER BY
-- everything okay until here
-- any other indexed fields makes query slow, not just "case" part
(CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY)
THEN +p.reputation ELSE NULL END) DESC,
-- only id field (primary) is effective, no other indexes
p.id DESC
LIMIT 0,10;
Explain;
# id, select_type, table, partitions, type, possible_keys, key, key_len, ref, rows, filtered, Extra
1, PRIMARY, p, , ref, PRIMARY,user_id,status,reputation,created_at,city_id-town_id-dist_id,title-content, status, 1, const, 15283, 100.00, Using index condition; Using temporary; Using filesort
# dunno, these join's are not using, but if i remove returning fields from select part show "Using index condition"
1, PRIMARY, u, , eq_ref, PRIMARY, PRIMARY, 2, p.user_id, 1, 100.00,
1, PRIMARY, c, , eq_ref, PRIMARY, PRIMARY, 1, p.city_id, 1, 100.00,
1, PRIMARY, t, , eq_ref, PRIMARY, PRIMARY, 2, p.town_id, 1, 100.00,
1, PRIMARY, d, , eq_ref, PRIMARY, PRIMARY, 2, p.dist_id, 1, 100.00,
1, PRIMARY, pp, , eq_ref, PRIMARY, PRIMARY, 2, p.id, 1, 100.00,
2, DEPENDENT SUBQUERY, pc, , ref, post_id,visibility,status, post_id, 2, func, 2, 67.11, Using index condition; Using where; Using filesort
2, DEPENDENT SUBQUERY, pcu, , eq_ref, PRIMARY, PRIMARY, 2, pc.user_id, 1, 100.00,
This is a very interesting query. During its optimisation you may discover and understand a lot of new information about how MySQL works. I am not sure that I will have time to write everything in details at once, but I can gradually update.
Why it is slow
There are basically two scenarios: a quick and a slow.
In a quick scenario you are walking in some predefined order over a table and probably at the same time quickly fetch some data by id for each row from other tables. It this case you stop walking as soon as you have enough rows specified by your LIMIT clause. Where does the order come from? From a b-tree index that you have on the table or the order of a result set in a subquery.
In a slow scenario you do not have that predefined order, and MySQL has to implicitly put all data into a temporary table, sort the table on some field and return the n rows from your LIMIT clause. If any of the fields that you put into that temporary table is of type TEXT (not VARCHAR), MySQL does not even try to keep that table in RAM and flushes and sorts it on disk (hence additional IO processing).
First thing to fix
There are many situations when you can not build an index that will allow you to follow its order (when you ORDER BY columns from different tables, for example), so the rule of thumb in such situations is to minimise the data that MySQL will put in the temporary table. How can you do it? You select only identifiers of the rows in a subquery and after you have the ids, you join the ids to the table itself and other tables to fetch the content. That is you make a small table with an order and then use the quick scenario. (This slightly contradicts to SQL in general, but each flavor of SQL has its own means to optimise queries that way).
Coincidentally, your SELECT -- everything is ok here
looks funny, since it is the first place where it is not ok.
SELECT p.*
, u.name user_name, u.status user_status
, c.name city_name, t.name town_name, d.name dist_name
, pm.meta_name, pm.meta_email, pm.meta_phone
, (SELECT concat("{",
'"id":"', pc.id, '",',
'"content":"', replace(pc.content, '"', '\\"'), '",',
'"date":"', pc.date, '",',
'"user_id":"', pcu.id, '",',
'"user_name":"', pcu.name, '"}"') last_comment_json
FROM post_comments pc
LEFT JOIN users pcu ON (pcu.id = pc.user_id)
WHERE pc.post_id = p.id
ORDER BY pc.id DESC LIMIT 1) AS last_comment
FROM (
SELECT id
FROM posts p
WHERE p.status = 'published'
ORDER BY
(CASE WHEN p.created_at >= unix_timestamp(now() - INTERVAL p.reputation DAY)
THEN +p.reputation ELSE NULL END) DESC,
p.id DESC
LIMIT 0,10
) ids
JOIN posts p ON ids.id = p.id -- mind the join for the p data
LEFT JOIN users u ON (u.id = p.user_id)
LEFT JOIN citys c ON (c.id = p.city_id)
LEFT JOIN towns t ON (t.id = p.town_id)
LEFT JOIN dists d ON (d.id = p.dist_id)
LEFT JOIN post_metas pm ON (pm.post_id = p.id)
;
That is the first step, but even now you can see that you do not need to make these useless LEFT JOINS and json serialisations for the rows you do not need. (I skipped GROUP BY p.id
, because I do not see which LEFT JOIN might result in several rows, you do not do any aggregation).
yet to write about:
- indexes
- reformulate CASE clause (use UNION ALL)
- probably forcing an index
Here's your problem:
- "ORDER BY expression": the expression has to be computed for each row in the table, then the sort done on the whole table, then the results go through the LIMIT.
- No index use: "ORDER BY col" when "col" is part of an index can eliminate the sort by going through the index in-order. This is very efficient when using LIMIT. However, it will not work here.
There are ways out of this mess, but you will need to tell how many different levels of "reputation" you have (like 3, or like "a lot") and how they are statistically distributed (like, 1 user with reputation 100 and the rest all have zero, or evenly distributed).
EDIT
Hmm, no information on the statistical distribution of "reputation" or its possible range of values. In this case, let's go with the blunt approach:
Let's add a column "repdate" which contains:
repdate = p.created_at + INTERVAL p.reputation DAY
This corresponds to shifting posts one day into the future for each reputation point they have. They will then sort accordingly. Adjust to taste if p.created_at is not a DATETIME.
Now, we can simply "ORDER BY repdate DESC" and with an index on it, it will be fast.