Question:
I have a very slow (usually close to 60 seconds) MySQL query that tries to find correlations between how users voted on one poll and how they voted on all previous polls.
Basically, we gather the user IDs of everyone who voted for one particular option in a given poll.
Then we see how that subgroup voted on each previous poll, and compare those results to how EVERYONE (not just the subgroup) voted on that poll. The difference between the subgroup results and the total results is the deviation, and this query sorts by deviation to determine the strongest correlation.
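(As a quick illustration of the deviation: if 60% of the subgroup picked a given option but only 42% of all voters did, that option's deviation is ABS(0.42 - 0.60) = 0.18.)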
The query is kind of a mess:
(SELECT p_id AS poll_id, o_id AS option_id, description, optCount AS option_count,
        subgroup_percent, total_percent,
        ABS(total_percent - subgroup_percent) AS deviation
 FROM (
        SELECT poll_id AS p_id,
               option_id AS o_id,
               (SELECT description FROM `option` WHERE id = o_id) AS description,
               COUNT(*) AS optCount,
               (SELECT COUNT(*) FROM response INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id WHERE option_id = o_id) /
               (SELECT COUNT(*) FROM response INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id WHERE poll_id = p_id) AS subgroup_percent,
               (SELECT COUNT(*) FROM response WHERE option_id = o_id) /
               (SELECT COUNT(*) FROM response WHERE poll_id = p_id) AS total_percent
        FROM response
        INNER JOIN user_ids_122 ON response.user_id = user_ids_122.user_id
        WHERE poll_id < '61'
        GROUP BY option_id DESC
      ) AS derived_table_122
)
ORDER BY deviation DESC, option_count DESC
Note that user_ids_122 is a previously created temporary table containing the IDs of all users who voted for option ID 122.
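(The exact statement isn't shown here; a minimal sketch, assuming the table simply collects everyone who voted for option 122, would be:)

CREATE TEMPORARY TABLE user_ids_122 AS
SELECT DISTINCT user_id
FROM response
WHERE option_id = 122;

-- an index on user_id helps the joins in the query above
ALTER TABLE user_ids_122 ADD INDEX (user_id);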
The "response" table has about 65,000 rows, the "user" table has about 7,000 rows, and the "option" table has about 130 rows.
UPDATE:
Here's the EXPLAIN table ...
id  select_type         table         type    possible_keys      key        key_len  ref                              rows  Extra
1   PRIMARY             <derived2>    ALL     NULL               NULL       NULL     NULL                             121   Using filesort
2   DERIVED             user_ids_122  ALL     NULL               NULL       NULL     NULL                             74    Using temporary; Using filesort
2   DERIVED             response      ref     poll_id,user_id    user_id    4        correlated.user_ids_122.user_id  780   Using where
7   DEPENDENT SUBQUERY  response      ref     poll_id            poll_id    4        func                             7800  Using index
6   DEPENDENT SUBQUERY  response      ref     option_id          option_id  4        func                             7800  Using index
5   DEPENDENT SUBQUERY  user_ids_122  ALL     NULL               NULL       NULL     NULL                             74
5   DEPENDENT SUBQUERY  response      ref     poll_id,user_id    poll_id    4        func                             7800  Using where
4   DEPENDENT SUBQUERY  user_ids_122  ALL     NULL               NULL       NULL     NULL                             74
4   DEPENDENT SUBQUERY  response      ref     user_id,option_id  user_id    4        correlated.user_ids_122.user_id  780   Using where
3   DEPENDENT SUBQUERY  option        eq_ref  PRIMARY            PRIMARY    4        func                             1
UPDATE 2:
Every row in the "response" table looks like this:
id (INT) poll_id (INT) user_id (INT) option_id (INT) created (DATETIME)
7 7 1 14 2011-03-17 09:25:10
Every row in the "option" table looks like this:
id (INT) poll_id (INT) text (TEXT) description (TEXT)
14 7 No people who dislike country music
Every row in the "user" table looks like this:
id (INT) email (TEXT) created (DATETIME)
1 user@example.com 2011-02-15 11:16:03
Answer 1:
Three things:
- You're recalculating the same thing about a zillion and a half times (each subquery really depends only on a couple of parameters that are identical across many rows).
- Aggregates are more efficient in big chunks (JOINs) than in small bits (subqueries).
- MySQL is extremely slow with subqueries.
So, when you compute "vote counts by option_id" (which requires scanning the big table) and then
need "vote counts by poll_id", don't scan the big table again, just reuse the previous results!
You could do that with a ROLLUP.
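For example, a single scan with ROLLUP returns both the per-option counts and the per-poll subtotals (a sketch against the question's schema; the rows where option_id is NULL are the per-poll totals):

SELECT poll_id, option_id, COUNT(*) AS votes
FROM response
WHERE poll_id < 61
GROUP BY poll_id, option_id WITH ROLLUP;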
Here's a query that will do what you need, running on Postgres.
In order to make MySQL do this, you are going to need to replace each "WITH foo AS (SELECT...)" clause with a temporary table. That's easy. MySQL in-memory temp tables are fast, so don't be afraid to use them: they let you reuse results from the previous steps and save a lot of computation (see the sketch after the query plan below).
I've generated random-ish test data, seems to work. Executes in 0.3s...
WITH
-- users of interest : target group
uids AS (
SELECT DISTINCT user_id
FROM options
JOIN responses USING (option_id)
WHERE poll_id=22
),
-- votes of everyone and target group
votes AS (
SELECT poll_id, option_id, sum(all_votes) AS all_votes, sum(target_votes) AS target_votes
FROM (
SELECT option_id, count(*) AS all_votes, count(uids.user_id) AS target_votes
FROM responses
LEFT JOIN uids USING (user_id)
GROUP BY option_id
) v
JOIN options USING (option_id)
GROUP BY poll_id, option_id
),
-- totals for all polls (reuse previous result)
totals AS (
SELECT poll_id, sum(all_votes) AS all_votes, sum(target_votes) AS target_votes
FROM votes
GROUP BY poll_id
),
poll_options AS (
SELECT poll_id, count(*) AS poll_option_count
FROM options
GROUP BY poll_id
)
-- reuse previous tables to get some stats
SELECT *, ABS(total_percent - subgroup_percent) AS deviation
FROM (
SELECT
poll_id,
option_id,
v.target_votes / v.all_votes AS subgroup_percent,
t.target_votes / t.all_votes AS total_percent,
poll_option_count
FROM votes v
JOIN totals t USING (poll_id)
JOIN poll_options po USING (poll_id)
) AS foo
ORDER BY deviation DESC, poll_option_count DESC;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=14910.46..14910.56 rows=40 width=144) (actual time=299.844..299.862 rows=200 loops=1)
Sort Key: (abs(((t.target_votes / t.all_votes) - (v.target_votes / v.all_votes)))), po.poll_option_count
Sort Method: quicksort Memory: 52kB
CTE uids
-> HashAggregate (cost=1801.43..1850.52 rows=4909 width=4) (actual time=3.935..4.793 rows=4860 loops=1)
-> Nested Loop (cost=0.00..1789.16 rows=4909 width=4) (actual time=0.029..2.555 rows=4860 loops=1)
-> Seq Scan on options (cost=0.00..3.50 rows=5 width=4) (actual time=0.008..0.032 rows=5 loops=1)
Filter: (poll_id = 22)
-> Index Scan using responses_option_id_key on responses (cost=0.00..344.86 rows=982 width=8) (actual time=0.012..0.298 rows=972 loops=5)
Index Cond: (public.responses.option_id = public.options.option_id)
CTE votes
-> HashAggregate (cost=13029.43..13032.43 rows=200 width=24) (actual time=298.255..298.317 rows=200 loops=1)
-> Hash Join (cost=13019.68..13027.43 rows=200 width=24) (actual time=297.953..298.103 rows=200 loops=1)
Hash Cond: (public.responses.option_id = public.options.option_id)
-> HashAggregate (cost=13014.18..13017.18 rows=200 width=8) (actual time=297.839..297.879 rows=200 loops=1)
-> Merge Left Join (cost=399.13..11541.43 rows=196366 width=8) (actual time=9.301..230.467 rows=196366 loops=1)
Merge Cond: (public.responses.user_id = uids.user_id)
-> Index Scan using responses_pkey on responses (cost=0.00..8585.75 rows=196366 width=8) (actual time=0.015..121.971 rows=196366 loops=1)
-> Sort (cost=399.13..411.40 rows=4909 width=4) (actual time=9.281..22.044 rows=137645 loops=1)
Sort Key: uids.user_id
Sort Method: quicksort Memory: 420kB
-> CTE Scan on uids (cost=0.00..98.18 rows=4909 width=4) (actual time=3.937..6.549 rows=4860 loops=1)
-> Hash (cost=3.00..3.00 rows=200 width=8) (actual time=0.095..0.095 rows=200 loops=1)
-> Seq Scan on options (cost=0.00..3.00 rows=200 width=8) (actual time=0.007..0.043 rows=200 loops=1)
CTE totals
-> HashAggregate (cost=5.50..8.50 rows=200 width=68) (actual time=298.629..298.640 rows=40 loops=1)
-> CTE Scan on votes (cost=0.00..4.00 rows=200 width=68) (actual time=298.257..298.425 rows=200 loops=1)
CTE poll_options
-> HashAggregate (cost=4.00..4.50 rows=40 width=4) (actual time=0.091..0.101 rows=40 loops=1)
-> Seq Scan on options (cost=0.00..3.00 rows=200 width=4) (actual time=0.005..0.020 rows=200 loops=1)
-> Hash Join (cost=6.95..13.45 rows=40 width=144) (actual time=298.994..299.554 rows=200 loops=1)
Hash Cond: (t.poll_id = v.poll_id)
-> CTE Scan on totals t (cost=0.00..4.00 rows=200 width=68) (actual time=298.632..298.669 rows=40 loops=1)
-> Hash (cost=6.45..6.45 rows=40 width=84) (actual time=0.335..0.335 rows=200 loops=1)
-> Hash Join (cost=1.30..6.45 rows=40 width=84) (actual time=0.140..0.263 rows=200 loops=1)
Hash Cond: (v.poll_id = po.poll_id)
-> CTE Scan on votes v (cost=0.00..4.00 rows=200 width=72) (actual time=0.001..0.030 rows=200 loops=1)
-> Hash (cost=0.80..0.80 rows=40 width=12) (actual time=0.130..0.130 rows=40 loops=1)
-> CTE Scan on poll_options po (cost=0.00..0.80 rows=40 width=12) (actual time=0.093..0.119 rows=40 loops=1)
Total runtime: 300.132 ms
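For MySQL, each of those WITH blocks becomes a temporary table built from the previous ones. As a minimal sketch mapped onto the question's `option`/response schema (my test schema above uses options/responses instead), the first step might look like:

-- target group: everyone who voted on the poll of interest (the "uids" CTE above)
CREATE TEMPORARY TABLE uids ENGINE=MEMORY AS
SELECT DISTINCT r.user_id
FROM `option` o
JOIN response r ON r.option_id = o.id
WHERE o.poll_id = 22;

ALTER TABLE uids ADD INDEX (user_id);

-- "votes", "totals" and "poll_options" translate the same way, each temp table
-- selecting from the ones created before it.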
Answer 2:
I think the confusion in your query is making this harder than it needs to be. I may be very close, but I'll try to walk through what I'm doing. First, it appears you need your denominator on a per-poll basis... so my first query does just that: how many responses per poll (group by poll).
Next, you want to know how many answers were offered per option within each poll. That's what I've done with the second query (group by poll AND option).
Since you are dealing with statistics, it never mattered who answered what; the responses are already in the response table. Who cares what their name is... Likewise, in the second query I don't care about the option description, just the counts.
Now that "pre-query 1" and "pre-query 2" are complete, I can join 1 to 2 on the common poll_id, then join 2 to the option table to get the description you need in your final analysis.
As for the aggregates, at the end of the join, you'll end up with something like
(Result from PreQuery 1 on just the poll counts)
Poll  Count
1     50
2     30

(Result from PreQuery 2 on poll AND option)
Poll  Option  Count
1     1       30
1     2       12
1     3       5
1     4       3
2     5       8
2     6       12
2     7       10

Final join should have
Poll  Option  Description  PerPollAndOption  SubGroup_Percent  PerPollResponse
1     1       Descrip 1    30                .60               50
1     2       Descrip 2    12                .24               50
1     3       Descrip 3    5                 .10               50
1     4       Descrip 4    3                 .06               50
2     5       Descrip 5    8                 .27               30
2     6       Descrip 6    12                .40               30
2     7       Descrip 7    10                .33               30
So, for the final sorting, grouping, etc., you should see a huge simplification, with all the numbers directly available here. There's no need to go through the users, as stated above. If I'm missing something significant, let me know... Maybe this solution will help simplify whatever is left...
SELECT
      ByPoll.poll_id,
      ByPollOption.option_id,
      o.description,
      ByPollOption.PerPollAndOption,
      ByPollOption.PerPollAndOption / ByPoll.PerPollResponse AS SubGroup_Percent,
      ByPoll.PerPollResponse
   FROM
      ( SELECT
              poll_id,
              COUNT(*) AS PerPollResponse
           FROM
              response
           WHERE
              poll_id < '61'
           GROUP BY
              poll_id ) ByPoll
      JOIN ( SELECT r.poll_id,
                    r.option_id,
                    COUNT(*) AS PerPollAndOption
                FROM
                   response r
                   JOIN `option` opt
                      ON r.option_id = opt.id
                WHERE
                   r.poll_id < '61'
                GROUP BY
                   r.poll_id,
                   r.option_id ) ByPollOption
         ON ByPoll.poll_id = ByPollOption.poll_id
      JOIN `option` o
         ON ByPollOption.option_id = o.id
Answer 3:
Try building it up in bite-sized chunks instead:
-- Compute the average you're looking for.
select ..., agg1, agg2, avg(...)
from (
-- Use max() to merge the retrieved aggregates as individual rows.
-- (This will be faster than joins if you're dealing with tons of rows.)
select ..., max(agg1) as agg1, max(agg2) as agg2, ...
from (
-- Compute individual aggregates without nested loops.
select ..., count(*) as agg1, null as agg2, ...
from ...
where ...
group by ...
union all
select ..., null as agg1, count(*) as agg2, ...
from ...
where ...
group by ...
union all
...
) as aggs
group by ...
) as rows
group by ...
If it's still slow after that (I doubt it will be), consider maintaining intermediary results using triggers (if it's being used all the time) or consider using temporary tables (if it's a one-off query that gets fired every so often).
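If you go the trigger route, the idea is just to keep a small counts table in sync on every insert. A hypothetical sketch (the poll_counts table and trigger name are invented for illustration):

CREATE TABLE poll_counts (
  poll_id INT PRIMARY KEY,
  response_count INT NOT NULL DEFAULT 0
);

CREATE TRIGGER response_after_insert
AFTER INSERT ON response
FOR EACH ROW
  INSERT INTO poll_counts (poll_id, response_count)
  VALUES (NEW.poll_id, 1)
  ON DUPLICATE KEY UPDATE response_count = response_count + 1;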
--
Update, following up on the comment. For instance:
(SELECT COUNT(*) FROM response WHERE option_id = o_id) /
(SELECT COUNT(*) FROM response WHERE poll_id = p_id) as total_percent
would be rewritten like:
SELECT [fields you need],
MAX(total_responses_by_option_id) / MAX(total_responses_by_poll_id) as total_percent
FROM (
SELECT [fields you need],
COUNT(*) as total_responses_by_option_id,
NULL as total_responses_by_poll_id
FROM response
[join/where as needed]
GROUP BY [fields you need]
UNION ALL
SELECT [fields you need],
NULL as total_responses_by_option_id,
COUNT(*) as total_responses_by_poll_id
FROM response
[join/where as needed]
GROUP BY [fields you need]
) as agg
GROUP BY [fields you need];
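As a concrete (hypothetical) instance against the question's tables, the same UNION ALL / MAX trick can fetch the per-option counts for everyone and for the subgroup in a single outer pass:

SELECT option_id,
       MAX(all_votes) AS all_votes,
       MAX(target_votes) AS target_votes
FROM (
    -- all votes per option
    SELECT option_id, COUNT(*) AS all_votes, NULL AS target_votes
    FROM response
    GROUP BY option_id
    UNION ALL
    -- votes per option from the subgroup only
    SELECT r.option_id, NULL, COUNT(*)
    FROM response r
    JOIN user_ids_122 u ON r.user_id = u.user_id
    GROUP BY r.option_id
) AS aggs
GROUP BY option_id;
-- options nobody in the subgroup chose come back with target_votes = NULL;
-- wrap in COALESCE(target_votes, 0) if you prefer zeros.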