可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I have a fairly stable directed graph of order ~100k vertices and size ~1k edges. It is two-dimensional insofar as its vertices can be identified by a pair of integers (x, y)
(of cardinality ~100 x ~1000) and all edges are strictly increasing in x
.
There is furthermore a dictionary of ~1k (key, val)
pairs associated with each vertex.
I am currently storing the graph in a MySQL database across three (InnoDB) tables: a table of vertices (which I don't think is relevant to my question, so I have omitted to include both it and the foreign key constraints that refer to it in my extracts below); a table which holds the dictionaries; and a 'closure table' of connected vertices as described so eloquently by Bill Karwin.
The table of vertex dictionaries is defined as follows:
CREATE TABLE `VertexDictionary` (
`x` smallint(6) unsigned NOT NULL,
`y` smallint(6) unsigned NOT NULL,
`key` varchar(50) NOT NULL DEFAULT '',
`val` smallint(1) DEFAULT NULL,
PRIMARY KEY (`x`, `y` , `key`),
KEY `dict` (`x`, `key`, `val`)
);
and the closure table of connected vertices as:
CREATE TABLE `ConnectedVertices` (
`tail_x` smallint(6) unsigned NOT NULL,
`tail_y` smallint(6) unsigned NOT NULL,
`head_x` smallint(6) unsigned NOT NULL,
`head_y` smallint(6) unsigned NOT NULL,
PRIMARY KEY (`tail_x`, `tail_y`, `head_x`),
KEY `reverse` (`head_x`, `head_y`, `tail_x`),
KEY `fx` (`tail_x`, `head_x`),
KEY `rx` (`head_x`, `tail_x`)
);
There is also a dictionary of (x, key)
pairs such that for each such pair, all vertices identified with that x
have within their dictionaries a value for that key
. This dictionary is stored in a fourth table:
CREATE TABLE `SpecialKeys` (
`x` smallint(6) unsigned NOT NULL,
`key` varchar(50) NOT NULL DEFAULT '',
PRIMARY KEY (`x`),
KEY `xkey` (`x`, `key`)
);
I often wish to extract the set of keys used in the dictionaries of all vertices having a particular x=X
, together with the associated value of any SpecialKeys
connected to the left:
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`ConnectedVertices` AS `c`
JOIN `VertexDictionary` AS `u` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
JOIN `VertexDictionary` AS `v` ON (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
WHERE
`v`.`x` = X
;
for which the EXPLAIN
output is:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE k index PRIMARY,xkey xkey 154 NULL 40 Using index; Using temporary
1 SIMPLE c ref PRIMARY,reverse,fx,rx PRIMARY 2 db.k.x 1 Using where
1 SIMPLE v ref PRIMARY,dict PRIMARY 4 const,db.c.head_y 136 Using index
1 SIMPLE u eq_ref PRIMARY,dict PRIMARY 156 db.c.tail_x,db.c.tail_y,db.k.key 1 Using where
But this query takes ~10s to complete. Been banging my head against a brick wall trying to improve matters, but to no avail.
Can the query be improved, or should I consider a different data structure? Extremely grateful for your thoughts!
UPDATE
I'm still getting nowhere with this, although I did rebuild the tables and found the EXPLAIN
output to be slightly different (as now shown above, the number of rows fetched from v
had increased from 1 to 136!); the query is still taking ~10s to execute.
I really don't understand what's going on here. Queries to obtain all (x, y, SpecialValue)
and all (x, y, key)
tuples are both very fast (~30ms and ~150ms respectively), yet essentially joining the two takes over fifty times longer than their combined time... how can I improve the time taken to perform that join?
Output of SHOW VARIABLES LIKE '%innodb%';
below:
Variable_name Value
------------------------------------------------------------
have_innodb YES
ignore_builtin_innodb ON
innodb_adaptive_flushing ON
innodb_adaptive_hash_index ON
innodb_additional_mem_pool_size 2097152
innodb_autoextend_increment 8
innodb_autoinc_lock_mode 1
innodb_buffer_pool_size 1179648000
innodb_change_buffering inserts
innodb_checksums ON
innodb_commit_concurrency 0
innodb_concurrency_tickets 500
innodb_data_file_path ibdata1:10M:autoextend
innodb_data_home_dir /rdsdbdata/db/innodb
innodb_doublewrite ON
innodb_fast_shutdown 1
innodb_file_format Antelope
innodb_file_format_check Barracuda
innodb_file_per_table ON
innodb_flush_log_at_trx_commit 1
innodb_flush_method O_DIRECT
innodb_force_recovery 0
innodb_io_capacity 200
innodb_lock_wait_timeout 50
innodb_locks_unsafe_for_binlog OFF
innodb_log_buffer_size 8388608
innodb_log_file_size 134217728
innodb_log_files_in_group 2
innodb_log_group_home_dir /rdsdbdata/log/innodb
innodb_max_dirty_pages_pct 75
innodb_max_purge_lag 0
innodb_mirrored_log_groups 1
innodb_old_blocks_pct 37
innodb_old_blocks_time 0
innodb_open_files 300
innodb_read_ahead_threshold 56
innodb_read_io_threads 4
innodb_replication_delay 0
innodb_rollback_on_timeout OFF
innodb_spin_wait_delay 6
innodb_stats_method nulls_equal
innodb_stats_on_metadata ON
innodb_stats_sample_pages 8
innodb_strict_mode OFF
innodb_support_xa ON
innodb_sync_spin_loops 30
innodb_table_locks ON
innodb_thread_concurrency 0
innodb_thread_sleep_delay 10000
innodb_use_sys_malloc ON
innodb_version 1.0.16
innodb_write_io_threads 4
回答1:
Without spending time testing it, you provided an incomplete example?
you should definitely try reordering of joined tables. Explain output provides some info, let's say ordering by key_len should be heuristically fastest. First table to be filtered on should be listed as last in case the optimizer is not able to figure that out, I believe.
So, let's say 'c, v, k, u' order is the best.
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`VertexDictionary` AS `u`
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
JOIN `VertexDictionary` AS `v`
JOIN `ConnectedVertices` AS `c` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
AND (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
`v`.`x` = X
;
'rows' would suggest 'c/u, k, v' order, but that depends on data:
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`VertexDictionary` AS `u`
JOIN `VertexDictionary` AS `v`
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
JOIN `ConnectedVertices` AS `c` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
AND (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
`v`.`x` = X
;
Hope this helps.
UPDATE (avoiding the varchar join):
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`ConnectedVertices` AS `c`
JOIN `VertexDictionary` AS `u` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
JOIN `VertexDictionary` AS `v` ON (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
(`u`.`x`, `u`.`key`) IN (SELECT `k`.`x`, `k`.`key` FROM `SpecialKeys` AS `k`)
AND
`v`.`x` = X
;
回答2:
Others might disagree, but I've had and regularly offer STRAIGHT_JOIN for queries... Once you KNOW the data and the relationships. Being that your WHERE clause is against the "V" table alias and it's "x" value, you are good with the index. Move THAT to the front position, then join from that.
SELECT STRAIGHT_JOIN DISTINCT
v.`key`,
u.`val`
FROM
VertexDictionary AS v
JOIN ConnectedVertices AS c
ON v.x = c.head_x
AND v.y = c.head_y
JOIN VertexDictionary AS u
ON c.tail_x = u.x
AND c.tail_y = u.y
JOIN SpecialKeys AS k
ON u.x = k.x
AND u.key = k.key
WHERE
v.x = {some value}
Curious to know how this realignment works for you
回答3:
Try rebuilding the query in stages; or at least give us some more points to identify where the bottlenecks are. Some combinations of the following queries should give you reasonable performance, if it is possible with out modifying the schema or data set.
What is the number of rows and exec times for the following queries for getting the list of suitable tail verticies (ie, that have a SpecialKey)
SELECT -- DISTINCT
vd.x as tail_x, vd.y as tail_y, vd.val
FROM
VertexDictionary vd
WHERE
EXISTS (
SELECT
1
FROM
SpecialKeys sk
WHERE
vd.x = sk.x
AND
vd.key = sk.key
)
or
SELECT -- DISTINCT
vd.x as tail_x, vd.y as tail_y, vd.val
FROM
VertexDictionary vd
JOIN
SpecialKeys sk
ON
vd.x = sk.x
AND
vd.key = sk.key
or
SELECT -- DISTINCT
vd.x as tail_x, vd.y as tail_y, vd.val
FROM
VertexDictionary vd
WHERE
(vd.x, vd.key) IN (SELECT x, key FROM SpecialKeys)
-- also could try vd.key IN (SELECT sk.key FROM SpecialKeys sk WHERE sk.x = vd.x)
I'm hoping that one of these returns either small result set, or are at least quick to produce results. if low cardinality & large results apply distinct.
pick the best one from the previous two queries, and add to the next step: joining these suitable 'tails' to 'suitable heads'
SELECT -- DISTINCT
cv.head_y as y,
tv.val
FROM
(
-- ADD SUB QUERY HERE also try nesting the subquery like: (select tail_x, tail_y, val from ([SUBQUERY]) as sq)
) as tv -- tail verticies
JOIN
ConnectedVerticies cv
ON
cv.tail_x = tv.tail_x
AND
cv.tail_y = tv.tail_y
WHERE
cv.head_x = X -- lets reduce the result set here.
Again, I'm hoping that one of these returns either small result set, or are at least quick to produce results. if low cardinality & large results apply distinct.
If it's falling over at this point, well there's not much hope of it getting faster to apply the last phase, and best to try a different approach.
As head x is known from the earlier query, we now just need to join on head_y and X to get v.key
SELECT DISTINCT
inner_query.val,
head.key
FROM
(
-- previous nested subquery behemoth here, again, try a few things that might work.
) as inner_query
JOIN
VertexDictionary as head
ON
head.x = X
AND
head.y = inner_query.y
Another approach, is to get a list of head.key, tail_x, and tail_y from
SELECT -- DISTINCT
cv.tail_x as x,
cv.tail_y as y,
vd.key
FROM
VertexDictionary vd
JOIN
ConnectedVerticies cv
ON
cv.head_x = vd.x
AND
cv.head_y = vd.y
WHERE
vd.head_x = X
How long does this take to execute, with & without distinct? how many results (w & w/o distinct)?
If it's fast and/or small, try using it as a subquery and joining to either another subquery potentiall of SpecialKeys & VertexDictionary if that's small (ie one of the first three queries if they worked well).
回答4:
I suspect your problem is everything with the syntax
(k
.x
, k
.key
) = (u
.x
, u
.key
)
Can you rewrite as?
k.x = y.x and k.key = u.key
When you have a calculation on the left hand side of a clause, the dbms cannot optimize. By setting the comparison as a straight comparison, you may improve your performance.
e.g.
year(my_date) = '2012'
is slower than
'2012' = year(my_date)
I'm not sure if mysql treats the comparison as a column comparison or as a calculation.
Please try modifying your query to do column value comparisons.
Second optimization
Also - you are cross joining 4 tables. Multiplication is not additive - is it exponential. Are you sure this is what you intend? You may be better served getting starting with the smallest result set and then join only that result set to the next set.
select a.c1
from (
select t1.c1
from t1
join t2 on t1.c1 = t2.c1
) a
join t3 on t3.c1 = a.c1
etc...
third optimization
if option 2 helps, you may want to create indexed views and work from those instead of directly from the tables.
fourth optimization
don't use mysql. unless you have a team of dbas constantly monitoring for performance and tweaks, you will run into bad times with mysql. mysql is fine and fast with simple things, but starts sucking very badly if you do anything moderately complex. 4 years ago, i migrated from mysql to sql server express and my 10 minute queries took <2 sec with the same tables, indices, and queries...
if you want open source, postgres is much smarter than mysql as well
Create a view that incorporates the first 3 tables that is indexed on the v.key, u.val fields.
Then run the query off of the 4th table and the view. Make sure the indices are built on the view before running.
回答5:
DISTINCT
is often a bad friend. Try to replace it with a GROUP BY
.
Like this :
SELECT sub.key, sub.val
FROM (
SELECT
v.key,
u.val
FROM
ConnectedVertices AS c
JOIN VertexDictionary AS u ON (u.x, u.y ) = (c.tail_x, c.tail_y)
JOIN VertexDictionary AS v ON (v.x, v.y ) = (c.head_x, c.head_y)
JOIN SpecialKeys AS k ON (k.x, k.key) = (u.x, u.key)
WHERE (v.x = @X)
) AS sub
GROUP BY sub.key, sub.val
UPDATE:
Then try the following query which forces the indexes to use:
SELECT DISTINCT
v.key,
u.val
FROM
ConnectedVertices AS c USE INDEX (fx,rx)
JOIN VertexDictionary AS u USE INDEX (primary) ON (u.x, u.y ) = (c.tail_x, c.tail_y)
JOIN VertexDictionary AS v USE INDEX (primary) ON (v.x, v.y ) = (c.head_x, c.head_y)
JOIN SpecialKeys AS k USE INDEX (primary) ON (k.x, k.key) = (u.x, u.key)
WHERE (v.x = @X)
If it still not better, try this one :
SELECT DISTINCT
v.key,
u.val
FROM
ConnectedVertices AS c
JOIN VertexDictionary AS u ON (u.x=c.tail_x) AND (u.y=c.tail_y)
JOIN VertexDictionary AS v ON (v.x=@X) AND (v.y=c.head_y)
JOIN SpecialKeys AS k ON (k.x=u.x) AND (k.key=u.key)
WHERE
v.x = @X
回答6:
i don't think that forcing uses of specifique indexes is a good think.
the Mysql optimiser has often good estimations.
do you have an index on v
.x
?