Select query with three where conditions is slow,

2019-02-10 20:31发布

问题:

I have the following query:

SELECT table_1.id

FROM
table_1
LEFT JOIN table_2 ON (table_1.id = table_2.id)

WHERE
table_1.col_condition_1 = 0
AND table_1.col_condition_2 NOT IN (3, 4)
AND (table_2.id is NULL OR table_1.date_col > table_2.date_col)

LIMIT 5000;

And I have the following keys and indexes:

  • table_1.id primary key.
  • index on table_1.col_condition_1
  • index on table_1.col_condition_2
  • composite index on table_1.col_condition_1 and table_1.col_condition_2

The correct indexes are getting picked up. Query explain:

+--+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------------+---------+------------+----------+-----------------------+--+
|  | id | select_type |  table  |  type  |                            possible_keys                            |          key          | key_len |    ref     |   rows   |         Extra         |  |
+--+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------------+---------+------------+----------+-----------------------+--+
|  |  1 | SIMPLE      | table_1 | range  | "the composite index", col_condition_1 index ,col_condition_2 index | "the composite index" |       7 |            | 11819433 | Using index condition |  |
|  |  1 | SIMPLE      | table_2 | eq_ref | PRIMARY,id_UNIQUE                                                   | PRIMARY               |       8 | table_1.id |        1 | Using where           |  |
+--+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------------+---------+------------+----------+-----------------------+--+

table_1 has ~60 MM records, and table_2 has ~4 MM records.

The query takes 60 seconds to return a result.

What's interesting is that:

SELECT table_1.id

FROM
table_1
LEFT JOIN table_2 ON (table_1.id = table_2.id)

WHERE
table_1.col_condition_1 = 0
AND table_1.col_condition_2 NOT IN (3, 4)

LIMIT 5000;

takes 145 ms to return a result and has the same indexes picked as the first query.

SELECT table_1.id

FROM
table_1
LEFT JOIN table_2 ON (table_1.id = table_2.id)

WHERE
table_1.col_condition_1 = 0
AND (table_2.id is NULL OR table_1.date_col > table_2.date_col)

LIMIT 5000;

takes 174 ms to return a result.

Query explain:

+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------+---------+------------+----------+-------------+
| id | select_type |  table  |  type  |                            possible_keys                            |       key       | key_len |    ref     |   rows   |    Extra    |
+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------+---------+------------+----------+-------------+
|  1 | SIMPLE      | table_1 | ref    | "the composite index", col_condition_1 index ,col_condition_2 index | col_condition_1 |       2 | const      | 30381842 | NULL        |
|  1 | SIMPLE      | table_2 | eq_ref | PRIMARY,id_UNIQUE                                                   | PRIMARY         |       8 | table_1.id |        1 | Using where |
+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------+---------+------------+----------+-------------+

And

SELECT table_1.id

FROM
table_1
LEFT JOIN table_2 ON (table_1.id = table_2.id)

WHERE
table_1.col_condition_2 NOT IN (3, 4)
AND (table_2.id is NULL OR table_1.date_col > table_2.date_col)

LIMIT 5000;

takes about 1 second to return a result.

Query explain:

+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------+---------+------------+----------+-----------------------+
| id | select_type |  table  |  type  |                            possible_keys                            |       key       | key_len |    ref     |   rows   |         Extra         |
+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------+---------+------------+----------+-----------------------+
|  1 | SIMPLE      | table_1 | range  | "the composite index", col_condition_1 index ,col_condition_2 index | col_condition_2 |       5 |            | 36254294 | Using index condition |
|  1 | SIMPLE      | table_2 | eq_ref | PRIMARY,id_UNIQUE                                                   | PRIMARY         |       8 | table_1.id |        1 | Using where           |
+----+-------------+---------+--------+---------------------------------------------------------------------+-----------------+---------+------------+----------+-----------------------+

Also when I use every where condition separately, the query returns a result in ~100 ms.

My question is why the query takes a significant amount of time (60 seconds) to return a result when using the three where conditions together even though it looks like the correct indexes are getting used and executing the query with any two of the three where conditions also returns a result in much less time.

Also, is there a way to optimize this query?

Thank you.

EDIT:

create tables:

table_1:

CREATE TABLE `table_1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `col_condition_1` tinyint(1) DEFAULT '0',
  `col_condition_2` int(11) DEFAULT NULL,
  `date_col` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `compositeidx` (`col_condition_1`,`col_condition_2`),
  KEY `col_condition_1_idx` (`col_condition_1`),
  KEY `col_condition_2_idx` (`col_condition_2`)
) ENGINE=InnoDB AUTO_INCREMENT=68272192 DEFAULT CHARSET=utf8

table_2:

CREATE TABLE `table_2` (
  `id` bigint(20) NOT NULL,
  `date_col` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `id_UNIQUE` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

回答1:

Try to split the existing SQL in two parts and see what are the execution times for each. This would hopefully give you what part is responsible for the slowness:

part 1:

SELECT table_1.id
  FROM table_1
  LEFT JOIN table_2
    ON (table_1.id = table_2.id)
 WHERE table_1.col_condition_1 = 0
   AND table_1.col_condition_2 NOT IN (3, 4)
   AND table_2.id is NULL

and part 2 (note the inner join here):

SELECT table_1.id
  FROM table_1
  JOIN table_2
    ON (table_1.id = table_2.id)
 WHERE table_1.col_condition_1 = 0
   AND table_1.col_condition_2 NOT IN (3, 4)
   AND table_1.date_col > table_2.date_col

I expect the part 2 would be the one to take longer. In this I think an index on both table_1 and table_2 on date_coll would help.

I don't think the composite index would help at all in your select.

This said it is hard to diagnose why the three conditions together would impact the performance that badly. It seems to be related to your data distribution. Not sure about mySql but in Oracle a statistics collections on those tables would make a difference.

Hope it helps.



回答2:

  • OR is a performance killer.
  • Sometimes using UNION instead of OR can speed up the query.
  • Perhaps in one case the 5000 were "near the beginning" of the combined tables, but not in the other case.
  • Using LIMIT without ORDER BY is dubious.
  • Since a PK is a Unique key, it is redundant to also declare id_UNIQUE.
  • INDEX(a) is unnecessary when you also have INDEX(a,b).
  • If there are only 4 values, IN (1, 2) might be faster than NOT IN (3, 4).
  • It is unusual to have two tables sharing the same PK. Why do you have a 1:1 relationship?
  • We might have further insight if we could see the real column names.


回答3:

Problems like this tend to require trying things and testing to see how well they work.

As such, start with this:

SELECT
table_1.id
FROM
table_1
LEFT JOIN table_2
ON table_1.id = table_2.id
AND table_1.date_col <= table_2.date_col
WHERE
table_1.col_condition_1 = 0
AND table_1.col_condition_2 NOT IN (3, 4)
AND table_2.id is NULL

LIMIT 5000;

Logical reasoning on why this is equivalent to your query: Your original query's WHERE statement of (table_2.id is NULL OR table_1.date_col > table_2.date_col) can be summarized as "Only include table_1 records that either do NOT have a table_2 record, or where the table_2 record is earlier than (or equal to) the table_1 record.

My version of the query uses an anti-join to exclude all table_1 records where they exists a table_2 that is earlier than (or equal to) the table_1 record.

Indexes

There are a number of possible composite indexes that may help this query. Here are a couple to start with:

For table_2: (id,date_col)

For table_1: (col_condition_1,id,date_col,col_condition_2)

Please try my query and indexes, and report the results (including EXPLAIN plan).