Why doesn't MySql automatically optimises BETW

2019-07-16 04:18发布

问题:

I have two query for same output

Slow Query:

SELECT 
    *
FROM
    account_range
WHERE
    is_active = 1 AND '8033576667466317' BETWEEN range_start AND range_end;

Execution Time: ~800 ms.

Explain:

+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+
| id | select_type | table         | partitions | type | possible_keys                             | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | account_range | NULL       | ALL  | range_start,range_end,range_se_active_idx | NULL | NULL    | NULL | 940712 |     2.24 | Using where |
+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+

Very Fast Query: learnt from here

SELECT 
    *
FROM
    account_range
WHERE
    is_active = 1 AND 
    range_start = (SELECT 
            MAX(range_start)
        FROM
            account_range
        WHERE
            range_start <= '8033576667466317') AND 
    range_end = (SELECT 
            MIN(range_end)
        FROM
            account_range
        WHERE
            range_end >= '8033576667466317')

Execution Time: ~1ms

Explain:

+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+
| id | select_type | table         | partitions | type | possible_keys                             | key                 | key_len | ref               | rows | filtered | Extra                        |
+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+
|  1 | PRIMARY     | account_range | NULL       | ref  | range_start,range_end,range_se_active_idx | range_se_active_idx | 125     | const,const,const |    1 |   100.00 | NULL                         |
|  3 | SUBQUERY    | NULL          | NULL       | NULL | NULL                                      | NULL                | NULL    | NULL              | NULL |     NULL | Select tables optimized away |
|  2 | SUBQUERY    | NULL          | NULL       | NULL | NULL                                      | NULL                | NULL    | NULL              | NULL |     NULL | Select tables optimized away |
+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+

Table Structure:

CREATE TABLE account_range (
    id int(11) unsigned NOT NULL AUTO_INCREMENT,
    range_start varchar(20) NOT NULL,
    range_end varchar(20) NOT NULL,
    is_active tinyint(1) NOT NULL,
    bank_name varchar(100) DEFAULT NULL,
    addedon timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updatedon timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    description text,
    PRIMARY KEY (id),
    KEY range_start (range_start),
    KEY range_end (range_end),
    KEY range_se_active_idx (range_start , range_end , is_active)
)  ENGINE=InnoDB AUTO_INCREMENT=946132 DEFAULT CHARSET=utf8;

Please do explain Why doesn't MySql automatically optimizes BETWEEN query?

Update:
Realised my mistake from @kordirko answer. My table contains only non-overlapping ranges, so both queries are returning same results.

回答1:

Such a comparision doesn't make sense, since you are comparing apples to oranges.

These two queries are not eqivalent, they give different resuts,
thus MySql optimises them in a different way and their plans can differ.

See this simple example: http://sqlfiddle.com/#!9/98678/2

create table account_range(
  is_active int,
  range_start int,
  range_end int
 );

 insert into account_range values
 (1,-20,100), (1,10,30);

First query gives 2 rows:

select * from account_range
 where is_active = 1 and 25 between range_start AND range_end;

| is_active | range_start | range_end |
|-----------|-------------|-----------|
|         1 |         -20 |       100 |
|         1 |          10 |        30 |

Second query gives only 1 row:

SELECT * FROM account_range
WHERE
    is_active = 1 AND 
    range_start = (SELECT MAX(range_start)
                   FROM account_range
                   WHERE range_start <= 25
    ) AND 
    range_end = (SELECT MIN(range_end)
                 FROM account_range
                 WHERE range_end >= 25
    )

| is_active | range_start | range_end |
|-----------|-------------|-----------|
|         1 |          10 |        30 |

To speed this query up (the first one), two bitmap indexes can be used together with "bitmap and" operation - but MySql doesn't have such a feature.

Another option is a spatial index (for example GIN indexes in PostgreSql: http://www.postgresql.org/docs/current/static/textsearch-indexes.html).

And another option is a star transformation (or a star schema) - you need to "divide" this table into two "dimension" or "measures" tables and one "fact" table .... but this is too broad topic, if you want know more you can start from here: https://en.wikipedia.org/wiki/Star_schema



回答2:

Second query is fast because MySQL is able to use available index.

SELECT * FROM account_range
WHERE
   is_active = 1 AND 
   range_start = a_constant_value_1 AND 
   range_end = a_constant_value_2

Above query is fast because range_se_active_idx index can satisfy search criteria so it is used.

Both subqueries are also fast (see Select tables optimized away in EXPLAIN's output)

   SELECT MAX(range_start) FROM account_range
    WHERE range_start <= '8033576667466317'

   SELECT MIN(range_end) FROM account_range
    WHERE range_end >= '8033576667466317'

because range_start and range_end both are indexed, they are ordered.

With ordered data, for first sub query, MySQL basically just picks one record whose range_start equals 8033576667466317 or one below it (MAX(range_start)). For second sub query, MySQL picks one record whose range_end equals 8033576667466317 or one above it (MIN(range_end)).

For BETWEEN ... AND .. query, MySQL cannot find any indices because that is not a range search. It's basically same as

SELECT * FROM account_range
WHERE
  is_active = 1 AND 
  range_start >= '8033576667466317' AND
  range_end <= '8033576667466317';

It has to search records with range_start from 8033576667466317 to largest value and also from smallest range_end to 8033576667466317. All indices cannot satisfy this criteria so it has to scan table.

I believe it can be optimized if you can rewrite it into something like this:

SELECT * FROM account_range
WHERE
  is_active = 1 AND 
  (range_start BETWEEN a_min_value AND a_max_value) AND
  (range_end BETWEEN a_min_value AND a_max_value);