Optimize SQL that uses a BETWEEN clause

Posted 2019-02-16 14:52

Question:

Consider the following 2 tables:

Table A:
id
event_time

Table B
id
start_time
end_time

Every record in table A is mapped to exactly 1 record in table B. This means table B has no overlapping periods. Many records from table A can be mapped to the same record in table B.

I need a query that returns all A.id, B.id pairs. Something like:

SELECT A.id, B.id 
FROM A, B 
WHERE A.event_time BETWEEN B.start_time AND B.end_time

I am using MySQL and I cannot optimize this query. With ~980 records in table A and 130,000 in table B this takes forever. I understand this has to perform 980 lookups, but taking more than 15 minutes on a beefy machine is strange. Any suggestions?

P.S. I cannot change the database schema, but I can add indexes. However, an index (with one or two fields) on the time fields doesn't help.

Answer 1:

You may want to try something like this:

SELECT A.id,
    (SELECT B.id
     FROM B
     WHERE A.event_time BETWEEN B.start_time AND B.end_time
     LIMIT 1) AS b_id
FROM A

If you have an index on the (start_time, end_time) fields of B, then this should work quite well.



Answer 2:

I'm not sure this can be fully optimized. I tried it on MySQL 5.1.30, and I added an index on {B.start_time, B.end_time} as suggested by other folks. The best EXPLAIN report I could get was the range access method:

EXPLAIN SELECT A.id, B.id FROM A JOIN B 
ON A.event_time BETWEEN B.start_time AND B.end_time;

+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                          |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
|  1 | SIMPLE      | A     | ALL  | event_time    | NULL | NULL    | NULL |    8 |                                                | 
|  1 | SIMPLE      | B     | ALL  | start_time    | NULL | NULL    | NULL |   96 | Range checked for each record (index map: 0x4) | 
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+

See the note on the far right. The optimizer thinks it might be able to use the index on {B.start_time, B.end_time}, but in the end it decided not to use it. Your results may vary, because your data distribution is more representative.

Compare with the index usage if you compare A.event_time to a constant range:

EXPLAIN SELECT A.id FROM A
WHERE A.event_time BETWEEN '2009-02-17 09:00' and '2009-02-17 10:00';

+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
|  1 | SIMPLE      | A     | range | event_time    | event_time | 8       | NULL |    1 | Using where | 
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+

And compare with the dependent sub-query form given by @Luke and @Kibbee, which seems to make use of indexes more effectively:

EXPLAIN SELECT A.id AS id_from_a,
    (
        SELECT B.id
        FROM B
        WHERE A.event_time BETWEEN B.start_time AND B.end_time
        LIMIT 0, 1
    ) AS id_from_b
FROM A;

+----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type        | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | PRIMARY            | A     | index | NULL          | PRIMARY | 8       | NULL |    8 | Using index | 
|  2 | DEPENDENT SUBQUERY | B     | ALL   | start_time    | NULL    | NULL    | NULL |  384 | Using where | 
+----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+

Weirdly, EXPLAIN lists possible_keys as NULL (i.e. no index is usable for a WHERE clause) but then decides to use the primary key after all. This is less strange than it looks: the outer query has no WHERE clause at all, so there are no candidate keys to list, and the PRIMARY key is simply scanned as a covering index (hence "Using index" in Extra).



Answer 3:

I wouldn't normally recommend a query like this, but...

Since you've specified that table A only has about 980 rows and that each row maps to exactly one row in table B, you could do the following, and it will most likely be a lot faster than a Cartesian join:

SELECT A.id AS id_from_a,
    (
        SELECT B.id
        FROM B
        WHERE A.event_time BETWEEN B.start_time AND B.end_time
        LIMIT 0, 1
    ) AS id_from_b
FROM A


Answer 4:

I ran some tests on a similar problem: determining a country from an IP address (given as a number). Here are my data and results:

  • Table A (that contains users and IP addresses) contains about 20 records.
  • Table B (that contains the IP ranges for each country) contains about 100000 records.

  • The JOIN query using BETWEEN took about 10 seconds.
  • The SELECT-inside-a-SELECT query using BETWEEN took about 5.5 seconds.
  • The SELECT-inside-a-SELECT query using a spatial index took about 6.3 seconds.
  • The JOIN query using a spatial index took 0 seconds!



Answer 5:

Notice that when running this query, MySQL has to consider 980 × 130,000 row combinations before the condition filters them down. Such a join is not recommended, and I can see why it gives you performance issues.



Answer 6:

If you can't change the schema -- in particular, if you can't add an index on A.event_time -- I don't see much room for improvement at the SQL level.

I'd be more inclined to do it in code (a SQL equivalent of the final lookup is sketched after the list):

  • read all B (start_time, end_time, id) tuples into a list, sorted on start time
  • read all A events
  • for each A event
    • find the largest start time <= event time (binary search will do fine)
    • if the event time is <= that B's end time, add the A event to this B's list of events
    • else this A event has no home
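
For comparison, a sketch of the same "largest start time at or before the event" probe written as a correlated subquery (with an index on B.start_time, the descending scan can stop at the first matching row):

SELECT A.id AS a_id,
    (SELECT B.id
     FROM B
     WHERE B.start_time <= A.event_time
       AND B.end_time >= A.event_time
     ORDER BY B.start_time DESC
     LIMIT 1) AS b_id
FROM A;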


Answer 7:

By not changing the schema, do you mean you can't add an index? Try a multi-column index on (start_time, end_time).
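
For example (the index name is made up):

CREATE INDEX idx_b_start_end ON B (start_time, end_time);

For this query shape only the leading start_time column can drive the range scan; the end_time part of the condition is checked as a filter.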



Answer 8:

Give it a try using the standard comparison operators (>= and <=).
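
In other words, since BETWEEN is inclusive on both ends, the query from the question is equivalent to:

SELECT A.id, B.id
FROM A, B
WHERE A.event_time >= B.start_time
  AND A.event_time <= B.end_time

MySQL documents BETWEEN as exactly this rewrite, so the plan is unlikely to change, but the explicit form makes the comparison semantics obvious.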



Answer 9:

I see that you are doing a cross join of the two tables. That is not good, and the DBMS will take a long time to execute it; a cross join is the most expensive operation in SQL. That could be the reason for the long execution time.

Try it this way; it might resolve the problem:

SELECT A.id, B.id
FROM A, B
WHERE A.id = B.id
  AND A.event_time BETWEEN B.start_time AND B.end_time

I hope this helps you :)



Answer 10:

Is there an index on B (start_time, end_time)? If not, perhaps adding one might speed up the matching of B rows to A rows?

Mind you, if you can't change the schema, maybe you can't create new indexes either?



Answer 11:

The only way to speed up the execution of this query is by making use of indexes.

Put A.event_time into one index, and B.start_time together with B.end_time into another.

If, as you said, this is the only condition that binds the two entities together, I think this is the only solution you can take.

Fede



Answer 12:

Daremon, this answer is based on one of your comments where you said every record in table A maps to only one record in table B.

Can you add an additional table to your schema? If so, you can pre-compute the result of this query and store it in another table. You will also have to keep this pre-computed table in sync with changes to tables A and B.
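
A minimal sketch of such a pre-computed table (the table and column names are made up):

CREATE TABLE a_b_map (
    a_id INT NOT NULL PRIMARY KEY,
    b_id INT NOT NULL
);

INSERT INTO a_b_map (a_id, b_id)
SELECT A.id, B.id
FROM A
JOIN B ON A.event_time BETWEEN B.start_time AND B.end_time;

The original query then becomes a plain SELECT a_id, b_id FROM a_b_map, at the cost of refreshing the table whenever A or B changes.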



Answer 13:

Based on your comment that each entry in A corresponds to exactly one entry in B, the easiest solution would be to remove the AUTO_INCREMENT from B's id column and then replace all of B's ids with the ids from A.



Answer 14:

Put an index on B.start_time descending and then use this query:

 SELECT A.id AS idA,
 (SELECT B.id FROM B
  WHERE A.event_time >= B.start_time
  ORDER BY B.start_time DESC
  LIMIT 0, 1) AS idB
 FROM A

As the time buckets in B are disjoint, this gives you the first matching time bucket and gets rid of the BETWEEN, though the sub-query is still there. Maybe including B.id in the index would give you an additional small performance boost. (Disclaimer: not sure about the MySQL syntax.)
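
A sketch of such an index (note that MySQL honours DESC in index definitions only as of 8.0; earlier versions parse it but build an ascending index, which can still be scanned in reverse):

CREATE INDEX idx_b_start_desc ON B (start_time DESC, id);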



Answer 15:

I can't think of a reason to have a table with 130,000 rows of time intervals. Anyway, there must be a good reason for such a design, and if so, you have to avoid recomputing this join every time. So here's my suggestion: I would add a reference to B.id in table A (A.B_ID) and use triggers to maintain consistency. Any time you add a new record (insert trigger) or the event_time column changes (update trigger), you recompute the reference to the B record that the time corresponds to. Your select statement is then reduced to a single SELECT * FROM A.
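
A minimal sketch of those triggers, assuming a nullable B_ID column can be added to A:

ALTER TABLE A ADD COLUMN B_ID INT NULL;

DELIMITER //

-- recompute the owning interval whenever a row is inserted
CREATE TRIGGER trg_a_bi BEFORE INSERT ON A
FOR EACH ROW
BEGIN
    SET NEW.B_ID = (SELECT B.id FROM B
                    WHERE NEW.event_time BETWEEN B.start_time AND B.end_time
                    LIMIT 1);
END//

-- recompute only when event_time actually changes
CREATE TRIGGER trg_a_bu BEFORE UPDATE ON A
FOR EACH ROW
BEGIN
    IF NEW.event_time <> OLD.event_time THEN
        SET NEW.B_ID = (SELECT B.id FROM B
                        WHERE NEW.event_time BETWEEN B.start_time AND B.end_time
                        LIMIT 1);
    END IF;
END//

DELIMITER ;

If the intervals in B can themselves change, matching triggers on B would be needed as well.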



Answer 16:

MySQL won't use an index to satisfy an ORDER BY ... LIMIT over a range condition inside a derived query.

That's why you'll need to create a user-defined function.

Note that if your ranges do overlap, the query will only select one of them (the one that started last).

CREATE UNIQUE INDEX ux_b_start ON b (start_time);

DELIMITER //

CREATE FUNCTION fn_get_last_b(event_date TIMESTAMP) RETURNS INT
READS SQL DATA
BEGIN
  DECLARE ret INT;
  SELECT b.id
  INTO ret
  FROM b
  FORCE INDEX (ux_b_start)
  WHERE b.start_time <= event_date
  ORDER BY
    b.start_time DESC
  LIMIT 1;
  RETURN ret;
END//

DELIMITER ;

SELECT COUNT(*) FROM a;

1000


SELECT COUNT(*) FROM b;

200000

SELECT *
FROM (
  SELECT fn_get_last_b(a.event_time) AS bid,
         a.*
  FROM a
) ao, b FORCE INDEX (PRIMARY)
WHERE b.id = ao.bid
  AND b.end_time >= ao.event_time

1000 rows fetched in 0,0143s (0,1279s)


Answer 17:

Personally, if you have a one-to-many relationship and each record in table A relates to only one record in table B, I would store the table B id in table A and then do a regular join to get the data. What you currently have is a bad design that can never be truly efficient.
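
A sketch of what that would look like, assuming a persisted A.b_id column (hypothetical name) kept up to date:

SELECT A.id, B.id
FROM A
JOIN B ON B.id = A.b_id;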



Answer 18:

There are two caveats to my solution:

1) You said that you can add indexes but not change the schema, so I'm not sure whether this will work for you: MySQL has no function-based indexes, and you would need to create an extra column on table B.

2) You must be using the MyISAM engine for table B. If you cannot use MyISAM, this solution won't work, because only MyISAM supports spatial indexes.

So, assuming that the above two aren't an issue for you, the following should work and give you good performance:

This solution makes use of MySQL's support for spatial data (see the documentation). While spatial data types can be added to a variety of storage engines, only MyISAM supports the spatial R-tree indexes that are needed to get the required performance. One other limitation is that spatial data types only work with numerical data, so you cannot use this technique with string-based range queries.

I won't go into the details of the theory behind how spatial types work and why the spatial index is useful, but you should look at Jeremy Cole's explanation of how to use spatial data types and indexes for GeoIP lookups. Also look at the comments there, as they raise some useful points and alternatives if you need raw performance and can give up some accuracy.

The basic premise is that we take the start/end pair and use it to create four distinct points, the corners of a rectangle that spans the interval along the x-axis and straddles y = 0, and then do a quick lookup into the spatial index to determine whether the particular point in time we care about lies within that rectangle. As mentioned previously, see Jeremy Cole's explanation for a more thorough overview of how this works.

In your particular case we will need to do the following:

1) Alter the table to be a MyISAM table (note you shouldn't do this unless you are fully aware of the consequences of such a change, such as the lack of transactions and the table-locking behavior associated with MyISAM).

alter table B engine = MyISAM;

2) Next we add the new column that will hold the spatial data. We will use the polygon data type as we need to be able to hold a full rectangle.

alter table B add column time_poly polygon NOT NULL;

3) Next we populate the new column with the data (keep in mind that any processes that insert into or update table B will need to be modified to make sure they populate the new column as well). Since the start and end ranges are times, we need to convert them to numbers with the unix_timestamp function:

update B set time_poly = POLYGONFROMWKB(POLYGON(LINESTRING(
    POINT(unix_timestamp(start_time), -1),
    POINT(unix_timestamp(end_time), -1),
    POINT(unix_timestamp(end_time), 1),
    POINT(unix_timestamp(start_time), 1),
    POINT(unix_timestamp(start_time), -1)
  )));

4) Next we add the spatial index to the table (as mentioned previously, this works only on a MyISAM table; on other engines it fails with "ERROR 1464 (HY000): The used table type doesn't support SPATIAL indexes").

alter table B add SPATIAL KEY `IXs_time_poly` (`time_poly`);

5) Next you will need to use the following select in order to make use of the spatial index when querying the data.

SELECT A.id, B.id 
FROM A inner join B force index (IXs_time_poly)
ON MBRCONTAINS(B.time_poly, POINTFROMWKB(POINT(unix_timestamp(A.event_time), 0)));

The force index is there to make 100% sure that MySQL will use the index for the lookup. If everything went well, running an EXPLAIN on the above select should show something similar to the following:

mysql> explain SELECT A.id, B.id
    -> FROM A inner join B force index (IXs_time_poly)
    -> on MBRCONTAINS(B.time_poly, POINTFROMWKB(POINT(unix_timestamp(A.event_time), 0)));
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                                           |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------------------------------+
|  1 | SIMPLE      | A     | ALL  | NULL          | NULL | NULL    | NULL |    1065 |                                                 | 
|  1 | SIMPLE      | B     | ALL  | IXs_time_poly | NULL | NULL    | NULL | 7969897 | Range checked for each record (index map: 0x10) | 
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------------------------------------------+
2 rows in set (0.00 sec)

Please refer to Jeremy Cole's analysis for details about the performance benefits of this method as compared with a between clause.

Let me know if you have any questions.

Thanks,

-Dipin



Answer 19:

Something like this?

SELECT A.id, B.id 
FROM A
JOIN B ON A.id =  B.id 
WHERE A.event_time BETWEEN B.start_time AND B.end_time