I have the following two tables in MySQL (Simplified).
clicks
(InnoDB)- Contains around about 70,000,000 records
- Has an index on the
date_added
column - Has a column
link_id
which refers to a record in thelinks
table
links
(MyISAM)- Contains far fewer records, around about 65,000
I'm trying to run some analytical queries using these tables. I need to pull out some data, about clicks that occurred inside of two specified dates while applying some other user selected filters using other tables and joining them into the links table.
My question revolves around the use of indexes however. When I run the following query:
SELECT
COUNT(1)
FROM
clicks
WHERE
date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-11-03 23:59:59';
I get a response back in 1.40 sec. Using EXPLAIN
I find that the MySQL uses the index on the date_added
column as expected.
EXPLAIN SELECT COUNT(1) FROM clicks WHERE date_added >= '2016-11-01 00:00:00' AND date_added <= '2016-11-16 23:59:59';
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| 1 | SIMPLE | clicks | range | date_added | date_added | 4 | NULL | 1559288 | Using where; Using index |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
However, when I LEFT JOIN
in my links
table I find that the query takes much longer to execute:
SELECT
COUNT(1) AS clicks
FROM
clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE
c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59';
Which completed in 6.50 sec. Using EXPLAIN
I find that the index was not used on the date_added
column:
EXPLAIN SELECT COUNT(1) AS clicks FROM clicks AS c LEFT JOIN links AS l ON l.id = c.link_id WHERE c.date_added >= '2016-11-01 00:00:00' AND c.date_added <= '2016-11-16 23:59:59';
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| 1 | SIMPLE | c | range | date_added | date_added | 4 | NULL | 6613278 | Using where |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | c.link_id | 1 | Using index |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
As you can see the index isn't being used for the date_added
column in the larger table and seems to take far longer. This seems to get even worse when I join in other tables.
Does anyone know why this is happening or if there's anything I can do to get it to use the index on the date_added
column in the clicks table?
Edit
I've just attempted to get my stats out of the database using a different method. The first step in my method involves pulling out a distinct set of link_id
s from the clicks table. I've found that I'm seeing the same problem here again, without a JOIN. The index is not being used:
My query:
SELECT
DISTINCT(link_id) AS link_id
FROM
clicks
WHERE
date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-12-05 10:16:00'
This query took almost a minute to complete. I ran an EXPLAIN
on this and found that the query is not using the index as I expected it would:
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| 1 | SIMPLE | clicks | index | date_added | link_id | 4 | NULL | 79786609 | Using where |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
I expected that it would use the index on date_added
to filter down the result set and then pull out the distinct link_id
values. Any idea why this is happening? I have an index on link_id
as well as date_added
.