How can I further optimize a derived table query?

Posted 2019-01-17 04:26

Question:

UPDATE: I found a solution. See my Answer below.

My Question

How can I optimize this query to minimize my downtime? I need to update over 50 schemas, with ticket counts ranging from 100,000 to 2 million. Is it advisable to attempt to set all fields in tickets_extra at the same time? I feel that there is a solution here that I'm just not seeing. I've been banging my head against this problem for over a day.

Also, I initially tried this without using a sub-SELECT, but the performance was much worse than what I currently have.

Background

I'm trying to optimize my database for a report that needs to be run. The fields I need to aggregate on are very expensive to calculate, so I am denormalizing my existing schema a bit to accommodate this report. Note that I simplified the tickets table quite a bit by removing a few dozen irrelevant columns.

My report will be aggregating ticket counts by Manager When Created and Manager When Resolved. This complicated relationship is diagrammed here:

EAV diagram: http://cdn.cloudfiles.mosso.com/c163801/eav.png

To avoid the half dozen nasty joins required to calculate this on-the-fly I've added the following table to my schema:

mysql> show create table tickets_extra\G
*************************** 1. row ***************************
       Table: tickets_extra
Create Table: CREATE TABLE `tickets_extra` (
  `ticket_id` int(11) NOT NULL,
  `manager_created` int(11) DEFAULT NULL,
  `manager_resolved` int(11) DEFAULT NULL,
  PRIMARY KEY (`ticket_id`),
  KEY `manager_created` (`manager_created`,`manager_resolved`),
  KEY `manager_resolved` (`manager_resolved`,`manager_created`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

The problem now is, I haven't been storing this data anywhere. The manager was always calculated dynamically. I have millions of tickets across several databases with the same schema that need to have this table populated. I want to do this in as efficient a way as possible, but have been unsuccessful in optimizing the queries I'm using to do so:

INSERT INTO tickets_extra (ticket_id, manager_created)
SELECT
  t.id, 
  su.user_id
FROM (
  SELECT 
    t.id, 
    shift_times.shift_id AS shift_id 
  FROM tickets t
  JOIN shifts ON t.shop_id = shifts.shop_id 
  JOIN shift_times ON (shifts.id = shift_times.shift_id
  AND shift_times.dow = DAYOFWEEK(t.created)
  AND TIME(t.created) BETWEEN shift_times.start AND shift_times.end)
) t
LEFT JOIN shifts_users su ON t.shift_id = su.shift_id
LEFT JOIN shift_positions ON su.shift_position_id = shift_positions.id
WHERE shift_positions.level = 1

This query takes over an hour to run on a schema that has > 1.7 million tickets. This is unacceptable for the maintenance window I have. Also, it doesn't even handle calculating the manager_resolved field, as attempting to combine that into the same query pushes the query time into the stratosphere. My current inclination is to keep them separate, and use an UPDATE to populate the manager_resolved field, but I'm not sure.
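For reference, the separate UPDATE pass I have in mind for manager_resolved would look something like this (a sketch only; it mirrors the INSERT above with t.resolved in place of t.created, so I expect it to share the same performance problem):

UPDATE tickets_extra te
JOIN (
  SELECT
    t.id,
    shift_times.shift_id AS shift_id
  FROM tickets t
  JOIN shifts ON t.shop_id = shifts.shop_id
  JOIN shift_times ON (shifts.id = shift_times.shift_id
  AND shift_times.dow = DAYOFWEEK(t.resolved)
  AND TIME(t.resolved) BETWEEN shift_times.start AND shift_times.end)
) t ON te.ticket_id = t.id
JOIN shifts_users su ON t.shift_id = su.shift_id
JOIN shift_positions ON su.shift_position_id = shift_positions.id
SET te.manager_resolved = su.user_id
WHERE shift_positions.level = 1;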

Finally, here is the EXPLAIN output of the SELECT portion of the original INSERT query:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 167661
        Extra: 
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: su
         type: ref
possible_keys: shift_id_fk_idx,shift_position_id_fk_idx
          key: shift_id_fk_idx
      key_len: 4
          ref: t.shift_id
         rows: 5
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: shift_positions
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 6
        Extra: Using where; Using join buffer
*************************** 4. row ***************************
           id: 2
  select_type: DERIVED
        table: t
         type: ALL
possible_keys: fk_tickets_shop_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 173825
        Extra: 
*************************** 5. row ***************************
           id: 2
  select_type: DERIVED
        table: shifts
         type: ref
possible_keys: PRIMARY,shop_id_fk_idx
          key: shop_id_fk_idx
      key_len: 4
          ref: dev_acmc.t.shop_id
         rows: 1
        Extra: 
*************************** 6. row ***************************
           id: 2
  select_type: DERIVED
        table: shift_times
         type: ref
possible_keys: shift_id_fk_idx
          key: shift_id_fk_idx
      key_len: 4
          ref: dev_acmc.shifts.id
         rows: 4
        Extra: Using where
6 rows in set (6.30 sec)

Thank you so much for reading!

Answer 1:

Well, I found a solution. It took a lot of experimentation, and I think a good bit of blind luck, but here it is:

CREATE TABLE magic ENGINE=MEMORY
SELECT
  s.shop_id AS shop_id,
  s.id AS shift_id,
  st.dow AS dow,
  st.start AS start,
  st.end AS end,
  su.user_id AS manager_id
FROM shifts s
JOIN shift_times st ON s.id = st.shift_id
JOIN shifts_users su ON s.id = su.shift_id
JOIN shift_positions sp ON su.shift_position_id = sp.id AND sp.level = 1;

ALTER TABLE magic ADD INDEX (shop_id, dow);

CREATE TABLE tickets_extra ENGINE=MyISAM
SELECT 
  t.id AS ticket_id,
  (
    SELECT m.manager_id
    FROM magic m
    WHERE DAYOFWEEK(t.created) = m.dow
    AND TIME(t.created) BETWEEN m.start AND m.end
    AND m.shop_id = t.shop_id
  ) AS manager_created,
  (
    SELECT m.manager_id
    FROM magic m
    WHERE DAYOFWEEK(t.resolved) = m.dow
    AND TIME(t.resolved) BETWEEN m.start AND m.end
    AND m.shop_id = t.shop_id
  ) AS manager_resolved
FROM tickets t;
DROP TABLE magic;

Lengthy Explanation

Now I'll explain why this works, along with my thought process and the steps that got me here.

First, I knew the query I was trying was suffering because of the huge derived table, and the subsequent JOINs onto it. I was taking my well-indexed tickets table and joining all the shift_times data onto it, then letting MySQL chew on that while it attempted to join the shifts and shift_positions tables. This derived behemoth was an unindexed mess of up to 2 million rows.

Now, I knew this was happening. The reason I was going down this road, though, was because the "proper" way to do this, using strictly JOINs, was taking even longer. This is due to the nasty bit of chaos required to determine who the manager of a given shift is. I have to join down to shift_times to find out what the correct shift even is, while simultaneously joining down to shift_positions to figure out the user's level. I don't think the MySQL optimizer handles this very well; it ends up creating a HUGE monstrosity of a temporary table from the joins, then filtering out what doesn't apply.

So, as the derived table seemed to be the "way to go", I stubbornly persisted with it for a while. I tried punting it down into a JOIN clause: no improvement. I tried creating a temporary table from the derived table, but again it was too slow, as the temp table was unindexed.

I came to realize that I had to handle the calculation of shifts, times, and positions sanely. I thought, maybe a VIEW would be the way to go. What if I created a VIEW that contained this information: (shop_id, shift_id, dow, start, end, manager_id)? Then I would simply have to join the tickets table on shop_id and the whole DAYOFWEEK/TIME calculation, and I'd be in business. Of course, I failed to remember that MySQL handles VIEWs rather assily. It doesn't materialize them at all; it simply runs the query you would have used to get the view for you. So by joining tickets onto this, I was essentially running my original query - no improvement.
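For illustration, the VIEW I had in mind was something like this (a sketch; the view name is made up, and the SELECT is the same one I later used for the magic table below):

CREATE VIEW shift_managers AS
SELECT
  s.shop_id AS shop_id,
  s.id AS shift_id,
  st.dow AS dow,
  st.start AS start,
  st.end AS end,
  su.user_id AS manager_id
FROM shifts s
JOIN shift_times st ON s.id = st.shift_id
JOIN shifts_users su ON s.id = su.shift_id
JOIN shift_positions sp ON su.shift_position_id = sp.id AND sp.level = 1;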

So, instead of a VIEW I decided to use a TEMPORARY TABLE. This worked well if I only fetched one of the managers (created or resolved) at a time, but it was still pretty slow. Also, I found out that MySQL won't let you refer to the same TEMPORARY table twice in the same query (I would have to join my temporary table twice to differentiate between manager_created and manager_resolved). This is a big WTF, as I can do it as long as I don't specify TEMPORARY - this is where the CREATE TABLE magic ENGINE=MEMORY came into play.
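A minimal illustration of that restriction (a throwaway sketch, not from my actual schema):

CREATE TEMPORARY TABLE tmp ENGINE=MEMORY SELECT 1 AS n;

-- Referencing the same TEMPORARY table twice in one query fails:
SELECT a.n, b.n FROM tmp a JOIN tmp b;
-- ERROR 1137 (HY000): Can't reopen table: 'b'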

With this pseudo temporary table in hand, I tried my JOIN for just manager_created again. It performed well, but was still rather slow. Yet when I JOINed again to get manager_resolved in the same query, the query time ticked back up into the stratosphere. Looking at the EXPLAIN showed a full table scan of tickets (~2 million rows), as expected, with the JOINs onto the magic table at ~2,087 rows each. Again, I seemed to be running into fail.
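That double-JOIN attempt looked roughly like this (a sketch reconstructed from the description; the join conditions mirror the subselects below):

SELECT
  t.id AS ticket_id,
  mc.manager_id AS manager_created,
  mr.manager_id AS manager_resolved
FROM tickets t
LEFT JOIN magic mc ON mc.shop_id = t.shop_id
  AND mc.dow = DAYOFWEEK(t.created)
  AND TIME(t.created) BETWEEN mc.start AND mc.end
LEFT JOIN magic mr ON mr.shop_id = t.shop_id
  AND mr.dow = DAYOFWEEK(t.resolved)
  AND TIME(t.resolved) BETWEEN mr.start AND mr.end;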

I now began to think about how to avoid the JOINs altogether, and that's when I found an obscure, ancient message board post where someone suggested using subselects (can't find the link in my history). This is what led to the second SELECT query shown above (the one that creates tickets_extra). In the case of selecting just a single manager field it performed well, but again, with both it was crap. I looked at the EXPLAIN and saw this:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: t
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 173825
        Extra: 
*************************** 2. row ***************************
           id: 3
  select_type: DEPENDENT SUBQUERY
        table: m
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2037
        Extra: Using where
*************************** 3. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: m
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2037
        Extra: Using where
3 rows in set (0.00 sec)

Ack, the dreaded DEPENDENT SUBQUERY. It's often suggested to avoid these, as MySQL will usually execute them in an outside-in fashion, executing the inner query for every row of the outer. I ignored this, and wondered: "Well... what if I just indexed this stupid magic table?" Thus, the ADD INDEX (shop_id, dow) was born.

Check this out:

mysql> CREATE TABLE magic ENGINE=MEMORY
<snip>
Query OK, 3220 rows affected (0.40 sec)

mysql> ALTER TABLE magic ADD INDEX (shop_id, dow);
Query OK, 3220 rows affected (0.02 sec)

mysql> CREATE TABLE tickets_extra ENGINE=MyISAM
<snip>
Query OK, 1933769 rows affected (24.18 sec)

mysql> drop table magic;
Query OK, 0 rows affected (0.00 sec)

Now THAT'S what I'm talkin' about!

Conclusion

This is definitely the first time I've created a non-TEMPORARY table on the fly, and INDEXed it on the fly, simply to run a single query efficiently. I guess I always assumed that adding an index on the fly was a prohibitively expensive operation. (Adding an index to my 2-million-row tickets table can take over an hour.) Yet for a mere ~3,000 rows it's a cakewalk.

Don't be afraid of DEPENDENT SUBQUERIES, creating TEMPORARY tables that really aren't, indexing on the fly, or aliens. They can all be good things in the right situation.

Thanks for all the help StackOverflow. :-D



Answer 2:

You should have used Postgres, lol. A simple query like this should not take more than a few tens of seconds, provided you have enough RAM to avoid disk thrashing.

Anyway.

=> Is the problem in the SELECT or the INSERT?

(run the SELECT alone on a test server and time it).

=> Is your query disk bound or CPU bound?

Launch it on a test server and check the vmstat output. If it is CPU bound, skip this. If it is disk bound, check the working set size (i.e. the size of your database). If the working set is smaller than your RAM, the query should not be disk bound. You can force a table to be loaded into the OS cache before executing a query by running a dummy select like SELECT SUM(some_column) FROM table. This can be useful if a query selects many rows in random order from a table that is not cached in RAM: you trigger a sequential scan of the table, which loads it into the cache, and then the random access is much faster. With some trickery you can also cache indexes (or just tar your database directory to /dev/null, lol).
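For example, to warm the tickets table before the big query (any cheap aggregate that forces a full scan will do):

-- Sequential scan pulls the whole table into the OS cache
SELECT SUM(id) FROM tickets;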

Of course, adding more RAM could help (but you need to check whether the query is killing the disk or the CPU first). Or tell MySQL to use more of your RAM in its configuration (key_buffer, etc.).
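For example (an illustrative value only; key_buffer_size is the MyISAM index cache):

-- Give MyISAM a bigger index cache (512 MB here; tune to your RAM)
SET GLOBAL key_buffer_size = 536870912;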

If you are making millions of random HDD seeks you are in PAIN.

=> OK, now the query

FIRST, ANALYZE your tables.

LEFT JOIN shift_positions ON su.shift_position_id = shift_positions.id WHERE shift_positions.level = 1

WHY do you LEFT JOIN and then add a WHERE on it? The LEFT makes no sense. If there is no matching row in shift_positions, the LEFT JOIN will generate a NULL, and the WHERE will then reject it.

Solution: use JOIN instead of LEFT JOIN and move (level = 1) into the JOIN's ON() condition.

While you're at it, also get rid of the other LEFT JOIN (replace it with JOIN), unless you are really interested in all those NULLs (I guess you are not).

Now you probably can get rid of the subselect.
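Applied to the original query, that gives something like this (a sketch):

INSERT INTO tickets_extra (ticket_id, manager_created)
SELECT
  t.id,
  su.user_id
FROM tickets t
JOIN shifts ON t.shop_id = shifts.shop_id
JOIN shift_times ON (shifts.id = shift_times.shift_id
  AND shift_times.dow = DAYOFWEEK(t.created)
  AND TIME(t.created) BETWEEN shift_times.start AND shift_times.end)
JOIN shifts_users su ON shift_times.shift_id = su.shift_id
JOIN shift_positions ON (su.shift_position_id = shift_positions.id
  AND shift_positions.level = 1);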

Next.

AND TIME(t.created) BETWEEN shift_times.start AND shift_times.end

This is not indexable, because you have the function TIME() in the condition (use Postgres, lol). Let's look at it:

JOIN shift_times ON (shifts.id = shift_times.shift_id AND shift_times.dow = DAYOFWEEK(t.created) AND TIME(t.created) BETWEEN shift_times.start AND shift_times.end)

Ideally you would like a multicolumn index on shift_times (shift_id, dow, start) that the JOIN can probe with shifts.id, DAYOFWEEK(t.created), and TIME(t.created), so the whole JOIN condition can be resolved from the index.

Solution: add columns (say, created_dow and created_time) to tickets, containing DAYOFWEEK(created) and TIME(created), filled with the correct values by a trigger firing on INSERT or UPDATE.

Now create the multicolumn index on shift_times (shift_id, dow, start), so the join compares plain, indexed columns.
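A sketch of that approach (column, trigger, and index names are illustrative):

ALTER TABLE tickets
  ADD COLUMN created_dow TINYINT,
  ADD COLUMN created_time TIME;

CREATE TRIGGER tickets_bi BEFORE INSERT ON tickets
FOR EACH ROW
SET NEW.created_dow  = DAYOFWEEK(NEW.created),
    NEW.created_time = TIME(NEW.created);

-- (add a matching BEFORE UPDATE trigger as well)

ALTER TABLE shift_times ADD INDEX (shift_id, dow, start);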



Answer 3:

This will let you keep read-only access for the duration of the changes:

CREATE TABLE table_new (/* new schema */);
INSERT INTO table_new SELECT * FROM `table` ORDER BY primary_key_column;
RENAME TABLE `table` TO table_old, table_new TO `table`;
-- recreate triggers if necessary

When inserting data into InnoDB tables, it's crucial that you do it in primary key order (otherwise, with large datasets, it can be orders of magnitude slower).



Answer 4:

About BETWEEN

SELECT * FROM a WHERE a.column BETWEEN x AND y 
  • is indexable and corresponds to a range lookup on index a.column (if you have one)
  • is 100% equivalent to a.column >= x AND a.column <= y

While this:

SELECT * FROM a WHERE somevalue BETWEEN a.column1 AND a.column2
  • is 100% equivalent to somevalue >= a.column1 AND somevalue <= a.column2
  • is a very different thing from the first one above
  • is not indexable by a range lookup (there is no single range; you have two columns here)
  • generally leads to horrible query performance

I think there was confusion about this in the debate on "between" above.

OP has the first kind, so no worry.