Is cross-table indexing possible?

Published 2020-02-08 05:56

Question:

Consider a structure where you have a many-to-one (or one-to-many) relationship with a condition (where, order by, etc.) on both tables. For example:

CREATE TABLE tableTwo (
    id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    eventTime DATETIME NOT NULL,
    INDEX (eventTime)
) ENGINE=InnoDB;

CREATE TABLE tableOne (
    id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    tableTwoId INT UNSIGNED NOT NULL,
    objectId INT UNSIGNED NOT NULL,
    INDEX (objectId),
    FOREIGN KEY (tableTwoId) REFERENCES tableTwo (id)
) ENGINE=InnoDB;

and for an example query:

select * from tableOne t1 
  inner join tableTwo t2 on t1.tableTwoId = t2.id
  where objectId = '..'
  order by eventTime;

Let's say you index tableOne.objectId and tableTwo.eventTime. If you then run EXPLAIN on the above query, it shows "Using filesort". Essentially, MySQL first applies the tableOne.objectId index, but it can't use the tableTwo.eventTime index for the ordering because that index covers the entirety of tableTwo (not the limited result set), so it must sort the rows manually.
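The symptom is easy to reproduce outside MySQL too. As an illustrative sketch (not from the original post, and using SQLite via Python's sqlite3 as a stand-in), EXPLAIN QUERY PLAN reports a temporary B-tree for the ORDER BY, which is SQLite's analogue of "Using filesort":

```python
# Illustrative stand-in: same schema in SQLite instead of MySQL/InnoDB.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tableTwo (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    eventTime TEXT NOT NULL
);
CREATE INDEX idx_t2_eventTime ON tableTwo (eventTime);

CREATE TABLE tableOne (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tableTwoId INTEGER NOT NULL REFERENCES tableTwo (id),
    objectId INTEGER NOT NULL
);
CREATE INDEX idx_t1_objectId ON tableOne (objectId);
""")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM tableOne t1
      JOIN tableTwo t2 ON t1.tableTwoId = t2.id
     WHERE t1.objectId = 1
     ORDER BY t2.eventTime
""").fetchall()

# Each plan row's last column is the human-readable step description.
steps = [row[3] for row in plan]
for s in steps:
    print(s)
```

The plan drives the join from tableOne via the objectId index and then has to finish with a "USE TEMP B-TREE FOR ORDER BY" step, because no single-table index can deliver rows ordered by the other table's eventTime.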

Thus, is there a way to do a cross-table index so it wouldn't have to filesort each time results are retrieved? Something like:

create index ind_t1oi_t2et on tableOne t1 
  inner join tableTwo t2 on t1.tableTwoId = t2.id 
  (t1.objectId, t2.eventTime);

Also, I've looked into creating a view and indexing that, but indexing is not supported for views.

The solution I've been leaning towards if cross-table indexing isn't possible is replicating the conditional data in one table. In this case that means eventTime would be replicated in tableOne and a multi-column index would be set up on tableOne.objectId and tableOne.eventTime (essentially manually creating the index). However, I thought I'd seek out other people's experience first to see if that was the best way.

Thanks much!

Update:

Here are some procedures for loading test data and comparing results:

drop procedure if exists populate_table_two;
delimiter #
create procedure populate_table_two(IN numRows int)
begin
declare v_counter int unsigned default 0;
  while v_counter < numRows do
    insert into tableTwo (eventTime) 
    values (CURRENT_TIMESTAMP - interval floor(rand() * 1000) minute);
    set v_counter=v_counter+1;
  end while;
end #
delimiter ;

drop procedure if exists populate_table_one;
delimiter #
create procedure populate_table_one
   (IN numRows int, IN maxTableTwoId int, IN maxObjectId int)
begin
declare v_counter int unsigned default 0;
  while v_counter < numRows do
    insert into tableOne (tableTwoId, objectId) 
      values (floor(1 +(rand() * maxTableTwoId)), 
              floor(1 +(rand() * maxObjectId)));
    set v_counter=v_counter+1;
  end while;
end #
delimiter ;

You can use these as follows to populate 10,000 rows in tableTwo and 20,000 rows in tableOne (with random references to tableTwo and random objectIds between 1 and 5), which took 26.2 and 70.77 seconds respectively to run for me:

call populate_table_two(10000);
call populate_table_one(20000, 10000, 5);
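For comparison, the same test data can be generated without stored procedures. A sketch with Python's sqlite3 (an illustrative stand-in for the MySQL procedures above) loads both tables with executemany, avoiding the per-row statements that make row-at-a-time loops slow:

```python
# Illustrative stand-in for populate_table_two / populate_table_one,
# using SQLite instead of MySQL and bulk inserts instead of loops.
import random
import sqlite3
from datetime import datetime, timedelta

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tableTwo (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    eventTime TEXT NOT NULL
);
CREATE TABLE tableOne (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tableTwoId INTEGER NOT NULL,
    objectId INTEGER NOT NULL
);
""")

# 10,000 tableTwo rows with a timestamp up to 1000 minutes in the past.
base = datetime(2020, 2, 8, 5, 56)
con.executemany(
    "INSERT INTO tableTwo (eventTime) VALUES (?)",
    [((base - timedelta(minutes=random.randrange(1000))).isoformat(" "),)
     for _ in range(10000)],
)
# 20,000 tableOne rows with random tableTwo references and objectIds 1..5.
con.executemany(
    "INSERT INTO tableOne (tableTwoId, objectId) VALUES (?, ?)",
    [(random.randint(1, 10000), random.randint(1, 5)) for _ in range(20000)],
)

n2 = con.execute("SELECT count(*) FROM tableTwo").fetchone()[0]
n1 = con.execute("SELECT count(*) FROM tableOne").fetchone()[0]
print(n2, n1)  # 10000 20000
```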

Update 2 (Tested Triggering SQL):

Below is the tried and tested SQL based on daniHp's triggering method. It keeps the copied eventTime on tableOne in sync when a tableOne row is inserted or a tableTwo row is updated. This method should also work for many-to-many relationships if the condition columns are copied to the joining table. In my testing with 300,000 rows in tableOne and 200,000 rows in tableTwo, the old query with similar limits took 0.12 sec, while the new query still shows as 0.00 sec. That is a clear improvement, and this method should hold up well into the millions of rows and beyond.

alter table tableOne add column tableTwo_eventTime datetime;

create index ind_t1_oid_t2et on tableOne (objectId, tableTwo_eventTime);

drop TRIGGER if exists t1_copy_t2_eventTime;
delimiter #
CREATE TRIGGER t1_copy_t2_eventTime
   BEFORE INSERT ON tableOne
for each row
begin
  set NEW.tableTwo_eventTime = (select eventTime 
       from tableTwo t2
       where t2.id = NEW.tableTwoId);
end #
delimiter ;

drop TRIGGER if exists upd_t1_copy_t2_eventTime;
delimiter #
CREATE TRIGGER upd_t1_copy_t2_eventTime
   BEFORE UPDATE ON tableTwo
for each row
begin
  update tableOne 
    set tableTwo_eventTime = NEW.eventTime 
    where tableTwoId = NEW.id;
end #
delimiter ;

And the updated query:

select * from tableOne t1 
  inner join tableTwo t2 on t1.tableTwoId = t2.id
  where t1.objectId = 1
  order by t1.tableTwo_eventTime desc limit 0,10;
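A quick sketch (again using SQLite via Python as an illustrative stand-in, not the original MySQL setup) confirms the idea behind the composite index: once the equality column and the sort column live in one index on one table, the plan no longer needs a sort step at all:

```python
# Illustrative stand-in: the denormalized column plus composite index
# lets the index deliver rows already in the ORDER BY order.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tableTwo (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    eventTime TEXT NOT NULL
);
CREATE TABLE tableOne (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tableTwoId INTEGER NOT NULL,
    objectId INTEGER NOT NULL,
    tableTwo_eventTime TEXT
);
CREATE INDEX ind_t1_oid_t2et ON tableOne (objectId, tableTwo_eventTime);
""")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM tableOne t1
      JOIN tableTwo t2 ON t1.tableTwoId = t2.id
     WHERE t1.objectId = 1
     ORDER BY t1.tableTwo_eventTime DESC
     LIMIT 10
""").fetchall()
steps = [row[3] for row in plan]
for s in steps:
    print(s)
```

The plan searches tableOne via ind_t1_oid_t2et (equality on objectId, backward scan for the DESC order), looks up tableTwo by primary key, and contains no temporary B-tree, which matches the 0.00 sec timing reported above.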

Answer 1:

As you know, SQL Server achieves this with indexed views:

indexed views provide additional performance benefits that cannot be achieved using standard indexes. Indexed views can increase query performance in the following ways:

Aggregations can be precomputed and stored in the index to minimize expensive computations during query execution.

Tables can be prejoined and the resulting data set stored.

Combinations of joins or aggregations can be stored.

In SQL Server, to take advantage of this technique, you must query the view and not the underlying tables. That means you have to know about the view and its indexes.

MySQL does not have indexed views, but you can simulate the behavior with table + triggers + indexes.

Instead of creating a view, you must create an indexed table, a trigger to keep the data table up to date, and then you must query your new table instead of your normalized tables.

You must evaluate whether the improvement in read operations outweighs the overhead added to write operations.

Edited:

Note that it is not always necessary to create a new table. For example, in a 1:N (master-detail) relationship, a trigger can keep a copy of a field from the 'master' table in the 'detail' table. In your case:

CREATE TABLE tableOne (
    id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    tableTwoId INT UNSIGNED NOT NULL,
    objectId INT UNSIGNED NOT NULL,
    desnormalized_eventTime DATETIME NOT NULL,
    INDEX (objectId),
    FOREIGN KEY (tableTwoId) REFERENCES tableTwo (id)
) ENGINE=InnoDB;

delimiter #
CREATE TRIGGER tableOne_desnormalized_eventTime
   BEFORE INSERT ON tableOne
for each row
begin
  -- v_ prefix so the local variable does not shadow the eventTime column;
  -- the lookup must go to tableTwo (the parent table), keyed by the new
  -- row's tableTwoId
  DECLARE v_eventTime DATETIME;
  SET v_eventTime =
      (select eventTime
       from tableTwo
       where tableTwo.id = NEW.tableTwoId);
  SET NEW.desnormalized_eventTime = v_eventTime;
end #
delimiter ;

Notice that this is a before insert trigger.

Now, the query is rewritten as follows:

select * from tableOne t1 
  inner join tableTwo t2 on t1.tableTwoId = t2.id
  where t1.objectId = '..'
  order by t1.desnormalized_eventTime;

Disclaimer: not tested.
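The same pattern can be exercised end-to-end in SQLite via Python (an illustrative stand-in, not the answer's MySQL code; SQLite cannot assign to NEW in a BEFORE trigger, so an AFTER INSERT trigger back-fills the copied column instead):

```python
# Illustrative stand-in for the denormalizing trigger, in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tableTwo (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    eventTime TEXT NOT NULL
);
CREATE TABLE tableOne (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    tableTwoId INTEGER NOT NULL,
    objectId INTEGER NOT NULL,
    desnormalized_eventTime TEXT
);
-- SQLite triggers cannot set NEW.<col> in BEFORE INSERT, so we copy the
-- parent's eventTime right after the insert instead.
CREATE TRIGGER t1_copy_eventTime AFTER INSERT ON tableOne
BEGIN
  UPDATE tableOne
     SET desnormalized_eventTime =
         (SELECT eventTime FROM tableTwo WHERE id = NEW.tableTwoId)
   WHERE id = NEW.id;
END;
""")

con.execute("INSERT INTO tableTwo (eventTime) VALUES ('2020-02-08 05:56:00')")
con.execute("INSERT INTO tableOne (tableTwoId, objectId) VALUES (1, 1)")
row = con.execute("SELECT desnormalized_eventTime FROM tableOne").fetchone()
print(row[0])  # 2020-02-08 05:56:00
```

A second trigger, AFTER UPDATE ON tableTwo, would keep the copy in sync on updates, mirroring the upd_t1_copy_t2_eventTime trigger from the question's Update 2.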



Answer 2:

Cross-table indexing is not possible in MySQL except via the now-defunct Akiban(?) Engine.

I have a rule: "Do not normalize 'continuous' values such as INTs, FLOATs, DATETIMEs, etc." The cost of the JOIN when you need to sort or range-test on the continuous value will kill performance.

DATETIME takes 5 bytes; INT takes 4. So any 'space' argument in favor of normalizing a datetime is rather weak, and it is rare that you would need to 'normalize' a datetime on the off chance that every use of a particular value had to change at once.



Answer 3:

Maybe I'm wrong, but if this were my application I would not duplicate the data unless I needed to order by two columns in two different tables and the query were hot (run many times). Since there is no clear-cut way to avoid the filesort, what about this little trick: force the optimizer to use the index on the eventTime column from the ORDER BY clause.

select * from tableOne t1 
inner join tableTwo t2 use index (eventTime)  on t1.tableTwoId = t2.id and t2.eventTime > 0
where t1.objectId = 1
order by t2.eventTime desc limit 0,10;

Notice the use index (eventTime) hint and the extra t2.eventTime > 0 condition, which gives the optimizer a range predicate on that index.

Its EXPLAIN output shows that the optimizer used the index on eventTime instead of a filesort:

id  select_type  table  type   possible_keys        key         key_len  ref          rows  Extra
1   SIMPLE       t2     range  eventTime            eventTime   5        NULL         5000  Using where; Using index
1   SIMPLE       t1     ref    objectId,tableTwoId  tableTwoId  4        tests.t2.id  1     Using where