When should I use a composite index?

2019-01-01 14:34发布

  1. When should I use a composite index in a database?
  2. What are the performance ramification by using a composite index)?
  3. Why should I use use a composite index?

For example, I have a homes table:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  PRIMARY KEY  (`home_id`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),
) ENGINE=InnoDB  ;

Does it make sense for me to use a composite index for both geolat and geolng, such that:

I replace:

  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),

with:

KEY `geolat_geolng` (`geolat`, `geolng`)

If so:

  • Why?
  • What is the performance ramification by using a composite index)?

UPDATE:

Since many people have stated it entirely dependent upon the queries I perform, below is the most common query performed:

SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

UPDATE 2:

With the following database schema:

CREATE TABLE IF NOT EXISTS `homes` (
  `home_id` int(10) unsigned NOT NULL auto_increment,
  `primary_photo_group_id` int(10) unsigned NOT NULL default '0',
  `customer_id` bigint(20) unsigned NOT NULL,
  `account_type_id` int(11) NOT NULL,
  `address` varchar(128) collate utf8_unicode_ci NOT NULL,
  `city` varchar(64) collate utf8_unicode_ci NOT NULL,
  `state` varchar(2) collate utf8_unicode_ci NOT NULL,
  `zip` mediumint(8) unsigned NOT NULL,
  `price` mediumint(8) unsigned NOT NULL,
  `sqft` smallint(5) unsigned NOT NULL,
  `year_built` smallint(5) unsigned NOT NULL,
  `num_of_beds` tinyint(3) unsigned NOT NULL,
  `num_of_baths` decimal(3,1) unsigned NOT NULL,
  `num_of_floors` tinyint(3) unsigned NOT NULL,
  `description` text collate utf8_unicode_ci,
  `geolat` decimal(10,6) default NULL,
  `geolng` decimal(10,6) default NULL,
  `display_status` tinyint(1) NOT NULL,
  `date_listed` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `contact_email` varchar(100) collate utf8_unicode_ci NOT NULL,
  `contact_phone_number` varchar(15) collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY  (`home_id`),
  KEY `customer_id` (`customer_id`),
  KEY `city` (`city`),
  KEY `num_of_beds` (`num_of_beds`),
  KEY `num_of_baths` (`num_of_baths`),
  KEY `geolat` (`geolat`),
  KEY `geolng` (`geolng`),
  KEY `account_type_id` (`account_type_id`),
  KEY `display_status` (`display_status`),
  KEY `sqft` (`sqft`),
  KEY `price` (`price`),
  KEY `primary_photo_group_id` (`primary_photo_group_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=8 ;

Using the following SQL:

EXPLAIN SELECT  homes.home_id,
                    address,
                    city,
                    state,
                    zip,
                    price,
                    sqft,
                    year_built,
                    account_type_id,
                    num_of_beds,
                    num_of_baths,
                    geolat,
                    geolng,
                    photo_id,
                    photo_url_dir
            FROM homes
            LEFT OUTER JOIN home_photos ON homes.home_id = home_photos.home_id
                AND homes.primary_photo_group_id = home_photos.home_photo_group_id
                AND home_photos.home_photo_type_id = 2
            WHERE homes.display_status = true
            AND homes.geolat BETWEEN -100 AND 100
            AND homes.geolng BETWEEN -100 AND 100

EXPLAIN returns:

id  select_type  table        type  possible_keys                                    key                  key_len  ref     rows  Extra
----------------------------------------------------------------------------------------------------------
1   SIMPLE       homes        ref   geolat,geolng,display_status                     display_status       1        const   2     Using where
1  SIMPLE        home_photos  ref   home_id,home_photo_type_id,home_photo_group_id   home_photo_group_id  4        homes.primary_photo_group_id   4  

I don't quite understand how to read the EXPLAIN command. Does this look good or bad. Right now, I am NOT using a composite index for geolat and geolng. Should I be?

9条回答
情到深处是孤独
2楼-- · 2019-01-01 15:33

I'm with @Mitch, depends entirely your queries. Fortunately you can create and drop indexes at any time, and you can prepend the EXPLAIN keyword to your queries to see if the query analyzer uses the indexes.

If you'll be looking up an exact lat/long pair this index would likely make sense. But you're probably going to be looking for homes within a certain distance of a particular place, so your queries will look something like this (see source):

select *, sqrt(  pow(h2.geolat - h1.geolat,  2) 
               + pow(h2.geolng - h1.geolng, 2) ) as distance
from homes h1, homes h2
where h1.home_id = 12345 and h2.home_id != h1.home_id
order by distance

and the index very likely won't be helpful at all. For geospatial queries, you need something like this.

Update: with this query:

SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???

The query analyzer could use an index on geolat alone, or an index on geolng alone, or possibly both indexes. I don't think it would use a composite index. But it's easy to try out each of these permutations on a real data set and then (a) see what EXPLAIN tells you and (b) measure the time the query really takes.

查看更多
零度萤火
3楼-- · 2019-01-01 15:36

There could be a misconception about what composite index does. Many people think that composite index can be used to optimise a search query as long as the where clause covers the indexed columns, in your case geolat and geolng. Let's delve deeper:

I believe your data on the coordinates of homes would be random decimals as such:

home_id  geolat  geolng
   1    20.1243  50.4521
   2    22.6456  51.1564
   3    13.5464  45.4562
   4    55.5642 166.5756
   5    24.2624  27.4564
   6    62.1564  24.2542
...

Since geolat and geolng values hardly repeat itself. A composite index on geolat and geolng would look something like this:

index_id  geolat  geolng
   1     20.1243  50.4521
   2     20.1244  61.1564
   3     20.1251  55.4562
   4     20.1293  66.5756
   5     20.1302  57.4564
   6     20.1311  54.2542
...

Therefore the second column of the composite index is basically useless! The speed of your query with a composite index is probably going to be similar to an index on just the geolat column.

As mentioned by Will, MySQL provides spatial extension support. A spatial point is stored in a single column instead of two separate lat lng columns. Spatial index can be applied to such a column. However, the efficiency could be overrated based on my personal experience. It could be that spatial index does not resolve the two dimensional problem but merely speed up the search using R-Trees with quadratic splitting.

The trade-off is that a spatial point consumes much more memory as it used eight-byte double-precision numbers for storing coordinates. Correct me if I am wrong.

查看更多
看淡一切
4楼-- · 2019-01-01 15:36

There is no Black and White, one size fits all answer.

You should use a composite index, when your query work load would benefit from one.

You need to profile your query work load in order to determine this.

A composite index comes into play when queries can be satisfied entirely from that index.

UPDATE (in response to edit to posted question): If you are selecting * from the table the composite index may be used, it may not. You will need to run EXPLAIN PLAN to be sure.

查看更多
登录 后发表回答