MySQL indexes - what are the best practices accord

2020-07-06 06:56发布

问题:

i have this table (500,000 row)

CREATE TABLE IF NOT EXISTS `listings` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `type` tinyint(1) NOT NULL DEFAULT '1',
  `hash` char(32) NOT NULL,
  `source_id` int(10) unsigned NOT NULL,
  `link` varchar(255) NOT NULL,
  `short_link` varchar(255) NOT NULL,
  `cat_id` mediumint(5) NOT NULL,
  `title` mediumtext NOT NULL,
  `description` mediumtext,
  `content` mediumtext,
  `images` mediumtext,
  `videos` mediumtext,
  `views` int(10) unsigned NOT NULL,
  `comments` int(11) DEFAULT '0',
  `comments_update` int(11) NOT NULL DEFAULT '0',
  `editor_id` int(11) NOT NULL DEFAULT '0',
  `auther_name` varchar(255) DEFAULT NULL,
  `createdby_id` int(10) NOT NULL,
  `createdon` int(20) NOT NULL,
  `editedby_id` int(10) NOT NULL,
  `editedon` int(20) NOT NULL,
  `deleted` tinyint(1) NOT NULL,
  `deletedon` int(20) NOT NULL,
  `deletedby_id` int(10) NOT NULL,
  `deletedfor` varchar(255) NOT NULL,
  `published` tinyint(1) NOT NULL DEFAULT '1',
  `publishedon` int(11) unsigned NOT NULL,
  `publishedby_id` int(10) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `hash` (`hash`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8;

i'm thinking to make each query by the publishedon between x and y (show in all the site just records of 1 month)

in the same time, i want to add with the publishedon in the where clause published, cat_id , source_id

some thing like this:

SELECT * FROM listings 
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) 
  AND (published = 1) 
  AND (cat_id in(1,2,3,4,5)) 
  AND (source_id  in(1,2,3,4,5))

that query is ok and fast until now without indexing, but when trying to use order by publishedon its became too slow, so i used this index

CREATE INDEX `listings_pcs` ON listings(
    `publishedon` DESC,
    `published` ,
    `cat_id` ,
    `source_id`
)

it worked and the order by publishedon became fast, now i want to order by views like this

SELECT * FROM listings 
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) 
  AND (published = 1) 
  AND (cat_id in(1,2,3,4,5)) 
  AND (source_id  in(1,2,3,4,5)) 
ORDER BY views DESC

this is the explanation this query is too slow because of ORDER BY views DESC

then i'm tried to drop the old index and add this

CREATE INDEX `listings_pcs` ON listings(
    `publishedon` DESC,
    `published` ,
    `cat_id` ,
    `source_id`,
    `views` DESC
)

its too slow also

what about if i use just single index on publishedon? what about using single index on cat_id,source_id,views,publishedon?

i can change the query dependencies like publishedon in one month if i found other indexing method depend on any other columns

what about making index in (cat_id, source_id, publishedon, published) ? but in some cases i will use source_id only?

what is the best indexing schema for that table

回答1:

One important general note as to why your query isn't getting any faster despite your attempts is that DESC on indexes is not currently supported on MySQL. See this SO thread, and the source from which it comes.

In this case, your largest problem is in the sheer size of your record. If the engine decides it wouldn't really be faster to use an index, then it won't.

You have a few options, and all are actually pretty decent and can probably help you see significant improvement.

A note on SQL

First, I want to make a quick note about indexing in SQL. While I don't think it's the solution for your woes, it was your main question, and can help.

It usually helps me to think about indexing in three different buckets. The absolutely, the maybe, and the never. You certainly don't have anything in your indexing that's in the never column, but there are some I would consider "maybe" indexes.

absolutely: This is your primary key and any foreign keys. It is also any key you will reference on a very regular basis to pull a small set of data from the massive data you have.

maybe: These are columns which, while you may reference them regularly, are not really referenced by themselves. In fact, through analysis and using EXPLAIN as @Machavity recommends in his answer, you may find that by the time these columns are used to strip out fields, there aren't that many fields anyway. An example of a column that would solidly be in this pile for me would be the published column. Keep in mind that every INDEX adds to the work your queries need to do.

Also: Composite keys are a good choice when you're regularly searching for data based on two different columns. More on that later.

Options, options, options...

There are a number of options to consider, and each one has some drawbacks. Ultimately I would consider each of these on a case-by-case basis as I don't see any of these to be a silver bullet. Ideally, you'd test a few different solutions against your current setting and see which one runs the fastest using a nice scientific test.

  1. Split your SQL table into two or more separate tables.

This is one of the few times where, despite the number of columns in your table, I wouldn't rush to try to split your table into smaller chunks. If you decided to split it into smaller chunks, however, I'd argue that your [action]edon, [action]edby_id, and [action]ed could easily be put into another table, actions:

+-----------+-------------+------+-----+-------------------+----------------+
| Field     | Type        | Null | Key | Default           | Extra          |
+-----------+-------------+------+-----+-------------------+----------------+
| id        | int(11)     | NO   | PRI | NULL              | auto_increment |
| action_id | int(11)     | NO   |     | NULL              |                |
| action    | varchar(45) | NO   |     | NULL              |                |
| date      | datetime    | NO   |     | CURRENT_TIMESTAMP |                |
| user_id   | int(11)     | NO   |     | NULL              |                |
+-----------+-------------+------+-----+-------------------+----------------+

The downside to this is that it does not allow you to ensure there is only one creation date without a TRIGGER. The upside is that when you don't have to sort as many columns with as many indexes when you're sorting by date. Also, it allows you to sort not only be created, but also by all of your other actions.

Edit: As requested, here is a sample sorting query

SELECT * FROM listings 
INNER JOIN actions ON actions.listing_id = listings.id
WHERE (actions.action = 'published') 
  AND (listings.published = 1) 
  AND (listings.cat_id in(1,2,3,4,5)) 
  AND (listings.source_id  in(1,2,3,4,5)) 
  AND (actions.actiondate between 1441105258 AND 1443614458)
ORDER BY listings.views DESC

Theoretically, it should cut down on the number of rows you're sorting against because it's only pulling relevant data. I don't have a dataset like yours so I can't test it right now!

If you put a composite key on actiondate and listings.id, this should help to increase speed.

As I said, I don't think this is the best solution for you right now because I'm not convinced it's going to give you the maximum optimization. This leads me to my next suggestion:

  1. Create a month field

I used this nifty tool to confirm what I thought I understood of your question: You are sorting by month here. Your example is specifically looking between September 1st and September 30th, inclusive.

So another option is for you to split your integer function into a month, day, and year field. You can still have your timestamp, but timestamps aren't all that great for searching. Run an EXPLAIN on even a simple query and you'll see for yourself.

That way, you can just index the month and year fields and do a query like this:

SELECT * FROM listings 
WHERE (publishedmonth = 9)
  AND (publishedyear = 2015) 
  AND (published = 1) 
  AND (cat_id in(1,2,3,4,5)) 
  AND (source_id  in(1,2,3,4,5)) 
ORDER BY views DESC

Slap an EXPLAIN in front and you should see massive improvements.

Because you're planning on referring to a month and a day, you may want to add a composite key against month and year, rather than a key on both separately, for added gains.

Note: I want to be clear, this is not the "correct" way to do things. It is convenient, but denormalized. If you want the correct way to do things, you'd adapt something like this link but I think that would require you to seriously reconsider your table, and I haven't tried anything like this, having lacked the need, and, frankly, will, to brush up on my geometry. I think it's a little overkill for what you're trying to do.

  1. Do your heavy sorting elsewhere

This was hard for me to come to terms with because I like to do things the "SQL" way wherever possible, but that is not always the best solution. Heavy computing, for example, is best done using your programming language, leaving SQL to handle relationships.

The former CTO of Digg sorted using PHP instead of MySQL and received a 4,000% performance increase. You're probably not scaling out to this level, of course, so the performance trade-offs won't be clearcut unless you test it out yourself. Still, the concept is sound: the database is the bottleneck, and computer memory is dirt cheap by comparison.

There are doubtless a lot more tweaks that can be done. Each of these has a drawback and requires some investment. The best answer is to test two or more of these and see which one helps you get the most improvement.



回答2:

This query:

SELECT *
FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) AND
      (published = 1) AND
      (cat_id in (1,2,3,4,5)) AND
      (source_id in (1,2,3,4,5));

Is hard to optimize with only indexes. The best index is one that starts with published and then has the other columns -- it is not clear what their order should be. The reason is because all but published are not using =.

Because your performance problem is on a sort, that suggests that lots of rows are being returned. Typically, an index is used to satisfy the WHERE clause before the index can be used for the ORDER BY. That makes this hard to optimize.

Suggestions . . . None are that great:

  • If you are going to access the data by month, then you might consider partitioning the data by month. That will make the query without the ORDER BY faster, but won't help the ORDER BY.
  • Try various orders of columns after published in the index. You might find the most selective column(s). But, once again, this speeds the query before the sorting.
  • Think about ways that you can structure the query to have more equality conditions in the WHERE clause or to return a smaller set of data.
  • (Not really recommended) Put an index on published and the ordering column. Then use a subquery to fetch the data. Put the inequality conditions (IN and so on) in the outer query. The subquery will use the index for sorting and then filter the results.

The reason the last is not recommended is because SQL (and MySQL) do not guarantee the ordering of results from a subquery. However, because MySQL materializes subqueries, the results really are in order. I don't like using undocumented side effects, which can change from version to version.



回答3:

If I were you, I'd at least INDEX the fields in question individually. You're building multi-column indices but it's clear you're pulling a lot of disparate records as well. Having the columns indexed individually can't hurt.

Something you should do is use EXPLAIN which lets you look under the hood of how MySQL is pulling the data. It could further point to what is slowing your query down.

EXPLAIN SELECT * FROM listings 
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) 
  AND (published = 1) 
  AND (cat_id in(1,2,3,4,5)) 
  AND (source_id  in(1,2,3,4,5)) 
ORDER BY views DESC


回答4:

The rows of your table are enormous (all those mediumtext columns), so sorting SELECT * is going to have a lot of overhead. That's a simple reality of your schema design. SELECT * is generally considered harmful to performance. If you can enumerate the columns you need, and you can leave out some of the big ones, you'll get better performance.

You showed us a query with the following filter criteria

  1. single-value equality on published.
  2. range matching on publishedon.
  3. set matching on cat_id
  4. set matching on source_id.
  5. Ordering on views.

Due to the way MySQL indexing works on MyISAM, the following compound covering index will probably serve you well. It's hard to be sure unless you try it.

CREATE INDEX listings_x_pub_date_cover ON listings( 
     published, publishedon, cat_id, source_id, views, id )

To satisfy your query the MySQL engine will random-access the index at the appropriate value of published, and then at the begiining of the publishedon range. It will then scan through the index filtering on the other two filtering criteria. Finally, it sorts and and uses the id value to look up each row that passes the filter. Give it a try.

If that performance isn't good enough try this so-called deferred join operation.

SELECT a.*
  FROM listings a
  JOIN ( SELECT id, views
           FROM listings
          WHERE published = 1
            AND publishedon BETWEEN 1441105258
                                AND 1443614458
            AND cat_id IN (1,2,3,4,5)
            AND source_id IN (1,2,3,4,5)
          ORDER BY views DESC
       ) b ON a.id = b.id
 ORDER BY b.views DESC

This does the heavy lifting of ordering with just the id and views columns without having to shuffle all those massive text columns. It may or may not help, because the ordering has to be repeated in the outer query. This kind of thing DEFINITELY helps when you have ORDER BY ... LIMIT n pattern in your query, but you don't.

Finally, considering the size of these rows, you may get best performance by doing this inner query from your php program:

         SELECT id
           FROM listings
          WHERE published = 1
            AND publishedon BETWEEN 1441105258
                                AND 1443614458
            AND cat_id IN (1,2,3,4,5)
            AND source_id IN (1,2,3,4,5)
          ORDER BY views DESC

and then fetching the full rows of the table one-by-one using these id values in an inner loop. (This query that fetches just id values should be quite fast with the help of the index I mentioned.) The inner loop solution would be ugly, but if your text columns are really big (each mediumtext column can hold up to 16MiB) it's probably your best bet.

tl;dr. Create the index mentioned. Get rid of SELECT * if you possibly can, giving a list of columns you need instead. Try the deferred join query. If it's still not good enough try the nested query.