I have this table (500,000 rows):
CREATE TABLE IF NOT EXISTS `listings` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`type` tinyint(1) NOT NULL DEFAULT '1',
`hash` char(32) NOT NULL,
`source_id` int(10) unsigned NOT NULL,
`link` varchar(255) NOT NULL,
`short_link` varchar(255) NOT NULL,
`cat_id` mediumint(5) NOT NULL,
`title` mediumtext NOT NULL,
`description` mediumtext,
`content` mediumtext,
`images` mediumtext,
`videos` mediumtext,
`views` int(10) unsigned NOT NULL,
`comments` int(11) DEFAULT '0',
`comments_update` int(11) NOT NULL DEFAULT '0',
`editor_id` int(11) NOT NULL DEFAULT '0',
`auther_name` varchar(255) DEFAULT NULL,
`createdby_id` int(10) NOT NULL,
`createdon` int(20) NOT NULL,
`editedby_id` int(10) NOT NULL,
`editedon` int(20) NOT NULL,
`deleted` tinyint(1) NOT NULL,
`deletedon` int(20) NOT NULL,
`deletedby_id` int(10) NOT NULL,
`deletedfor` varchar(255) NOT NULL,
`published` tinyint(1) NOT NULL DEFAULT '1',
`publishedon` int(11) unsigned NOT NULL,
`publishedby_id` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I'm planning to have every query filter on publishedon between x and y (the whole site only shows records from one month). At the same time, I want to add published, cat_id, and source_id to the WHERE clause along with publishedon, something like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
That query is OK and fast so far without indexing, but when I try to ORDER BY publishedon it becomes too slow, so I added this index:
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`
)
It worked and the ORDER BY publishedon became fast. Now I want to order by views, like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
Here is the EXPLAIN output.
This query is too slow because of ORDER BY views DESC, so I tried dropping the old index and adding this one:
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`,
`views` DESC
)
It's too slow as well.
What if I use just a single index on publishedon?
What about a single index on (cat_id, source_id, views, publishedon)?
I can change the query's dependencies (e.g. restricting publishedon to one month) if I find another indexing method that depends on other columns.
What about an index on (cat_id, source_id, publishedon, published)? But in some cases I will use source_id only.
What is the best indexing scheme for this table?
If I were you, I'd at least INDEX the fields in question individually. You're building multi-column indices, but it's clear you're pulling a lot of disparate records as well. Having the columns indexed individually can't hurt.

Something you should do is use EXPLAIN, which lets you look under the hood of how MySQL is pulling the data. It could further point to what is slowing your query down.
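For instance, a minimal sketch that prefixes the slow query from the question:

EXPLAIN
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
  AND (published = 1)
  AND (cat_id IN (1,2,3,4,5))
  AND (source_id IN (1,2,3,4,5))
ORDER BY views DESC;

In the output, the key, rows, and Extra columns (look for "Using filesort") show which index is chosen, how many rows are examined, and where the sort happens.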
One important general note as to why your query isn't getting any faster despite your attempts is that DESC on indexes is not currently supported on MySQL. See this SO thread, and the source from which it comes.

In this case, your largest problem is the sheer size of your records. If the engine decides it wouldn't really be faster to use an index, then it won't.
You have a few options, and all are actually pretty decent and can probably help you see significant improvement.
A note on SQL
First, I want to make a quick note about indexing in SQL. While I don't think it's the solution for your woes, it was your main question, and can help.
It usually helps me to think about indexing in three different buckets: the absolutely, the maybe, and the never. You certainly don't have anything in your indexing that falls in the never bucket, but there are some I would consider "maybe" indexes.
absolutely: This is your primary key and any foreign keys. It is also any key you will reference on a very regular basis to pull a small set of data from the massive data you have.
maybe: These are columns which, while you may reference them regularly, are not really referenced by themselves. In fact, through analysis and using EXPLAIN as @Machavity recommends in his answer, you may find that by the time these columns are used to strip out fields, there aren't that many fields anyway. An example of a column that would solidly be in this pile for me would be the published column. Keep in mind that every INDEX adds to the work your queries need to do.

Also: composite keys are a good choice when you're regularly searching for data based on two different columns. More on that later.
Options, options, options...
There are a number of options to consider, and each one has some drawbacks. Ultimately I would consider each of these on a case-by-case basis as I don't see any of these to be a silver bullet. Ideally, you'd test a few different solutions against your current setting and see which one runs the fastest using a nice scientific test.
This is one of the few times where, despite the number of columns in your table, I wouldn't rush to try to split your table into smaller chunks. If you decided to split it into smaller chunks, however, I'd argue that your [action]edon, [action]edby_id, and [action]ed columns could easily be put into another table, actions:
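A minimal sketch of what such an actions table could look like (listing_id, action_type, and actionby_id are illustrative names; actiondate is the column referenced later in this answer):

CREATE TABLE IF NOT EXISTS `actions` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `listing_id` int(10) unsigned NOT NULL,   -- points back to listings.id
  `action_type` varchar(20) NOT NULL,       -- e.g. 'created', 'edited', 'deleted', 'published'
  `actiondate` int(11) unsigned NOT NULL,   -- unix timestamp, like the [action]edon columns
  `actionby_id` int(10) NOT NULL,           -- who performed the action
  PRIMARY KEY (`id`),
  KEY `listing_action` (`listing_id`, `actiondate`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;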
The downside to this is that it does not allow you to ensure there is only one creation date without a TRIGGER. The upside is that you don't have to sort as many columns with as many indexes when you're sorting by date. Also, it allows you to sort not only by created, but also by all of your other actions.

Edit: As requested, here is a sample sorting query.
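A sketch of such a sorting query, assuming the actions table outlined above (listing_id and action_type are illustrative names):

SELECT l.*
FROM listings l
JOIN actions a ON a.listing_id = l.id
WHERE a.action_type = 'published'
  AND a.actiondate BETWEEN 1441105258 AND 1443614458
  AND l.cat_id IN (1,2,3,4,5)
  AND l.source_id IN (1,2,3,4,5)
ORDER BY a.actiondate DESC;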
Theoretically, it should cut down on the number of rows you're sorting against because it's only pulling relevant data. I don't have a dataset like yours so I can't test it right now!
If you put a composite key on actiondate and listings.id, this should help to increase speed.

As I said, I don't think this is the best solution for you right now because I'm not convinced it's going to give you the maximum optimization. This leads me to my next suggestion:
I used this nifty tool to confirm what I thought I understood of your question: you are filtering by month here. Your example is specifically looking between September 1st and September 30th, inclusive.
So another option is for you to split your integer timestamp into month, day, and year fields. You can still have your timestamp, but timestamps aren't all that great for searching. Run an EXPLAIN on even a simple query and you'll see for yourself.

That way, you can just index the month and year fields and do a query like this:
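A sketch of that query, assuming the new columns are named month and year (names are illustrative); the values cover September 2015, matching the timestamps in the question:

SELECT * FROM listings
WHERE month = 9
  AND year = 2015
  AND published = 1
  AND cat_id IN (1,2,3,4,5)
  AND source_id IN (1,2,3,4,5);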
Slap an EXPLAIN in front and you should see massive improvements.

Because you're planning on referring to a month and a year, you may want to add a composite key against month and year, rather than a key on each separately, for added gains.
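For instance (a sketch, assuming the new year and month columns and an illustrative index name):

ALTER TABLE listings ADD INDEX `listings_year_month` (`year`, `month`);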
Note: I want to be clear, this is not the "correct" way to do things. It is convenient, but denormalized. If you want the correct way to do things, you'd adapt something like this link, but I think that would require you to seriously reconsider your table, and I haven't tried anything like this, having lacked the need (and, frankly, the will) to brush up on my geometry. I think it's a little overkill for what you're trying to do.
This was hard for me to come to terms with because I like to do things the "SQL" way wherever possible, but that is not always the best solution. Heavy computing, for example, is best done using your programming language, leaving SQL to handle relationships.
The former CTO of Digg sorted using PHP instead of MySQL and received a 4,000% performance increase. You're probably not scaling out to this level, of course, so the performance trade-offs won't be clearcut unless you test it out yourself. Still, the concept is sound: the database is the bottleneck, and computer memory is dirt cheap by comparison.
There are doubtless a lot more tweaks that can be done. Each of these has a drawback and requires some investment. The best answer is to test two or more of these and see which one helps you get the most improvement.
The rows of your table are enormous (all those mediumtext columns), so sorting SELECT * is going to have a lot of overhead. That's a simple reality of your schema design. SELECT * is generally considered harmful to performance. If you can enumerate the columns you need, and you can leave out some of the big ones, you'll get better performance.

You showed us a query with the following filter criteria: published, publishedon, cat_id, and source_id.

Due to the way MySQL indexing works on MyISAM, the following compound covering index will probably serve you well. It's hard to be sure unless you try it.
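A sketch of what that compound covering index could look like, reconstructed from the access pattern described next: published first for the equality test, then publishedon for the range, then the two IN columns, with views and id included so the id-only queries below can be served from the index (the index name is illustrative):

ALTER TABLE listings
  ADD INDEX `listings_covering` (`published`, `publishedon`, `cat_id`, `source_id`, `views`, `id`);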
To satisfy your query the MySQL engine will random-access the index at the appropriate value of published, and then at the beginning of the publishedon range. It will then scan through the index, filtering on the other two criteria. Finally, it sorts and uses the id value to look up each row that passes the filter. Give it a try.

If that performance isn't good enough, try this so-called deferred join operation.
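A sketch of what that deferred join could look like: the inner query does the ordering using only the id and views columns, and the outer query joins back to listings to fetch the wide rows:

SELECT l.*
FROM (
    SELECT id, views
    FROM listings
    WHERE publishedon BETWEEN 1441105258 AND 1443614458
      AND published = 1
      AND cat_id IN (1,2,3,4,5)
      AND source_id IN (1,2,3,4,5)
    ORDER BY views DESC
) AS chosen
JOIN listings l ON l.id = chosen.id
ORDER BY chosen.views DESC;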
This does the heavy lifting of ordering with just the id and views columns, without having to shuffle all those massive text columns. It may or may not help, because the ordering has to be repeated in the outer query. This kind of thing DEFINITELY helps when you have an ORDER BY ... LIMIT n pattern in your query, but you don't.

Finally, considering the size of these rows, you may get the best performance by running this inner query from your PHP program:
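A sketch of that id-only inner query (same filters as before, with the ordering still done by the database):

SELECT id
FROM listings
WHERE publishedon BETWEEN 1441105258 AND 1443614458
  AND published = 1
  AND cat_id IN (1,2,3,4,5)
  AND source_id IN (1,2,3,4,5)
ORDER BY views DESC;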
and then fetching the full rows of the table one by one, using these id values in an inner loop. (This query that fetches just id values should be quite fast with the help of the index I mentioned.) The inner-loop solution would be ugly, but if your text columns are really big (each mediumtext column can hold up to 16 MiB) it's probably your best bet.

tl;dr: Create the index mentioned. Get rid of SELECT * if you possibly can, giving a list of the columns you need instead. Try the deferred join query. If it's still not good enough, try the nested query.
The query in the question is hard to optimize with only indexes. The best index is one that starts with published and then has the other columns -- it is not clear what their order should be. The reason is that all but published are not using =.

Because your performance problem is on a sort, that suggests that lots of rows are being returned. Typically, an index is used to satisfy the WHERE clause before the index can be used for the ORDER BY. That makes this hard to optimize.

Suggestions . . . none are that great:
- Partition the data by month, since you always query one month at a time. That will make the query without the ORDER BY faster, but won't help the ORDER BY.
- Try different orderings of the columns after published in the index. You might find the most selective column(s). But, once again, this speeds the query before the sorting.
- Think about ways to restructure the query so it has a more restrictive WHERE clause or returns a smaller set of data.
- Build an index on published and the ordering column. Then use a subquery to fetch the data. Put the inequality conditions (IN and so on) in the outer query. The subquery will use the index for sorting and then filter the results.

The reason the last is not recommended is that SQL (and MySQL) do not guarantee the ordering of results from a subquery. However, because MySQL materializes subqueries, the results really are in order. I don't like using undocumented side effects, which can change from version to version.
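A sketch of that last suggestion, assuming a supporting index on (published, views) with an illustrative name; the subquery relies on the index for the sort, and the inequality conditions sit in the outer query:

-- assumed supporting index
-- CREATE INDEX `listings_pub_views` ON listings (`published`, `views`);
SELECT *
FROM (
    SELECT *
    FROM listings
    WHERE published = 1
    ORDER BY views DESC
) AS sorted
WHERE publishedon BETWEEN 1441105258 AND 1443614458
  AND cat_id IN (1,2,3,4,5)
  AND source_id IN (1,2,3,4,5);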