I'm working on a problem that requires paginating a very large dataset and caching the paginated "search" results.
The search works as follows: given an item_id, I find the matching item_ids and their rank.
I'm willing to concede not showing my users any results past, say, 500. After 500, I'm going to assume they're not going to find what they're looking for... the results are sorted in order of match anyway. So I want to cache these 500 results so I only have to do the heavy lifting of the query once, and users can still page the results (up to 500).
Now, suppose I use an intermediate MySQL table as my cache... that is, I store the top 500 results for each item in a "matches" table, like so: "item_id (INTEGER), matched_item_id (INTEGER), match_rank (REAL)". The search then becomes extremely fast (a sketch of the table definition follows the query):
SELECT item.* FROM item, matches
WHERE matches.item_id=<item in question>
AND item.id=matches.matched_item_id
ORDER BY match_rank DESC
LIMIT x,y
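For concreteness, the matches table might be created like this. The column names and types come from my description above; the composite primary key, the (item_id, match_rank) index, and the indexed_at column are assumptions I'd add to support the query above and the 24-hour refresh below:

CREATE TABLE matches (
  item_id         INTEGER   NOT NULL,
  matched_item_id INTEGER   NOT NULL,
  match_rank      REAL      NOT NULL,
  indexed_at      TIMESTAMP NOT NULL,  -- assumed extra column: when these results were cached
  PRIMARY KEY (item_id, matched_item_id),
  KEY idx_item_rank (item_id, match_rank)  -- serves the WHERE + ORDER BY above
);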
I'd have no problem reindexing items and their matches into this table as they are requested by clients if the results are older than, say, 24 hours. The problem is that, storing 500 results for N items (where N is ~100,000 to 1,000,000), this table becomes rather large: 50,000,000 - 500,000,000 rows.
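The 24-hour staleness check itself could be as simple as this (a sketch, relying on the indexed_at column assumed in the table definition above):

SELECT MAX(indexed_at) FROM matches WHERE matches.item_id=<item in question>;
-- if that's older than 24 hours (or NULL), purge and re-insert the fresh top 500:
DELETE FROM matches WHERE matches.item_id=<item in question>;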
Can MySQL handle this? What should I look out for?
MySQL can handle this many rows, and there are several techniques to scale when you are starting to hit the wall. Partitioning and replication are the main solutions for this scenario.
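For example, on MySQL 5.1+ native partitioning can split the matches table by item so that each lookup only touches one partition. This is just a sketch against the table layout described in the question, not something to apply without measuring (partitioning requires the partition key - item_id here - to be part of every unique key on the table):

ALTER TABLE matches
PARTITION BY HASH(item_id)
PARTITIONS 16;

Hashing on item_id keeps all 500 rows for a given item in the same partition, which suits a query that always filters on item_id.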
You can also check additional scaling techniques for MySQL in a question I previously asked here on stackoverflow.
As others have said, MySQL can easily scale to accommodate very large data sets, and quite often it will handle large sets (a few million rows) without much intervention from the developer / dba beyond a little bit of sensible indexing and query optimisation. @doofledorer is correct in avoiding premature optimisation. As the 37 Signals guys say, if your app is a runaway success and you're hitting database issues - well, that's a great place to be.
I would however, rebut this question with one of my own - do you really need to use MySQL as your caching system? There are plenty of places to hold a list of 500 ints, and my first choice would be server side in the session. Even if the session data is written to disc, loading that array of 500 ints isn't going to be that slow - and there are plenty of strategies for using in-memory caches (such as MemCache) to speed that up further.
Looping through your session-stored array and performing 10 or 20 (or however many per page) individual queries along the lines of "select item.* where id = X" might sound scary - certainly it's going to raise the physical number of queries, but it's going to be lightning quick, especially with a touch of MySQL query caching thrown in.
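To make that concrete: if the session holds the 500 matched ids in rank order, showing page two of a 20-per-page listing just means slicing out ids 21-40 and running, for each one, something like (a sketch; the item table and id column are taken from the question):

SELECT item.* FROM item WHERE item.id=<one id from the session array>;

Each of these is a primary-key lookup, and because popular items get requested again and again with identical statements, the MySQL query cache can answer many of them without touching the table at all.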
Edit: Sam's comments highlighted something I forgot: if you use a session-based approach, you immediately gain benefits from the fact that session data is tied to the session's lifetime. You don't have to worry about purging expired data - when the session ends, poof, it's gone. And if you stick with disc-based sessions (I'm working on the assumption of PHP as the server-side language here), remember that disc space is incredibly cheap.
At the end of the day it becomes a trade-off between ease of use (in development / maintenance terms), scalability and performance. I would just say that you should bear in mind that just because you are dealing with the results of a database query, it doesn't mean that a database is the best place to store those results in all cases - keep an open mind!
Agreed with above. Be very careful to avoid premature optimization by denormalization here.
Don't use "SELECT *". More fields means more disk reads.
Make sure you are using covering indexes - i.e. you can get all your requested field values from the index without going to the data table (see the sketch after this list). Double-check that you aren't reading record data.
Test, test, test.
If possible, use a write-only table (i.e. no updates and no deletes) so MySQL isn't reusing deleted space and refilling indexes.
Make sure indexed fields are as short as possible (but no shorter.)
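To make the covering-index point concrete: with an index like the one below, the matches side of the query in the question can be answered entirely from the index, never reading the row data - EXPLAIN should show "Using index" for that table. This is a sketch against the table layout described in the question, and the index name is arbitrary:

ALTER TABLE matches
ADD INDEX idx_covering (item_id, match_rank, matched_item_id);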
EDIT: Some more things have come to mind ...
Standard (and fastest) MyISAM table types don't have any way to maintain records in any sequence other than insertion order (modified by filling deleted rows) - i.e. no clustered indexes. But you can fake it if you periodically copy/rebuild the table based on an index that's useful for grouping associated records on the same page (see the sketch after this list). Sure, new records won't comply, but 98% of table efficiency is better than the default.
Become intimately familiar with the configuration settings, especially the cache sizes. In fact, to make it easy, don't worry about any settings other than the cache sizes (and understand what each of them does).
Become intimately familiar with the info in the stats log, as it applies to the effectiveness of the config cache settings.
Run the "slow query log" all the time. It's low overhead, and it's the first stop in all remediation.
This goes without saying, but don't run anything but the database on the same server. One big reason is to be able to optimize resources for the database only.
DO NOT denormalize until things are about to fall apart.
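To illustrate the periodic copy/rebuild trick mentioned in the clustering point above, the whole thing can be a single statement (a sketch; MyISAM only, it rewrites and locks the table while it runs, so treat it as an off-peak maintenance job, and the physical order degrades again as new rows arrive):

ALTER TABLE matches ORDER BY item_id, match_rank;

This rewrites the table so that all the matches for a given item end up on adjacent pages, which is exactly the grouping that makes the paging query cheap.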
Non-negotiables.
Everything above this line is questionable advice. Never take any advice without understanding it and testing it. There are two sides to every design decision; and the MySQL online advice is worse than average at making generalizations without qualification and without scaling the benefits and penalties. Question everything I've noted here, as well. Understand what you're doing, why you are doing it, and what advantages you expect to get. Measure the changes to see if what was expected was what happened.
Never, never "try some things to see what happens". Doing so is like tuning a car with multiple carburetors, except worse. If what you expected didn't happen, back out the change and either figure it out, or work on something else you do understand. Sleep is your friend; much of this will come to you overnight after hard sessions of testing.
You'll never understand it all; you'll always need to learn more than you know. Always ask "Why" and "What's your proof". (Often it's something someone read that doesn't apply to your situation.)
MySQL can handle it. The real question is: can it handle it in a reasonable amount of time? That depends on your query. Like Eran Galperin said in his answer, look into partitioning and replication for optimization.