I'm working with a database schema that is running into scalability issues. One of the tables in the schema has grown to around 10 million rows, and I am exploring sharding and partitioning options to allow this schema to scale to much larger datasets (say, 1 billion to 100 billion rows). Our application must also be deployable onto several database products, including but not limited to Oracle, MS SQL Server, and MySQL.
This is a large problem in general, and I'd like to read up on what options are available. What resources are out there (books, whitepapers, web sites) for database sharding and partitioning strategies?
I agree with the other answers that you should look at your schema and indexes before resorting to sharding. 10 million rows is well within the capabilities of any of the major database engines.
However if you want some resources for learning about the subject of sharding then try these:
- Scalability Best Practices: Lessons from eBay
- Randy Shoup on eBay's Architectural Principles - Video and Presentation
- High Scalability Site
- Mr. Moore gets to punt on sharding (when not to do it)
I agree with Mike Woodhouse's observation that the current size should not be an issue - and the questioner agrees.
Most of the commercial DBMS provide support for fragmented tables in some for or another, under one name or several others. One of the key questions is whether there is a sensible way of splitting the data into fragments. One common way is to do so based on a date, so all the values for, say, November 2008 go in one fragment, those for October 2008 into another, and so on. This has advantages when it comes time to remove old data. You can probably drop the fragment containing data from October 2001 (seven years data retention) without affecting the other fragments. This sort of fragmentation can also help with 'fragment elimination'; if the query clearly cannot need to read the data from a given fragment, then it will be left unread, which can give you a magnificent performance benefit. (For example, if the optimizer knows that the query is for a date in October 2008, it will ignore all fragments except the one that contains the data from October 2008.)
There are other fragmentation techniques - round robin distributes the load across multiple disks, but means you cannot benefit from fragment elimination.
10 million rows is really not large in DBMS terms and I'd be looking first at my indexing and query plans before starting to plan a physical distribution of data with shards or partitions, which shouldn't really be necessary until your table's grown by a couple of orders of magnitude.
All IMHO, of course.
In my experience, large tables always hit you on the I/O side. The cheapest solution is to add enough multi-column indexes so that all your queries can get the data directly from index, without having to load the main data pages. This makes your inserts and updates more I/O intensive, but this may be OK.
The next easy option it max out RAM in your server. No reason to have less than 32GB if your database is big.
But at the end you will still find yourself I/O bound, and you'll be looking at buying a lot of hard drives and maintaining a complex partitioning scheme, which costs a fortune between hardware and labor. I hope that there is a better alternative these days - move the database from spinning hard drives to SLC solid state drives - this should make your random reads and writes a hundred times faster than top of the line SAS drives, and remove the I/O bottleneck. SSDs start at $10 per gigabyte, so you're going to spend a few grand but it's still much cheaper than SANs, etc.