Best data store for billions of rows

Posted 2019-01-04 16:11

I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).

The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net.

I'm a SQL Server guy and I think SQL Server can do this, but with all the talk about BigTable, CouchDB, and other NoSQL solutions, it's sounding more and more like an alternative to a traditional RDBMS may be best, due to optimizations for distributed queries and scaling. I tried Cassandra, but the .net libraries either don't currently compile or are all subject to change (along with Cassandra itself).

I've looked into many of the NoSQL data stores available, but can't find one that meets my needs as a robust, production-ready platform.

If you had to store 36 billion small, flat records so that they're accessible from .net, what would you choose and why?

10 Answers
孤傲高冷的网名
#2 · 2019-01-04 16:19

The following article discusses the import and use of a 16-billion-row table in Microsoft SQL Server: http://sqlmag.com/t-sql/adventures-big-data-how-import-16-billion-rows-single-table

From the article:

Here are some distilled tips from my experience:

  • The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some point, it becomes too slow to be practical.
  • If you want to export your table to the smallest possible file, make it native format. This works best with tables containing mostly numeric columns because they’re more compactly represented in binary fields than character data. If all your data is alphanumeric, you won’t gain much by exporting it in native format. Not allowing nulls in the numeric fields can further compact the data. If you allow a field to be nullable, the field’s binary representation will contain a 1-byte prefix indicating how many bytes of data will follow.
  • You can’t use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn’t able to find any reference to this on MSDN or the Internet. If your table consists of more than 2,147,483,647 records, you’ll have to export it in chunks or write your own export routine.
  • Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original table size before completion.
  • When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many records to commit at a time. If you don’t include this parameter, your entire file is imported as a single transaction, which requires a lot of log space.
  • The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK INSERT statement with the ORDER parameter (see the sketch after this list).
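To make the last two tips concrete, here is a minimal sketch of a batched, pre-sorted load. The table name dbo.BigTable, its clustered key Id, the file path, and the batch size are illustrative assumptions, not taken from the article; the data file is assumed to have been exported in native format (bcp -n) and pre-sorted on the clustered key.

    -- Hedged sketch: ORDER tells SQL Server the file already matches the
    -- clustered index so it can skip the sort, and BATCHSIZE commits in
    -- chunks so the whole file isn't one giant transaction in the log.
    BULK INSERT dbo.BigTable
    FROM 'D:\load\bigtable_sorted.dat'
    WITH (
        DATAFILETYPE = 'native',   -- file produced by bcp ... -n
        BATCHSIZE    = 500000,     -- commit every 500,000 rows
        ORDER (Id),                -- file is pre-sorted on the clustered key
        TABLOCK                    -- helps enable minimal logging where possible
    );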
兄弟一词,经得起流年.
#3 · 2019-01-04 16:25

"I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).

The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net."

I can tell you from experience that this is possible in SQL Server, because I did it in early 2009 ... and it's still in operation to this day and quite fast.

The table was partitioned into 256 partitions (keep in mind this was SQL Server 2005), and we did exactly what you're describing: store bits of info by GUID and retrieve them by GUID quickly.

When I left we had around 2-3 billion records, and data retrieval was still quite good (1-2 seconds through the UI, or less if hitting the RDBMS directly), even though the data retention policy was only just about to be put in place.

So, long story short: I took the 8th char (i.e. somewhere in the middle-ish) of the GUID string, SHA1-hashed it, cast the result to a tinyint (0-255), stored the row in the corresponding partition, and used the same function call when getting the data back.
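Here is a rough sketch of that scheme as I understand it. All object names (pfGuidHash, psGuidHash, dbo.GuidStore) are made up, the boundary list is abbreviated, and everything maps to [PRIMARY] just to keep the example short; the key point is that the bucket is a persisted computed column, so the exact same expression drives both the insert and the lookup.

    -- 256-way partition function on a tinyint bucket (boundary list shortened
    -- here; in practice it would enumerate 0 through 254).
    CREATE PARTITION FUNCTION pfGuidHash (TINYINT)
        AS RANGE LEFT FOR VALUES (0, 1, 2 /* ..., 254 */);
    GO

    CREATE PARTITION SCHEME psGuidHash
        AS PARTITION pfGuidHash ALL TO ([PRIMARY]);
    GO

    -- Bucket = first byte of SHA1(8th character of the GUID string), i.e. a
    -- value in 0-255, persisted so it can be used as the partitioning column.
    CREATE TABLE dbo.GuidStore (
        RecordGuid UNIQUEIDENTIFIER NOT NULL,
        Payload    VARBINARY(80)    NOT NULL,   -- the 50-75 byte record
        Bucket     AS CAST(SUBSTRING(HASHBYTES('SHA1',
                        SUBSTRING(CONVERT(CHAR(36), RecordGuid), 8, 1)),
                        1, 1) AS TINYINT) PERSISTED NOT NULL
    ) ON psGuidHash (Bucket);
    GO

    CREATE CLUSTERED INDEX cx_GuidStore
        ON dbo.GuidStore (Bucket, RecordGuid)
        ON psGuidHash (Bucket);
    GO

    -- Lookup: recompute the same bucket from the incoming GUID so only one
    -- partition has to be touched, e.g.
    --   SELECT Payload FROM dbo.GuidStore
    --   WHERE RecordGuid = @guid
    --     AND Bucket = CAST(SUBSTRING(HASHBYTES('SHA1',
    --                    SUBSTRING(CONVERT(CHAR(36), @guid), 8, 1)), 1, 1) AS TINYINT);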

ping me if you need more info...

做自己的国王
#4 · 2019-01-04 16:25

Store the records in plain binary files, one file per GUID; it wouldn't get any faster than that.

劳资没心,怎么记你
#5 · 2019-01-04 16:35

Contrary to popular belief, NoSQL is not about performance, or even scalability. It's mainly about minimizing the so-called Object-Relational impedance mismatch, but is also about horizontal scalability vs. the more typical vertical scalability of an RDBMS.

For the simple requirement of fast inserts and fast lookups, almost any database product will do. If you want to add relational data, or joins, or have any complex transactional logic or constraints you need to enforce, then you want a relational database. No NoSQL product can compare.

If you need schemaless data, you'd want to go with a document-oriented database such as MongoDB or CouchDB. The loose schema is the main draw of these; I personally like MongoDB and use it in a few custom reporting systems. I find it very useful when the data requirements are constantly changing.

The other main NoSQL option is distributed Key-Value Stores such as BigTable or Cassandra. These are especially useful if you want to scale your database across many machines running commodity hardware. They work fine on servers too, obviously, but don't take advantage of high-end hardware as well as SQL Server or Oracle or other databases designed for vertical scaling, and obviously, they aren't relational and are no good for enforcing normalization or constraints. Also, as you've noticed, .NET support tends to be spotty at best.

All relational database products support partitioning of a limited sort. They are not as flexible as BigTable or other DKVS systems and they don't partition easily across hundreds of servers, but that really doesn't sound like what you're looking for. They are quite good at handling record counts in the billions, as long as you index and normalize the data properly, run the database on powerful hardware (especially SSDs if you can afford them), and partition across 2 or 3 or 5 physical disks if necessary.
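As a sketch of what that looks like in SQL Server terms: a partition scheme can spread partitions over filegroups placed on separate physical disks. The filegroup names (FG1-FG3), the month-based ranges, and the column choice are assumptions for illustration only, not something this answer specifies.

    -- Assumes filegroups FG1, FG2, FG3 were created on different physical
    -- drives when the database was set up; all names are illustrative.
    CREATE PARTITION FUNCTION pfByMonth (DATETIME)
        AS RANGE RIGHT FOR VALUES ('20190201', '20190301', '20190401');
    GO

    -- Four partitions from three boundaries, rotated over three filegroups.
    -- A date-based scheme like this also makes aging out old data cheaper.
    CREATE PARTITION SCHEME psByMonth
        AS PARTITION pfByMonth TO (FG1, FG2, FG3, FG1);
    GO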

If you meet the above criteria, i.e. you're working in a corporate environment and have money to spend on decent hardware and database optimization, I'd stick with SQL Server for now. If you're pinching pennies and need to run this on low-end Amazon EC2 cloud computing hardware, you'd probably want to opt for Cassandra or Voldemort instead (assuming you can get either to work with .NET).

神经病院院长
#6 · 2019-01-04 16:35

There is an unusual fact that seems to be overlooked.

"Basically after inserting 30Mil rows in a day, I need to fetch all the rows with the same GUID (maybe 20 rows) and be reasonably sure I'd get them all back"

With only ~20 rows needed per GUID, a non-clustered index on the GUID will work just fine. You could cluster on another column for data dispersion across partitions.
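A minimal sketch of that layout, with hypothetical names: the clustered key is an append-only surrogate (an identity column), so the heavy insert stream lands at the end of the clustered index, and the "all rows for one GUID" query is served entirely by a covering nonclustered index. The column names and types are assumptions based on the question (a 50-75 byte payload keyed by GUID).

    -- Hypothetical table: RecordId gives an ever-increasing clustered key.
    CREATE TABLE dbo.Records (
        RecordId   BIGINT IDENTITY(1,1) NOT NULL,
        InsertedAt DATETIME             NOT NULL DEFAULT GETUTCDATE(),
        RecordGuid UNIQUEIDENTIFIER     NOT NULL,
        Payload    VARBINARY(80)        NOT NULL,   -- the 50-75 byte record
        CONSTRAINT pk_Records PRIMARY KEY CLUSTERED (RecordId)
    );
    GO

    -- Covers the "fetch the ~20 rows sharing one GUID" lookup without
    -- touching the clustered index at all.
    CREATE NONCLUSTERED INDEX ix_Records_Guid
        ON dbo.Records (RecordGuid)
        INCLUDE (Payload, InsertedAt);
    GO

    -- SELECT Payload, InsertedAt FROM dbo.Records WHERE RecordGuid = @guid;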

I have a question regarding the data insertion: How is it being inserted?

  • Is this a bulk insert on a certain schedule (per min, per hour, etc)?
  • What source is this data being pulled from (flat files, OLTP, etc)?

I think these need to be answered to help understand one side of the equation.

劳资没心,怎么记你
#7 · 2019-01-04 16:36

Very few people work at the multi-billion-row set size, and most of the time when I see a request like this on Stack Overflow, the data is nowhere near the size being reported.

36 billion at 3 billion per month is roughly 100 million per day, 4.16 million an hour, ~70k rows per minute, or about 1.1k rows a second coming into the system, sustained for 12 months, assuming no downtime.

Those figures are far from impossible (I've done larger systems), but you want to double-check that those really are the quantities you mean - very few apps really have this volume.

In terms of storing / retrieving, quite a critical aspect you have not mentioned is aging out the older data - deletion is not free.

The normal technology to look at is partitioning; however, the lookup / retrieval being GUID-based would result in poor performance, assuming you have to get every matching value across the whole 12-month period. You could place a clustered index on the GUID column to get your associated data clustered for read / write, but at those quantities and that insertion speed the fragmentation will be far too high to support, and it will fall on the floor.

I would also suggest that you are going to need a very decent hardware budget if this is a serious application with OLTP-type response speeds; by some approximate guesses, and assuming very little indexing overhead, that is about 2.7 TB of data (36 billion rows at ~75 bytes each).

In the SQL Server camp, the only thing that you might want to look at is the new Parallel Data Warehouse edition (Madison), which is designed more for sharding out data and running parallel queries against it to provide high speed against large data marts.
