I need to be able to store small bits of data (approximately 50-75 bytes) for billions of records (~3 billion/month for a year).
The only requirement is fast inserts and fast lookups for all records with the same GUID and the ability to access the data store from .net.
I'm a SQL Server guy and I think SQL Server can do this, but with all the talk about BigTable, CouchDB, and other NoSQL solutions, it's sounding more and more like an alternative to a traditional RDBMS may be best, due to optimizations for distributed queries and scaling. I tried Cassandra, but the .NET libraries don't currently compile or are all subject to change (along with Cassandra itself).
I've looked into many of the NoSQL data stores available, but can't find one that meets my needs as a robust, production-ready platform.
If you had to store 36 billion small, flat records so that they're accessible from .NET, what would you choose and why?
You can use MongoDB with the GUID as the sharding key. This means you can distribute your data over multiple machines, but the data you want to select lives on only one machine, because you select by the sharding key.
Sharding in MongoDB is not yet production ready.
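Not something the answer spells out, but as a rough illustration of that design from .NET, assuming a sharded cluster reached through a mongos router, the official MongoDB C# driver, and made-up database/collection names:

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

class MongoGuidStore
{
    static void Main()
    {
        // One-time cluster setup (from the mongo shell, illustrative):
        //   sh.enableSharding("events")
        //   sh.shardCollection("events.records", { guid: 1 })
        var client  = new MongoClient("mongodb://mongos-host:27017"); // hypothetical mongos router
        var records = client.GetDatabase("events").GetCollection<BsonDocument>("records");

        // Insert: the document lands on whichever shard owns this guid's chunk.
        var guid = Guid.NewGuid().ToString();
        records.InsertOne(new BsonDocument
        {
            { "guid",    guid },
            { "payload", new byte[64] }   // stand-in for the ~50-75 byte record body
        });

        // Lookup: filtering on the shard key lets mongos target a single shard.
        var matches = records.Find(Builders<BsonDocument>.Filter.Eq("guid", guid)).ToList();
        Console.WriteLine($"Found {matches.Count} record(s) for {guid}");
    }
}
```

Because the query filters on the shard key, mongos can route it to one shard rather than broadcasting it to all of them, which is exactly the property this answer relies on.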
Storing ~3.5 TB of data and inserting about 1K/sec 24x7, while also querying at an unspecified rate, is possible with SQL Server (36 billion records at roughly 100 bytes each, once you count the GUID key and row overhead, is on the order of 3.5 TB, and 3 billion/month works out to about 1,150 inserts/second). But there are more questions to ask first about what you really need.
If you need all of these requirements, the load you propose is going to cost millions in hardware and licensing on a relational system, any system, no matter what gimmicks you try (sharding, partitioning, etc.). A NoSQL system would, by its very definition, not meet all of them either.
So obviously you have already relaxed some of these requirements. There is a nice visual guide comparing the NoSQL offerings based on the 'pick 2 out of 3' (CAP) paradigm: Visual Guide to NoSQL Systems.
Update after the OP's comment:
With SQL Server this would be a straightforward implementation: a single table, partitioned and page compressed (a sketch of the DDL follows below).
Partitioning and page compression each require Enterprise Edition SQL Server; they will not work on Standard Edition, and both are quite important to meet the requirements.
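As an illustration only, here is roughly what that could look like when created from .NET; the partition boundaries, column names, and the choice to cluster on (RecordGuid, CreatedAt) are my assumptions rather than anything stated above:

```csharp
using System.Data.SqlClient;

class CreateRecordsTable
{
    const string ConnStr = "Server=backend-sql;Database=Records;Integrated Security=SSPI"; // hypothetical

    static void Main()
    {
        string[] ddlBatches =
        {
            // Monthly partition function (boundary dates are illustrative).
            @"CREATE PARTITION FUNCTION pfByMonth (datetime2)
                  AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');",

            // Map every partition to PRIMARY for simplicity; a real deployment would spread filegroups.
            @"CREATE PARTITION SCHEME psByMonth
                  AS PARTITION pfByMonth ALL TO ([PRIMARY]);",

            // Narrow, flat record clustered on (RecordGuid, CreatedAt) so a lookup by GUID is a
            // single clustered-index seek; PAGE compression keeps the 36-billion-row table small.
            @"CREATE TABLE dbo.Records
              (
                  RecordGuid uniqueidentifier NOT NULL,
                  CreatedAt  datetime2        NOT NULL,
                  Payload    varbinary(75)    NOT NULL,   -- the ~50-75 byte record body
                  CONSTRAINT PK_Records PRIMARY KEY CLUSTERED (RecordGuid, CreatedAt)
                      WITH (DATA_COMPRESSION = PAGE)
                      ON psByMonth (CreatedAt)
              );"
        };

        using (var conn = new SqlConnection(ConnStr))
        {
            conn.Open();
            foreach (var batch in ddlBatches)
                using (var cmd = new SqlCommand(batch, conn))
                    cmd.ExecuteNonQuery();
        }
    }
}
```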
As a side note, if the records come from a front-end web server farm, I would put SQL Server Express on each web server and, instead of INSERTing on the back end, I would SEND the info to the back end, using a local connection/transaction on the Express instance co-located with the web server (a rough sketch is at the end of this answer). This gives a much, much better availability story to the solution.

So this is how I would do it in SQL Server. The good news is that the problems you'll face are well understood and the solutions are known. That doesn't necessarily mean this is better than what you could achieve with Cassandra, BigTable or Dynamo. I'll let someone more knowledgeable in things no-sql-ish argue their case.
Note that I never mentioned the programming model, .NET support and such. I honestly think they're irrelevant in large deployments. They make a huge difference in the development process, but once deployed it doesn't matter how fast the development was if the ORM overhead kills performance :)
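To make the SEND side note above concrete, here is a rough Service Broker sketch: the web tier commits locally against the co-located Express instance, and the broker forwards the message to the back end asynchronously. All of the service, contract, and message-type names are invented, and the broker objects and routes would have to be created on both instances first:

```csharp
using System;
using System.Data.SqlClient;

class LocalSend
{
    // Local Express instance on the web server (hypothetical connection string).
    const string LocalExpress = @"Server=.\SQLEXPRESS;Database=Ingest;Integrated Security=SSPI";

    static void Main() =>
        QueueRecord(Guid.NewGuid(), new byte[64]);   // dummy 64-byte record

    // Queue a record locally; Service Broker delivers it to the back end when it can.
    static void QueueRecord(Guid recordGuid, byte[] payload)
    {
        const string sendSql = @"
DECLARE @h uniqueidentifier;
BEGIN DIALOG CONVERSATION @h
    FROM SERVICE [//Ingest/LocalSender]        -- hypothetical broker objects
    TO SERVICE   '//Ingest/BackendCollector'
    ON CONTRACT  [//Ingest/RecordContract]
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @h
    MESSAGE TYPE [//Ingest/Record] (@payload);";

        using (var conn = new SqlConnection(LocalExpress))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            using (var cmd = new SqlCommand(sendSql, conn, tx))
            {
                // Pack the GUID and the small record body into the message payload.
                var body = new byte[16 + payload.Length];
                recordGuid.ToByteArray().CopyTo(body, 0);
                payload.CopyTo(body, 16);

                cmd.Parameters.Add("@payload", System.Data.SqlDbType.VarBinary, -1).Value = body;
                cmd.ExecuteNonQuery();
                tx.Commit();   // local commit only; no dependency on the back end being reachable
            }
        }
    }
}
```

A real implementation would reuse dialogs instead of opening one per message, but the sketch shows why availability improves: the web tier only ever needs its local Express instance to be up.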
You can try using Cassandra or HBase, though you would need to read up on how to design the column families for your use case. Cassandra provides its own query language, but you need to use HBase's Java APIs to access the data directly. If you need to use HBase, then I recommend querying the data with Apache Drill from MapR, which is an open-source project. Drill's query language is SQL-compliant (keywords in Drill have the same meaning they would have in SQL).
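For the HBase + Drill route, .NET access would most likely go through Drill's ODBC driver. The sketch below is only indicative: the DSN, the hbase storage-plugin path, the column family/qualifier names, and the query shape are all placeholders for whatever your cluster actually exposes:

```csharp
using System;
using System.Data.Odbc;

class DrillLookup
{
    static void Main()
    {
        // 'DrillCluster' is a hypothetical ODBC DSN configured for the Apache Drill / MapR ODBC driver.
        using (var conn = new OdbcConnection("DSN=DrillCluster"))
        {
            conn.Open();

            // Drill exposes HBase tables via its hbase storage plugin; row_key is the HBase row key,
            // and `data`.`payload` stands in for a column family/qualifier holding the record body.
            // The GUID literal is inlined only for brevity; escape or parameterize it in real code.
            const string sql = @"
SELECT CONVERT_FROM(t.row_key, 'UTF8')          AS guid,
       CONVERT_FROM(t.`data`.`payload`, 'UTF8') AS payload
FROM   hbase.`records` t
WHERE  CONVERT_FROM(t.row_key, 'UTF8') = '3f2504e0-4f89-11d3-9a0c-0305e82c3301'";

            using (var cmd = new OdbcCommand(sql, conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"{reader["guid"]}: {reader["payload"]}");
            }
        }
    }
}
```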
Amazon Redshift is a great service. It was not available when the question was originally posted in 2010, but it is now a major player in 2017. It is a column-based database forked from Postgres, so standard SQL and Postgres connector libraries will work with it.
It is best used for reporting purposes, especially aggregation. The data from a single table is stored on different servers in Amazon's cloud, distributed according to the table's defined distkey, so you rely on distributed CPU power.
So SELECTs, and especially aggregated SELECTs, are lightning fast. Loading large amounts of data is preferably done with the COPY command from CSV files in Amazon S3. The drawbacks are that DELETEs and UPDATEs are slower than usual, but that is why Redshift is not primarily a transactional database, but more of a data warehouse platform.
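Since standard Postgres connector libraries work against Redshift, from .NET that usually means Npgsql. A minimal sketch under that assumption; the cluster endpoint, credentials, table layout, S3 path and IAM role are all placeholders:

```csharp
using System;
using Npgsql;

class RedshiftExample
{
    // Hypothetical Redshift cluster endpoint and credentials.
    const string ConnStr =
        "Host=my-cluster.example.us-east-1.redshift.amazonaws.com;Port=5439;" +
        "Database=analytics;Username=loader;Password=changeme;" +
        "Server Compatibility Mode=Redshift";   // tells Npgsql to skip Postgres-only features

    static void Main()
    {
        using (var conn = new NpgsqlConnection(ConnStr))
        {
            conn.Open();

            // Bulk load from S3 with Redshift's COPY (placeholder bucket and IAM role).
            const string copySql = @"
COPY records (record_guid, created_at, payload)
FROM 's3://my-bucket/records/2017-01.csv'
IAM_ROLE 'arn:aws:iam::000000000000:role/RedshiftCopyRole'
CSV;";
            using (var copy = new NpgsqlCommand(copySql, conn))
                copy.ExecuteNonQuery();

            // Aggregation fans out across slices when the table is distributed by its distkey.
            const string aggSql =
                "SELECT record_guid, COUNT(*) AS cnt FROM records GROUP BY 1 ORDER BY 2 DESC LIMIT 10";
            using (var agg = new NpgsqlCommand(aggSql, conn))
            using (var reader = agg.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt64(1)}");
            }
        }
    }
}
```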