Which database to choose (Cassandra, MongoDB, ...)?

Posted 2019-03-25 13:30

In SQL terms, we're storing data like this:

CREATE TABLE events (
  id         BIGINT,
  timestamp  TIMESTAMP,
  dimension1 INT,
  dimension2 INT,
  dimension3 INT
  -- etc.
);

All dimension values are integers. This table is becoming very large.

We want stupidly fast reads for queries like this:

SELECT dimension1, dimension2, COUNT(*) 
FROM   events
WHERE  dimension8 = 'foo'
AND    dimension9 = 'bar'
GROUP BY 1, 2

We want fast writes and don't care about transactions or strong consistency. We do care about availability and partition tolerance; eventual consistency is fine.

I was looking at "NoSQL" alternatives. Can Cassandra do the kind of queries I'm looking for? This isn't immediately obvious from reading the docs. If it can, what is its performance for those types of queries?

I was also looking at MongoDB, but its group() function has severe limitations as far as I can tell (a maximum of 10,000 rows).

Do you have experience with any of these databases, and would you recommend one as a solution to the problem described above?

Are there any other databases I should consider that can do this kind of query fast?

Cheers, jimmy

3 Answers
Answer by Juvenile、少年° · 2019-03-25 13:45

"Group by" and "stupidly fast" do not go together. That's just the nature of that beast... Hence the limitations on Mongo's group operation; Cassandra doesn't even support it natively (although it does for Hive or Pig queries via Hadoop... but those are not intended to be stupidly fast).

Systems like Twitter's Rainbird (which uses Cassandra) do real-time analytics by denormalizing and pre-computing the counts: http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
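For illustration, here is a minimal sketch of that pre-computation idea in Python with the DataStax cassandra-driver. The keyspace ("analytics"), table ("event_counts"), and the exact partitioning are assumptions, not anything taken from Rainbird itself; the point is just that every write bumps a Cassandra counter keyed on the dimensions you later want to filter and group on.

from cassandra.cluster import Cluster

# Assumed schema, created separately:
#   CREATE TABLE event_counts (
#     dimension8 int, dimension9 int, dimension1 int, dimension2 int,
#     count counter,
#     PRIMARY KEY ((dimension8, dimension9), dimension1, dimension2)
#   );
session = Cluster(["127.0.0.1"]).connect("analytics")

def record_event(d1, d2, d8, d9):
    # Bump the pre-computed counter for this combination of dimensions.
    session.execute(
        "UPDATE event_counts SET count = count + 1 "
        "WHERE dimension8 = %s AND dimension9 = %s "
        "AND dimension1 = %s AND dimension2 = %s",
        (d8, d9, d1, d2),
    )

# The original GROUP BY query then becomes a single-partition read:
#   SELECT dimension1, dimension2, count FROM event_counts
#   WHERE dimension8 = 8 AND dimension9 = 9;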

Answer by 男人必须洒脱 · 2019-03-25 13:48

I was also looking at MongoDB, but its group() function has severe limitations as far as I can tell (a maximum of 10,000 rows).

To clarify, that is 10,000 rows returned. In your example, group() will work for up to 10,000 combinations of dimension1/dimension2. If that's too large, you can fall back to the slower Map/Reduce. Note that if you're running a query with more than 10k results, it may be best to use Map/Reduce and save the output anyway; 10k is a large result set to just "throw away".
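If it helps, here is a minimal sketch of that Map/Reduce route in Python, assuming pymongo 3.x (where Collection.map_reduce is still available) and hypothetical names for the database ("analytics"), collection ("events"), and output collection ("dim1_dim2_counts"). It filters on dimension8/dimension9, groups by dimension1/dimension2, and persists the counts rather than throwing them away.

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().analytics

# Emit one (dimension1, dimension2) key per matching event...
mapper = Code("""
function () {
    emit({ d1: this.dimension1, d2: this.dimension2 }, 1);
}
""")

# ...and sum the emitted 1s per key.
reducer = Code("""
function (key, values) {
    return Array.sum(values);
}
""")

db.events.map_reduce(
    mapper, reducer, "dim1_dim2_counts",
    query={"dimension8": "foo", "dimension9": "bar"},
)

# The saved counts can then be read back cheaply:
# db.dim1_dim2_counts.find()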

Do you have experience with any of these databases, and would you recommend one as a solution to the problem described above?

Many people actually use MongoDB to do this type of summary in "real time", but they do it using "counters" instead of "aggregation": instead of rolling up the detailed data after the fact, they do a regular insert and then increment some counters.

In particular, they use atomic modifiers like $inc and $push to update data atomically in a single request.
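As a rough sketch of that counter pattern in Python with pymongo (the database and collection names here are made up), each write inserts the raw event and atomically $inc-increments a pre-aggregated counter document, so the "stupidly fast" read is a plain find instead of a GROUP BY:

from pymongo import MongoClient

db = MongoClient().analytics  # hypothetical database name

def record_event(event):
    # Keep the raw event (fast insert, no transaction needed).
    db.events.insert_one(event)

    # Atomically bump a pre-aggregated counter keyed on the dimensions
    # you want to report on; upsert=True creates it on the first hit.
    db.counters.update_one(
        {
            "dimension1": event["dimension1"],
            "dimension2": event["dimension2"],
            "dimension8": event["dimension8"],
            "dimension9": event["dimension9"],
        },
        {"$inc": {"count": 1}},
        upsert=True,
    )

# Reading the summary later is just an indexed lookup:
# db.counters.find({"dimension8": "foo", "dimension9": "bar"})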

Take a look at Hummingbird for a project doing this right now. There's also an open-source event-logging system backed by MongoDB, Graylog2, and ServerDensity does server event logging backed by MongoDB as well.

Looking at these may give you some inspiration for the types of logging you want to do.

Answer by 冷血范 · 2019-03-25 13:49

I started to go down this path for a similar purpose (metrics gathering and reporting), and here's where I ended up...

Getting the data in is the easy part. Getting the data out is the hard part.

If you have time and talent, you could learn and use a combination of open source tools as described here: http://kibana.org/infrastructure.html. The parts list:

  • Syslog-ng - syslog daemon
  • Logstash - powerful log pipeline
  • RabbitMQ or Redis - for queuing messages
  • Elasticsearch - full-text document storage and search
  • Graphite - from Orbitz, scalable real-time graphing
  • Statsd - from Etsy, counts occurrences of fields and ships them to Graphite (see the sketch after this list)
  • Graphital - a Ruby daemon that sends host-level performance data to Graphite
  • Kibana - a browser-based log analysis front end for Logstash and Elasticsearch
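To show how the Statsd/Graphite part of that stack gets fed, here is a minimal sketch that emits a counter using statsd's plain-text UDP protocol; the metric name and the localhost:8125 address are assumptions:

from socket import AF_INET, SOCK_DGRAM, socket

sock = socket(AF_INET, SOCK_DGRAM)

def count_event(dimension8, dimension9):
    # statsd counters are plain text in the form "<metric>:<value>|c";
    # statsd aggregates them and ships the totals on to Graphite.
    metric = f"events.{dimension8}.{dimension9}"
    sock.sendto(f"{metric}:1|c".encode(), ("localhost", 8125))

count_event("foo", "bar")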

If you have more money than time, consider Splunk. It's expensive, but it's a good choice for a lot of situations. For example, I'm in a situation where the client is extremely short on people but doesn't mind spending money, so Splunk has been a good fit: it's more of a turn-key solution than learning and stitching together a composite of tools.
