Strategies for Real-Time Aggregations in MongoDB

Posted 2020-06-04 04:31

Question:

In exploring ways to do real-time analytics with MongoDB, there seems to be a fairly standard way to do sums, but nothing in terms of more complex aggregation. Some things that have helped...

  • Twitter's Rainbird: Realtime sums, incrementing counters on keys hierarchically. Cassandra.
  • Yahoo's S4 and source: Not sure exactly how this works yet, but it looks like it's real-time map-reduce. So basically, for every record that's added, you pass it to a mapper, it converts it to a hash, and that sends it to be integrated into the report document.
  • http://www.slideshare.net/dacort/mongodb-realtime-data-collection-and-stats-generation
  • Hummingbird

The basic approach for doing sums is to atomically increment document keys for each new record that comes in, to cache common queries:

Stats.collection.update({"keys" => ["a", "b", "c"]}, {"$inc" => {"counter_1" => 1, "counter_2" => 1}}, :upsert => true)

This doesn't work for aggregates other than sums though. My question is, can something like this be done for averages, min, and max in mongodb?

Say you have a document like this:

{
  :date => "04/27/2011",
  :page_views => 1000,
  :user_birthdays => ["12/10/1980", "6/22/1971", ...] # 1000 total
}

Could you do some atomic or optimized/real-time operation that grouped the birthdays into something like this?

{
  :date => "04/27/2011",
  :page_views => 1000,
  :user_birthdays => ["12/10/1980", "6/22/1971", ...], # 1000 total
  :average_age => 27.8,
  :age_rank => {
    "0 to 20" => 180,
    "20 to 30" => 720,
    "30 to 40" => 100,
    "40 to 50" => 0
  }
}

Just as you can do Doc.collection.update({x => 1}, {"$push" => {"user_birthdays" => "12/10/1980"}}) to add something to an array without having to load the document, can you do something similar to average or aggregate the array? Is there something along these lines that you use for real-time aggregation?

MapReduce is used to do this in batch-processing jobs. I'm looking for patterns for something like real-time map-reduce for:

  1. Averages: every time you push a new item to an array in mongodb, what's the best way to average those values in real-time?
  2. Grouping: if you group ages into 10-year brackets, and you have an ages array, how could you optimally update the count for each group as you're updating the document with the new age? Say the ages array will be constantly pushed/pulled.
  3. Min/Max: what are some ways to compute and store the min/max of that ages array in that document?

Answer 1:

"Could you do some atomic or optimized/real-time operation that grouped the birthdays into something like this?"

It looks like you've added two fields, age_rank and average_age. These are effectively calculated fields based on the data you already have. If I gave you the document with page views and user birthdays, it should be trivial for the client code to find min/max, average, etc.
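As a rough illustration of that client-side calculation, here is a minimal sketch in plain Ruby. It assumes the document has already been loaded into a hash shaped like the one in the question, that dates are "MM/DD/YYYY" strings, and that brackets are uniform 10-year ranges. The helper names (parse_birthday, age_on, bracket_label, summarize) and the min_age/max_age keys are made up for this example.

```ruby
require "date"

# Parse an "MM/DD/YYYY" string (the format assumed from the question).
def parse_birthday(str)
  month, day, year = str.split("/").map(&:to_i)
  Date.new(year, month, day)
end

# Whole years between a birthday and a reference date.
def age_on(birthday, on)
  age = on.year - birthday.year
  # Subtract one if the birthday has not yet occurred in the reference year.
  age -= 1 if ([on.month, on.day] <=> [birthday.month, birthday.day]) < 0
  age
end

# Label for a uniform 10-year bracket, e.g. 27 -> "20 to 30".
def bracket_label(age, width = 10)
  low = (age / width) * width
  "#{low} to #{low + width}"
end

# Compute the derived fields from a document already loaded by the client.
def summarize(doc)
  on = parse_birthday(doc[:date])
  ages = doc[:user_birthdays].map { |b| age_on(parse_birthday(b), on) }
  return nil if ages.empty?

  age_rank = ages.each_with_object(Hash.new(0)) do |age, counts|
    counts[bracket_label(age)] += 1
  end

  {
    :average_age => ages.inject(:+).to_f / ages.size,
    :min_age     => ages.min,
    :max_age     => ages.max,
    :age_rank    => age_rank
  }
end
```

Calling summarize on a document with :date and :user_birthdays returns the derived values without storing them, so nothing can drift out of sync with the array.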

It seems to me that you're asking for MongoDB to perform the aggregation for you server-side. But you're adding the limitation that you don't want to use Map/Reduce?

If I'm understanding your question correctly, you're looking for something where you can say "add this item to an array and have all dependent items update themselves"? You don't want readers to perform any logic, you want everything to happen "magically" on the server side.

So there are three different ways to tackle this, but only one of them is currently available:

  1. Write this logic client-side. It doesn't sound like the solution you want, but it will work. If you have the underlying data, doing a max/min/med/avg should be pretty trivial in most languages.
  2. Leverage the upcoming features for Aggregation. These are not scheduled until 1.9.x. Improved aggregation will allow you to extract the data you're looking for; however, you'll still have to write the appropriate queries. The underlying DB still does not contain the data you're looking for.
  3. You need triggers. If you really want the DB to always be consistent and contain summarized data, then this is what you need. However, the triggers feature does not yet exist.

Unfortunately, your only option right now is #1. Fortunately, I know of several people that are using option #1 successfully.
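Option #1 can also be applied at write time: the client works out the derived increments itself and sends them in the same update as the push, so readers only divide a stored sum by a stored count. The following is a sketch under assumptions, not a tested recipe. The field names ages, age_sum and age_count and the helpers are hypothetical, ages are stored as integers, and brackets are uniform 10-year ranges.

```ruby
# Label for a uniform 10-year bracket, e.g. 27 -> "20 to 30".
def bracket_label(age, width = 10)
  low = (age / width) * width
  "#{low} to #{low + width}"
end

# Build one update document that pushes the new age and bumps the
# derived counters in the same atomic operation.
def add_age_update(age)
  {
    "$push" => { "ages" => age },
    "$inc"  => {
      "age_sum"   => age,                       # running total for the average
      "age_count" => 1,                         # number of ages pushed so far
      "age_rank.#{bracket_label(age)}" => 1     # per-bracket counter
    }
  }
end

# The average is derived at read time from the two stored counters.
def average_age(doc)
  return nil if doc["age_count"].to_i.zero?
  doc["age_sum"].to_f / doc["age_count"]
end
```

The hash from add_age_update would be passed as the second argument to the same kind of update call shown in the question. Min and max do not fit this pattern as neatly: later MongoDB releases (2.6 onward) added $min and $max update operators that handle inserts, but after a pull the extremes have to be recomputed from the array.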



Answer 2:

There is work planned for the upcoming 1.9.x unstable release that may have aggregations.

See: https://jira.mongodb.org/browse/SERVER-447

Of course, it may get bumped to a later release.
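For reference, the aggregation framework that eventually shipped (in MongoDB 2.2) expresses this kind of summary as a pipeline of stages. The sketch below only builds such a pipeline as plain Ruby data. It assumes each document has a date field and an ages array of integers, which is a simplification of the birthdays array in the question, and the function name is made up for illustration.

```ruby
# Build an aggregation pipeline that summarizes the ages array
# for the documents matching one date.
def age_summary_pipeline(date)
  [
    { "$match"  => { "date" => date } },   # restrict to the day of interest
    { "$unwind" => "$ages" },              # emit one document per array element
    { "$group"  => {
        "_id"         => "$date",
        "average_age" => { "$avg" => "$ages" },
        "min_age"     => { "$min" => "$ages" },
        "max_age"     => { "$max" => "$ages" }
      } }
  ]
end
```

With a driver version that supports aggregation, the resulting array would be handed to the collection's aggregate method. As the first answer notes, this still computes the values at query time; nothing is stored back into the document.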