While exploring ways to do real-time analytics with MongoDB, I've found that there seems to be a fairly standard way to do sums, but nothing for more complex aggregation. Some things that have helped:
- Twitter's Rainbird: Real-time sums that increment counters on hierarchical keys. Built on Cassandra.
- Yahoo's S4 and source: Not sure exactly how this works yet, but it looks like real-time map-reduce. Basically, every record that's added is passed to a mapper, which converts it to a hash and sends it on to be integrated into the report document.
- http://www.slideshare.net/dacort/mongodb-realtime-data-collection-and-stats-generation
- Hummingbird
The basic approach for doing sums is to atomically increment document keys for each new record that comes in, to cache common queries:
Stats.collection.update({"keys" => ["a", "b", "c"]}, {"$inc" => {"counter_1" => 1, "counter_2" => 1}}, :upsert => true)
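To make the shape of that update explicit, here is a minimal sketch that builds the selector and the `$inc` document for one incoming record. The helper name `increment_update` is hypothetical (not part of any driver), and the sketch assumes each counter is bumped by exactly 1 per record.

```ruby
# Hypothetical helper: build the selector and $inc document for one record.
# Assumes every named counter is incremented by 1.
def increment_update(keys, counter_names)
  selector = { "keys" => keys }
  increments = counter_names.each_with_object({}) { |name, inc| inc[name] = 1 }
  [selector, { "$inc" => increments }]
end

selector, update = increment_update(["a", "b", "c"], ["counter_1", "counter_2"])

# The pair would then be handed to the driver, as in the line above:
#   Stats.collection.update(selector, update, :upsert => true)
```

Because the whole change is a single update document, the server applies it atomically to the matched document, and the upsert creates the document the first time a key combination is seen.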
This doesn't work for aggregates other than sums, though. My question is: can something like this be done for averages, min, and max in MongoDB?
Say you have a document like this:
{
:date => "04/27/2011",
:page_views => 1000,
:user_birthdays => ["12/10/1980", "6/22/1971", ...] # 1000 total
}
Could you do some atomic or optimized/real-time operation that grouped the birthdays into something like this?
{
:date => "04/27/2011",
:page_views => 1000,
:user_birthdays => ["12/10/1980", "6/22/1971", ...], # 1000 total
:average_age => 27.8,
:age_rank => {
"0 to 20" => 180,
"20 to 30" => 720,
"30 to 40" => 100,
"40 to 50" => 0
}
}
...just like you can do Doc.collection.update({"x" => 1}, {"$push" => {"user_birthdays" => "12/10/1980"}})
to add something to an array without having to load the document, can you do something similar to average/aggregate the array? Is there something along these lines that you use for real-time aggregation?
MapReduce is used to do this in batch-processing jobs, but I'm looking for patterns for something like real-time map-reduce for:
- Averages: every time you push a new item to an array in MongoDB, what's the best way to average those values in real-time?
- Grouping: if you group ages into 10-year brackets, and you have an ages array, how could you optimally update the count for each group as you're updating the document with the new age? Say the ages array will be constantly pushed/pulled.
- Min/Max: what are some ways to compute and store the min/max of that ages array in that document?
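
For illustration only, here is a minimal sketch of one way the three cases above might be folded into the same single update as the `$push`. It assumes a running sum and count are stored so the average can be derived on read, that each bracket is its own counter, and that the server supports the `$min` and `$max` update operators (MongoDB 2.6 or later, as far as I know). The field names `age_sum`, `age_count`, `min_age`, `max_age`, the uniform 10-year bracket labels, and the helper names are all hypothetical.

```ruby
require "date"

# Hypothetical helper: age in whole years on a given day.
def age_on(birthday, today)
  age = today.year - birthday.year
  # Subtract one if this year's birthday has not happened yet.
  age -= 1 if ([today.month, today.day] <=> [birthday.month, birthday.day]) < 0
  age
end

# Hypothetical helper: label for a uniform 10-year bracket, e.g. "20 to 30".
def bracket_for(age)
  low = (age / 10) * 10
  "#{low} to #{low + 10}"
end

# Build one update document that pushes the birthday and maintains
# the running aggregates in the same atomic operation.
def birthday_update(birthday_str, today)
  age = age_on(Date.strptime(birthday_str, "%m/%d/%Y"), today)
  {
    "$push" => { "user_birthdays" => birthday_str },
    "$inc"  => {
      "age_sum"   => age,  # average = age_sum / age_count, computed on read
      "age_count" => 1,
      "age_rank.#{bracket_for(age)}" => 1
    },
    "$min"  => { "min_age" => age },
    "$max"  => { "max_age" => age }
  }
end

update = birthday_update("12/10/1980", Date.new(2011, 4, 27))
# Would be passed to the driver like the $push example, e.g.:
#   Doc.collection.update({"date" => "04/27/2011"}, update, :upsert => true)
```

A pull can reverse the sum, count, and bracket counters with negative `$inc` values, but it cannot undo `$min` or `$max`; if the removed value was the current extreme, the min/max would have to be recomputed from the array.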