In exploring ways to do real-time analytics with MongoDB, there seems to be a fairly standard way to do sums, but nothing in terms of more complex aggregation. Some things that have helped...
- Twitter's Rainbird: real-time sums, incrementing counters on keys hierarchically; built on Cassandra.
- Yahoo's S4 (and its source): I'm not sure exactly how it works yet, but it looks like real-time map-reduce: for every record that's added, you pass it to a mapper, which converts it to a hash and sends it to be merged into the report document.
- http://www.slideshare.net/dacort/mongodb-realtime-data-collection-and-stats-generation
- Hummingbird
The basic approach for doing sums is to atomically increment document keys for each new record that comes in, to cache common queries:
Stats.collection.update({"keys" => ["a", "b", "c"]}, {"$inc" => {"counter_1" => 1, "counter_2" => 1}}, :upsert => true)
This doesn't work for aggregates other than sums, though. My question is: can something like this be done for averages, min, and max in MongoDB?
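For averages in particular, one common workaround (my own suggestion, not something MongoDB provides natively) is to never store the average at all: keep a running sum and a count, bump both atomically on each write, and divide on read. A minimal pure-Ruby sketch of that bookkeeping, where the age_sum/age_count field names are illustrative:

```ruby
# Sketch: instead of storing an average, keep a running sum and count.
# In MongoDB both fields could be bumped atomically in one "$inc" update;
# here the bookkeeping is shown in plain Ruby.
stats = { "age_sum" => 0.0, "age_count" => 0 }

def record_age(stats, age)
  stats["age_sum"]   += age   # in Mongo: "$inc" => {"age_sum" => age}
  stats["age_count"] += 1     # in Mongo: "$inc" => {"age_count" => 1}
end

record_age(stats, 25)
record_age(stats, 31)
record_age(stats, 28)

average = stats["age_sum"] / stats["age_count"]
puts average # => 28.0
```

The average is derived at read time, so each write stays a single atomic increment.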
Say you have a document like this:
{
  :date => "04/27/2011",
  :page_views => 1000,
  :user_birthdays => ["12/10/1980", "6/22/1971", ...] # 1000 total
}
Could you do some atomic or optimized/real-time operation that grouped the birthdays into something like this?
{
  :date => "04/27/2011",
  :page_views => 1000,
  :user_birthdays => ["12/10/1980", "6/22/1971", ...], # 1000 total
  :average_age => 27.8,
  :age_rank => {
    "0 to 20" => 180,
    "20 to 30" => 720,
    "30 to 40" => 100,
    "40 to 50" => 0
  }
}
...just like you can do Doc.collection.update({"x" => 1}, {"$push" => {"user_birthdays" => "12/10/1980"}})
to add something to an array without having to load the document in, can you do something like that to average/aggregate the array? Is there something along these lines that you use for real-time aggregation?
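Building on that $push example, MongoDB does let you combine several update operators in one atomic update, so the array push and some aggregate counters can stay in sync. A hedged sketch that only builds the update document (the age_sum/age_count field names are illustrative, not from the question):

```ruby
# Sketch: one atomic update that both pushes the raw birthday and bumps
# running aggregates. Only the update document is built here; the
# commented-out line shows how it might be sent with the Ruby driver.
birthday = "12/10/1980"
age = 30  # age would be derived from the birthday by the application

update = {
  "$push" => { "user_birthdays" => birthday },
  "$inc"  => { "age_sum" => age, "age_count" => 1 }
}
# Doc.collection.update({"date" => "04/27/2011"}, update, :upsert => true)

puts update["$inc"]["age_count"] # => 1
```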
MapReduce is used to do this in batch-processing jobs; I'm looking for patterns for something like real-time map-reduce for:
- Averages: every time you push a new item to an array in MongoDB, what's the best way to average those values in real time?
- Grouping: if you group ages into 10-year brackets and you have an ages array, how could you optimally update the count for each group as you update the document with a new age? Say the ages array will be constantly pushed to and pulled from.
- Min/Max: what are some ways to compute and store the min/max of that ages array in that document?
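For the grouping and min/max cases, the same counter idea extends to buckets: map each incoming age to a bracket label and increment that bracket's counter (in MongoDB, an $inc on a dotted key such as "age_rank.20 to 30"). Min/max can be tracked on pushes by comparing the incoming value against the stored one; pulls are harder, since removing the current extremum forces a recompute from the raw array. A plain-Ruby sketch of the per-insert logic, using uniform 10-year buckets rather than the wider "0 to 20" first bracket in the question:

```ruby
# Sketch: incremental bucket counts plus min/max, maintained per insert.
doc = { "age_rank" => Hash.new(0), "min_age" => nil, "max_age" => nil }

def bucket_for(age)
  low = (age / 10) * 10          # integer division: 25 -> 20, 47 -> 40
  "#{low} to #{low + 10}"
end

def record_age(doc, age)
  doc["age_rank"][bucket_for(age)] += 1
  doc["min_age"] = age if doc["min_age"].nil? || age < doc["min_age"]
  doc["max_age"] = age if doc["max_age"].nil? || age > doc["max_age"]
end

[25, 31, 28, 47].each { |age| record_age(doc, age) }

puts doc["age_rank"]["20 to 30"] # => 2
puts doc["min_age"]              # => 25
puts doc["max_age"]              # => 47
```

Under constant pulls, the bucket counters can still be decremented atomically, but the min/max fields would need an occasional recompute when the extremum itself is removed.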
There is work planned for the upcoming 1.9.x unstable release that may add aggregation support.
See: https://jira.mongodb.org/browse/SERVER-447
Of course, it may get bumped to a later release.
It looks like you've added two fields, age_rank and average_age. These are effectively calculated fields based on the data you already have. If I gave you the document with page views and user birthdays, it should be really trivial for the client code to find min/max, average, etc. It seems to me that you're asking MongoDB to perform the aggregation for you server-side, but with the limitation that you don't want to use Map/Reduce?
If I'm understanding your question correctly, you're looking for something where you can say "add this item to an array and have all dependent items update themselves"? You don't want readers to perform any logic, you want everything to happen "magically" on the server side.
So there are three different ways to tackle this, but only one of them is currently available:
Unfortunately, your only option right now is #1. Fortunately, I know of several people that are using option #1 successfully.