I am looking for a way to generate some summary statistics using Mongo. Suppose I have a collection with many records of the form
{"name" : "Jeroen", "gender" : "m", "age" :27.53 }
Now I want to get the distributions for gender and age. Assume for gender, there are only values "m"
and "f"
. What is the most efficient way of getting the total count of males and females in my collection?
And for age, is there a way that does some 'binning' and gives me a histogram like summary; i.e. the number of records where age is in the intervals: [0, 2), [2, 4), [4, 6) ...
etc?
an easy way to get the total count of males would be
db.x.find({"gender": "m"}).count()
If you want both male and female counts in just one query, then there is no easy way. Map/reduce would be one possibility. Or perhaps the new aggregation framework. The same is true for your binning requirement
Mongo is not great for aggregation, but it's fantastic for many small incremental updates. So the best way to solve this problem with mongo would be to collect the aggregation data in a seperate collection.
So, if you keep a stats collection with one document like this:
... then everytime you add or remove a person from the other collection, you count the respective fields up or down in the stats collection.
Queries to stats will be lightning fast this way, and you will hardly notice any performance overhead from counting the stats up and down.
Based on the answer of @ColinE binning for histogram can be done by
$bucketAuto
did not work for me since buckets seem to be collected on a logarithmic scale.allowDiskUse
is only necessary if you have millions of documentsI just tried out the new aggregation framework that will be available in MongoDB version 2.2 (2.2.0-rc0 has been released), which should have higher performance than map reduce since it doesn't rely on Javascript.
input data:
aggregation command for gender:
result:
To get the ages in bins:
result:
Konstantin's answer was right. MapReduce gets the job done. Here is the full solution in case others find this interesting.
To count genders, the map function key is the
this.gender
attribute for every record. The reduce function then simply adds them up:To do the binning, we set the key in the map function to round down to the nearest division by two. Therefore e.g. any value between 10 and 11.9999 will get the same key
"10-12"
. And then again we simply add them up:Depending on amount of data most effective way to find amount of males and females could be either naive query or map reduce job. Binning is best done via map reduce:
In the map phase your key is a bin, and value is 1, and in the reduce phase you just sum up values
With Mongo 3.4 this just got even easier, thanks to the new $bucket and $bucketAuto aggregation functions. The following query auto-buckets into two groups:
With the following input data:
It gives the following result:
Note, bucket and auto-bucket are typically used for continuous variables (numeric, date), but in this case auto-bucket works just fine.