Binning and tabulate (unique/count) in Mongo

2019-04-06 15:09发布

I am looking for a way to generate some summary statistics using Mongo. Suppose I have a collection with many records of the form

{"name" : "Jeroen", "gender" : "m", "age" :27.53 }

Now I want to get the distributions for gender and age. Assume for gender, there are only values "m" and "f". What is the most efficient way of getting the total count of males and females in my collection?

And for age, is there a way that does some 'binning' and gives me a histogram like summary; i.e. the number of records where age is in the intervals: [0, 2), [2, 4), [4, 6) ... etc?

6条回答
Ridiculous、
2楼-- · 2019-04-06 15:46

an easy way to get the total count of males would be db.x.find({"gender": "m"}).count()

If you want both male and female counts in just one query, then there is no easy way. Map/reduce would be one possibility. Or perhaps the new aggregation framework. The same is true for your binning requirement

Mongo is not great for aggregation, but it's fantastic for many small incremental updates. So the best way to solve this problem with mongo would be to collect the aggregation data in a seperate collection.

So, if you keep a stats collection with one document like this:

stats: [
  {
     "male": 23,
     "female": 17,
     "ageDistribution": {
       "0_2" : 3,
       "2_4" : 5,
       "4_6" : 7
     }
  }
]

... then everytime you add or remove a person from the other collection, you count the respective fields up or down in the stats collection.

db.stats.update({"$inc": {"male": 1, "ageDistribution.2_4": 1}})

Queries to stats will be lightning fast this way, and you will hardly notice any performance overhead from counting the stats up and down.

查看更多
老娘就宠你
3楼-- · 2019-04-06 15:49

Based on the answer of @ColinE binning for histogram can be done by

db.persons.aggregate([
  {
  $bucket: {
    groupBy: "$j.age",
    boundaries: [0,2,4,6,8,10,12,14,16,18,20],
    default: "Other",
    output: {
      "count": { $sum: 1 }
    }
  }
],
{allowDiskUse:true})

$bucketAuto did not work for me since buckets seem to be collected on a logarithmic scale. allowDiskUse is only necessary if you have millions of documents

查看更多
趁早两清
4楼-- · 2019-04-06 15:52

I just tried out the new aggregation framework that will be available in MongoDB version 2.2 (2.2.0-rc0 has been released), which should have higher performance than map reduce since it doesn't rely on Javascript.

input data:

{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }

aggregation command for gender:

db.collection.aggregate(
   {$project: {gender:1}},
   {$group: {
        _id: "$gender",
        count: {$sum: 1}
   }})

result:

{"result" : 
   [
     {"_id" : "m", "count" : 2},
     {"_id" : "f", "count" : 3}
   ],
   "ok" : 1
}

To get the ages in bins:

db.collection.aggregate(
   {$project: {
        ageLowerBound: {$subtract:["$age", {$mod:["$age",2]}]}}
   },
   {$group: {
       _id:"$ageLowerBound", 
       count:{$sum:1}
   }
})

result:

{"result" : 
    [
       {"_id" : 26, "count" : 3},
       {"_id" : 22, "count" : 2}
    ],
    "ok" : 1
}
查看更多
▲ chillily
5楼-- · 2019-04-06 16:09

Konstantin's answer was right. MapReduce gets the job done. Here is the full solution in case others find this interesting.

To count genders, the map function key is the this.gender attribute for every record. The reduce function then simply adds them up:

// count genders
db.persons.mapReduce(
    function(){
        emit(this["gender"], {count: 1})
    }, function(key, values){
        var result = {count: 0};
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    }, {out: { inline : 1}}
);

To do the binning, we set the key in the map function to round down to the nearest division by two. Therefore e.g. any value between 10 and 11.9999 will get the same key "10-12". And then again we simply add them up:

db.responses.mapReduce(
    function(){
        var x = Math.floor(this["age"]/2)*2;
        var key = x + "-" + (x+2);
        emit(key, {count: 1})
    }, function(state, values){
        var result = {count: 0};
        values.forEach(function(value) {
            result.count += value.count;
        });
        return result;
    }, {out: { inline : 1}}
);
查看更多
来,给爷笑一个
6楼-- · 2019-04-06 16:11

Depending on amount of data most effective way to find amount of males and females could be either naive query or map reduce job. Binning is best done via map reduce:

In the map phase your key is a bin, and value is 1, and in the reduce phase you just sum up values

查看更多
劫难
7楼-- · 2019-04-06 16:12

With Mongo 3.4 this just got even easier, thanks to the new $bucket and $bucketAuto aggregation functions. The following query auto-buckets into two groups:

db.bucket.aggregate( [
   {
     $bucketAuto: {
         groupBy: "$gender",
         buckets: 2
     }
   }
] )

With the following input data:

{ "_id" : 1, "age" : 22.34, "gender" : "f" }
{ "_id" : 2, "age" : 23.9, "gender" : "f" }
{ "_id" : 3, "age" : 27.4, "gender" : "f" }
{ "_id" : 4, "age" : 26.9, "gender" : "m" }
{ "_id" : 5, "age" : 26, "gender" : "m" }

It gives the following result:

{ "_id" : { "min" : "f", "max" : "m" }, "count" : 3 }
{ "_id" : { "min" : "m", "max" : "m" }, "count" : 2 }

Note, bucket and auto-bucket are typically used for continuous variables (numeric, date), but in this case auto-bucket works just fine.

查看更多
登录 后发表回答