Grouping documents in pairs using mongo aggregation

Posted 2019-06-02 07:55

Question:

I have a collection of items,

[ a, b, c, d ]

And I want to group them in pairs such as,

[ [ a, b ], [ b, c ], [ c, d ] ]

This will be used in calculating the differences between each item in the original collection, but that part is solved using several techniques such as the one in this question.

I know that this is possible with map reduce, but I want to know if it's possible with aggregation.

Edit: Here's an example,

The collection of items; each item is an actual document.

[
    { val: 1 },
    { val: 3 },
    { val: 6 },
    { val: 10 },
]

Grouped version:

[
    [ { val: 1 }, { val: 3 } ], 
    [ { val: 3 }, { val: 6 } ],
    [ { val: 6 }, { val: 10 } ]
]

The resulting collection (or aggregation result):

[
    { diff: 2 },
    { diff: 3 },
    { diff: 4 }
]

Answer 1:

This is something that just cannot be done with the aggregation framework, and the only current MongoDB method available for this type of operation is mapReduce.

The reason is that the aggregation framework has no way of referring to any document in the pipeline other than the current one. This applies to "grouping" pipeline stages as well: even though documents are grouped on a "key", you can't really work with the individual documents in the way you want to.

MapReduce on the other hand has one feature available that allows you to do what you want here, and it's not even "directly" related to aggregation. It is in fact the ability to have "globally scoped variables" across all stages. And having a "variable" to basically "store the last document" is all you need to achieve your result.

So it's quite simple code, and there is in fact no "reduction" required:

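db.collection.mapReduce(
    function () {
      // lastVal is shared across map calls through "scope"
      if (lastVal != null)
        emit( this._id, this.val - lastVal );
      // remember this value for the next document
      lastVal = this.val;
    },
    function() {}, // reducer is never called, since every emitted key is unique
    {
        "scope": { "lastVal": null },
        "out": { "inline": 1 }
    }
)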

Which gives you a result much like this:

{
    "results" : [
            {
                    "_id" : ObjectId("54a425a99b8bcd6f73e2d662"),
                    "value" : 2
            },
            {
                    "_id" : ObjectId("54a425a99b8bcd6f73e2d663"),
                    "value" : 3
            },
            {
                    "_id" : ObjectId("54a425a99b8bcd6f73e2d664"),
                    "value" : 4
            }
    ],
    "timeMillis" : 3,
    "counts" : {
            "input" : 4,
            "emit" : 3,
            "reduce" : 0,
            "output" : 3
    },
    "ok" : 1
}

That's really just picking "something unique" as the emitted _id value rather than anything specific, because all this is really doing is computing the difference between the values of adjacent documents.
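To see the "store the last document" idea in isolation, here is a minimal sketch in plain JavaScript that mimics what the map function does, without a MongoDB server. The names `sampleDocs` and `pairwiseDiffs` and the numeric `_id` values are made up for illustration, and the sketch assumes the documents are visited in the order shown.

```javascript
// Stand-in for the collection; the _id values are illustrative only.
const sampleDocs = [
  { _id: 1, val: 1 },
  { _id: 2, val: 3 },
  { _id: 3, val: 6 },
  { _id: 4, val: 10 }
];

// Mimics the map step: one variable outlives each per-document call,
// playing the role of the "scope" variable lastVal.
function pairwiseDiffs(documents) {
  let lastVal = null;
  const results = [];
  for (const doc of documents) {
    if (lastVal !== null) {
      // Equivalent of emit(this._id, this.val - lastVal)
      results.push({ _id: doc._id, value: doc.val - lastVal });
    }
    lastVal = doc.val; // remember this value for the next document
  }
  return results;
}

console.log(pairwiseDiffs(sampleDocs));
```

The first document produces no output because there is nothing before it to compare against, which is why four input documents yield three differences.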

Global variables are usually the solution to these types of "pairing" aggregations or to producing "running totals". Right now the aggregation framework has no access to global variables, even though it might well be a nice thing to have. The mapReduce framework has them, so it is probably fair to say that they should be available to the aggregation framework as well.

Right now they are not though, so stick with mapReduce.
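The "running totals" case mentioned above follows the same pattern: a single variable carried from one document to the next. A rough sketch in plain JavaScript, again with made-up names (`totalsInput`, `runningTotals`) and assuming the documents arrive in the desired order:

```javascript
// Stand-in for the collection; the _id values are illustrative only.
const totalsInput = [
  { _id: 1, val: 1 },
  { _id: 2, val: 3 },
  { _id: 3, val: 6 },
  { _id: 4, val: 10 }
];

// The accumulator `total` plays the role of a scoped variable that
// persists across per-document calls.
function runningTotals(documents) {
  let total = 0;
  return documents.map(function (doc) {
    total += doc.val;
    return { _id: doc._id, value: total };
  });
}

console.log(runningTotals(totalsInput));
```

In a mapReduce call the same effect would come from putting the accumulator in "scope" and emitting its current value for each document.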