MongoDB sum arrays from multiple documents on a pe

2020-03-24 08:02发布

问题:

I have the following document structure (simplified for this example)

{
  _id : ObjectId("sdfsdf"),
  result : [1, 3, 5, 7, 9]
},
{
  _id : ObjectId("asdref"),
  result : [2, 4, 6, 8, 10]
}

I want to get the sum of those result arrays, but not a total sum, instead a new array corresponding to the sum of the original arrays on an element basis, i.e.

result : [3, 7, 11, 15, 19]

I have searched through the myriad questions here and a few come close (e.g. this one, this one, and this one), but I can't quite get there.

I can get the sum of each array fine

aggregate(
    [
      {
        "$unwind" : "$result"
      },
      {
        "$group": {
          "_id": "$_id",
          "results" : { "$sum" : "$result"}
          }
      }
    ]
)

which gives me

[ { _id: sdfsdf, results: 25 },
  { _id: asdref, results: 30 } ]

but I can't figure out how to get the sum of each element

回答1:

You can use includeArrayIndex if you have 3.2 or newer MongoDb.

Then you should change $unwind.

Your code should be like this:

.aggregate(
    [
      {
        "$unwind" :  { path: "$result", includeArrayIndex: "arrayIndex" }
      },
      {
        "$group": {
          "_id": "$arrayIndex",
          "results" : { "$sum" : "$result"}
          }
      },
      { 
        $sort: { "_id": 1}
      },
      {
        "$group":{
          "_id": null,
          "results":{"$push":"$results"}
          } 
      },
      {
        "$project": {"_id":0,"results":1}
      }
    ]
)


回答2:

There is an alternate approach to this, but mileage may vary on how practical it is considering that a different approach would involve using $push to create an "array of arrays" and then applying $reduce as introduced in MongoDB 3.4 to $sum those array elements into a single array result:

db.collection.aggregate([
  { "$group": {
    "_id": null,
    "result": { "$push": "$result" }
  }},
  { "$addFields": {
    "result": {
      "$reduce": {
        "input": "$result",
        "initialValue": [],
        "in": {
          "$map": {
            "input": { 
              "$zip": {
                "inputs": [ "$$this", "$$value" ],
                "useLongestLength": true
              }
            },
            "as": "el",
            "in": { "$sum": "$$el" }
          }
        }
      }
    }
  }}
])

The real trick there is in the "input" to $map we use the $zip operation which creates a transposed list of arrays "pairwise" for the two array inputs.

In a first iteration this takes the empty array as supplied to $reduce and would return the "zipped" output with consideration to the first object found as in:

[ [0,1], [0,3], [0,5], [0,7], [0,9] ]

So the useLongestLength would substitute the empty array with 0 values out to the the length of the current array and "zip" them together as above.

Processing with $map, each element is subject to $sum which "reduces" the returned results as:

[ 1, 3, 5, 7, 9 ]

On the second iteration, the next entry in the "array of arrays" would be picked up and processed by $zip along with the previous "reduced" content as:

[ [1,2], [3,4], [5,6], [7,8], [9,10] ]

Which is then subject to the $map for each element using $sum again to produce:

[ 3, 7, 11, 15, 19 ]

And since there were only two arrays pushed into the "array of arrays" that is the end of the operation, and the final result. But otherwise the $reduce would keep iterating until all array elements of the input were processed.


So in some cases this would be the more performant option and what you should be using. But it is noted that particularly when using a null for $group you are asking "every" document to $push content into an array for the result.

This could be a cause of breaking the BSON Limit in extreme cases, and therefore when aggregating positional array content over large results, it is probably best to use $unwind with the includeArrayIndex option instead.

Or indeed actually take a good look at the process, where in particular if the "positional array" in question is actually the result of some other "aggregation operation", then you should rather be looking at the previous pipeline stages that were used to create the "positional array". And then consider that if you wanted those positions "aggregated further" to new totals, then you should in fact do that "before" the positional result was obtained.