Get filtered count of elements in array from $look

2019-01-28 21:56发布

问题:

I have this query in MongoDB:

db.emailGroup.aggregate([
    {
        "$lookup": 
        {
            "from": "link",
            "localField": "_id",
            "foreignField": "emailGroupId",
            "as": "link"
        },
    },
    {
        "$unwind": "$link"
    },
    {
        "$match": {
             'link.originalLink': ""
        }
    },
    {
        "$group" : {
            _id: '$_id',
            link: {
                $push: '$link'
            }
        }
    },
    {
        "$project": { 
            "size": { 
                "$sum": { 
                    "$map": { 
                        "input": "$link", 
                        "as": "l", 
                        "in": { 
                            "$size": {
                                "$ifNull": [
                                    "$$l.linkHistory", []
                                ]
                            }
                        } 
                    } 
                } 
            }
        }
    }
])

EmailGroup has partId field. I use $lookup to "join" other collection for sum her fields. I need group by partId field and sum custom field "size" for partId groups. Is this possible? Extra question: How can I add emailGroup fields to query result?

Sample documents:

emailGroup:

{
    "_id" : ObjectId("594a6c47f51e075db713ccb6"),
    "partId" : "f56c7c71eb14a20e6129a667872f9c4f",
}

link:

{
    "_id" : ObjectId("594b96d6f51e075db67c44c9"),
    "originalLink" : "",
    "emailGroupId" : ObjectId("594a6c47f51e075db713ccb6"),
    "linkHistory" : [ 
        {
            "_id" : ObjectId("594b96f5f51e075db713ccdf"),
        }, 
        {
            "_id" : ObjectId("594b971bf51e075db67c44ca"),
        }
    ]
}

回答1:

Annotation for those looking for - Foreign Count

A bit better than was originally answered is to actually use the newer form of $lookup from MongoDB 3.6. This can actually do the "counting" within the "sub-pipeline" expression as opposed to returning an "array" for subsequent filtering and counting or even using $unwind

db.emailGroup.aggregate([
  { "$lookup": {
    "from": "link",
    "let": { "id": "$_id" },
    "pipeline": [
      { "$match": {
        "originalLink": "",
        "$expr": { "$eq": [ "$$id", "$_id" ] }
      }},
      { "$count": "count" }
    ],
    "as": "linkCount"    
  }},
  { "$addFields": {
    "linkCount": { "$sum": "$linkCount.count" }
  }}
])

Not what the original question was asking for but part of the below answer in the now most optimal form, as of course the result of $lookup is reduced to the "matched count" only instead of "all matched documents".


Original

The correct way to do this would be to add the "linkCount" to the $group stage as well as a $first on any additional fields of the parent document in order to get the "singular" form as was the state "before" the $unwind was processed on the array that was the result of $lookup:

All Detail

db.emailGroup.aggregate([
  { "$lookup": {
    "from": "link",
    "localField": "_id",
    "foreignField": "emailGroupId",
    "as": "link"    
  }},
  { "$unwind": "$link" },
  { "$match": { "link.originalLink": "" } },
  { "$group": {
    "_id": "$_id",
    "partId": { "$first": "$partId" },
    "link": { "$push": "$link" },
    "linkCount": {
      "$sum": {
        "$size": {
          "$ifNull": [ "$link.linkHistory", [] ]
        } 
      }   
    }
  }}
])

Produces:

{
    "_id" : ObjectId("594a6c47f51e075db713ccb6"),
    "partId" : "f56c7c71eb14a20e6129a667872f9c4f",
    "link" : [ 
        {
            "_id" : ObjectId("594b96d6f51e075db67c44c9"),
            "originalLink" : "",
            "emailGroupId" : ObjectId("594a6c47f51e075db713ccb6"),
            "linkHistory" : [ 
                {
                    "_id" : ObjectId("594b96f5f51e075db713ccdf")
                }, 
                {
                    "_id" : ObjectId("594b971bf51e075db67c44ca")
                }
            ]
        }
    ],
    "linkCount" : 2
}

Group By partId

db.emailGroup.aggregate([
  { "$lookup": {
    "from": "link",
    "localField": "_id",
    "foreignField": "emailGroupId",
    "as": "link"    
  }},
  { "$unwind": "$link" },
  { "$match": { "link.originalLink": "" } },
  { "$group": {
    "_id": "$partId",
    "linkCount": {
      "$sum": {
        "$size": {
          "$ifNull": [ "$link.linkHistory", [] ]
        } 
      }   
    }
  }}
])

Produces

{
    "_id" : "f56c7c71eb14a20e6129a667872f9c4f",
    "linkCount" : 2
}

The reason you do it this way with a $unwind and then a $match is because of how MongoDB actually handles the pipeline when issued in that order. This is what happens to the $lookup as is demonstrated the the "explain" output from the operation:

    {
        "$lookup" : {
            "from" : "link",
            "as" : "link",
            "localField" : "_id",
            "foreignField" : "emailGroupId",
            "unwinding" : {
                "preserveNullAndEmptyArrays" : false
            },
            "matching" : {
                "originalLink" : {
                    "$eq" : ""
                }
            }
        }
    }, 
    {
        "$group" : {

I'm leaving the part with $group in that output to demonstrate that the other two pipeline stages "disappear". This is because they have been "rolled-up" into the $lookup pipeline stage as shown. This is in fact how MongoDB deals with the possibility that the BSON Limit can be exceeded by the result of "joining" results of $lookup into an array of the parent document.

You can alternately write the operation like this:

All Detail

db.emailGroup.aggregate([
  { "$lookup": {
    "from": "link",
    "localField": "_id",
    "foreignField": "emailGroupId",
    "as": "link"    
  }},
  { "$addFields": {
    "link": {
      "$filter": {
        "input": "$link",
        "as": "l",
        "cond": { "$eq": [ "$$l.originalLink", "" ] }    
      }
    },
    "linkCount": {
      "$sum": {
        "$map": {
          "input": {
            "$filter": {
              "input": "$link",
              "as": "l",
              "cond": { "$eq": [ "$$l.originalLink", "" ] }
            }
          },
          "as": "l",
          "in": { "$size": { "$ifNull": [ "$$l.linkHistory", [] ] } }
        }     
      }
    }    
  }}
])

Group by partId

db.emailGroup.aggregate([
  { "$lookup": {
    "from": "link",
    "localField": "_id",
    "foreignField": "emailGroupId",
    "as": "link"    
  }},
  { "$addFields": {
    "link": {
      "$filter": {
        "input": "$link",
        "as": "l",
        "cond": { "$eq": [ "$$l.originalLink", "" ] }    
      }
    },
    "linkCount": {
      "$sum": {
        "$map": {
          "input": {
            "$filter": {
              "input": "$link",
              "as": "l",
              "cond": { "$eq": [ "$$l.originalLink", "" ] }
            }
          },
          "as": "l",
          "in": { "$size": { "$ifNull": [ "$$l.linkHistory", [] ] } }
        }     
      }
    }    
  }},
  { "$unwind": "$link" },
  { "$group": {
    "_id": "$partId",
    "linkCount": { "$sum": "$linkCount" } 
  }}
])

Which has the same output but "differs" from the first query in that the $filter here is applied "after" ALL results of the $lookup are returned into the new array of the parent document.

So in performance terms, it is actually more effective to do it the first way as well as being portable to possible large result sets "before filtering" which would otherwise break the 16MB BSON Limit.

Note: If you do not have $addFields available (added with MongoDB 3.4) then use $project and specify "ALL" of the fields you wish to return.


As a side-note for those who are interested, in future releases of MongoDB ( presumably 3.6 and up ) you can use $replaceRoot instead of an $addFields with usage of the new $mergeObjects pipeline operator. The advantage of this is as a "block", we can declare the "filtered" content as a variable via $let, which means you do not need to write the same $filter "twice":

db.emailGroup.aggregate([
  { "$lookup": {
    "from": "link",
    "localField": "_id",
    "foreignField": "emailGroupId",
    "as": "link"    
  }},
  { "$replaceRoot": {
    "newRoot": {
      "$mergeObjects": [
        "$$ROOT",
        { "$let": {
          "vars": {
            "filtered": {
              "$filter": {
                "input": "$link",
                "as": "l",
                "cond": { "$eq": [ "$$l.originalLink", "" ] }    
              }
            }
          },
          "in": {
            "link": "$$filtered",
            "linkCount": {
              "$sum": {
                "$map": {
                  "input": "$$filtered.linkHistory",
                  "as": "lh",
                  "in": { "$size": { "$ifNull": [ "$$lh", [] ] } } 
                }   
              } 
            }  
          }
        }}
      ]
    }
  }}
])

Nonetheless the best way to do such "filtered" $lookup operations is "still" at this time using the $unwind then $match pattern, until such time as you can provide query arguments to $lookup directly.

Note: By "directly" here I do not mean the "non-correlated" form of $lookup which should also be in the MongoDB 3.6 release, as this would indeed issue another "pipeline" execution for each document in the parent collection. So that feature still does not replace the present "best effective" way of retrieving only the matched items.