How to merge array field in document in Mongo aggr

2019-02-01 02:53发布

问题:

I have one requirement where i need to do aggregation on two records both have an array field with different value. What I need that when I do aggregation on these records the result should have one array with unique values from both different arrays. Here is example :

First record

 { Host:"abc.com" ArtId:"123", tags:[ "tag1", "tag2" ] }

Second record

{ Host:"abc.com" ArtId:"123", tags:[ "tag2", "tag3" ] }

After aggregation on host and artid i need result like this:

 { Host: "abc.com", ArtId: "123", count :"2", tags:[ "tag1", "tag2", "tag3" ]}

I tried $addToset in group statement but it gives me like this tags :[["tag1","tag2"],["tag2","tag3"]]

Could you please help me how i can achieve this in aggregation

回答1:

TLDR;

Modern releases should use $reduce with $setUnion after the initial $group as is shown:

db.collection.aggregate([
  { "$group": {
    "_id": { "Host": "$Host", "ArtId": "$ArtId" },
    "count": { "$sum": 1 },
    "tags": { "$addToSet": "$tags" }
  }},
  { "$addFields": {
    "tags": {
      "$reduce": {
        "input": "$tags",
        "initialValue": [],
        "in": { "$setUnion": [ "$$value", "$$this" ] }
      }
    }
  }}
])

You were right in finding the $addToSet operator, but when working with content in an array you generally need to process with $unwind first. This "de-normalizes" the array entries and essentially makes a "copy" of the parent document with each array entry as a singular value in the field. That's what you need to avoid the behavior you are seeing without using that.

Your "count" poses an interesting problem though, but easily solved through the use of a "double unwind" after an initial $group operation:

db.collection.aggregate([
    // Group on the compound key and get the occurrences first
    { "$group": {
        "_id": { "Host": "$Host", "ArtId": "$ArtId" },
        "tcount": { "$sum": 1 },
        "ttags": { "$push": "$tags" }
    }},

    // Unwind twice because "ttags" is now an array of arrays
    { "$unwind": "$ttags" },
    { "$unwind": "$ttags" },

    // Now use $addToSet to get the distinct values        
    { "$group": {
        "_id": "$_id",
        "tcount": { "$first": "$tcount" },
        "tags": { "$addToSet": "$ttags" }
    }},

    // Optionally $project to get the fields out of the _id key
    { "$project": {
        "_id": 0,
        "Host": "$_id.Host",
        "ArtId": "$_id.ArtId",
        "count": "$tcount",
        "tags": "$ttags"
    }}
])

That final bit with $project is also there because I used "temporary" names for each of the fields in other stages of the aggregation pipeline. This is because there is an optimization in $project that "copies" the fields from an existing stage in the order they already appeared "before" any "new" fields are added to the document.

Otherwise the output would look like:

{  "count":2 , "tags":[ "tag1", "tag2", "tag3" ], "Host": "abc.com", "ArtId": "123" }

Where the fields are not in the same order as you might think. Trivial really, but it matters to some people, so worth explaining why, and how to handle.

So $unwind does the work to keep the items separated and not in arrays, and doing the $group first allows you to get the "count" of the occurrences of the "grouping" key.

The $first operator used later "keeps" that "count" value, as it just got "duplicated" for every value present in the "tags" array. It's all the same value anyway so it does not matter. Just pick one.