I have one requirement where i need to do aggregation on two records both have an array field with different value. What I need that when I do aggregation on these records the result should have one array with unique values from both different arrays. Here is example :
First record
{ Host:"abc.com" ArtId:"123", tags:[ "tag1", "tag2" ] }
Second record
{ Host:"abc.com" ArtId:"123", tags:[ "tag2", "tag3" ] }
After aggregation on host and artid i need result like this:
{ Host: "abc.com", ArtId: "123", count :"2", tags:[ "tag1", "tag2", "tag3" ]}
I tried $addToset
in group statement but it gives me like this tags :[["tag1","tag2"],["tag2","tag3"]]
Could you please help me how i can achieve this in aggregation
TLDR;
Modern releases should use $reduce
with $setUnion
after the initial $group
as is shown:
db.collection.aggregate([
{ "$group": {
"_id": { "Host": "$Host", "ArtId": "$ArtId" },
"count": { "$sum": 1 },
"tags": { "$addToSet": "$tags" }
}},
{ "$addFields": {
"tags": {
"$reduce": {
"input": "$tags",
"initialValue": [],
"in": { "$setUnion": [ "$$value", "$$this" ] }
}
}
}}
])
You were right in finding the $addToSet
operator, but when working with content in an array you generally need to process with $unwind
first. This "de-normalizes" the array entries and essentially makes a "copy" of the parent document with each array entry as a singular value in the field. That's what you need to avoid the behavior you are seeing without using that.
Your "count" poses an interesting problem though, but easily solved through the use of a "double unwind" after an initial $group
operation:
db.collection.aggregate([
// Group on the compound key and get the occurrences first
{ "$group": {
"_id": { "Host": "$Host", "ArtId": "$ArtId" },
"tcount": { "$sum": 1 },
"ttags": { "$push": "$tags" }
}},
// Unwind twice because "ttags" is now an array of arrays
{ "$unwind": "$ttags" },
{ "$unwind": "$ttags" },
// Now use $addToSet to get the distinct values
{ "$group": {
"_id": "$_id",
"tcount": { "$first": "$tcount" },
"tags": { "$addToSet": "$ttags" }
}},
// Optionally $project to get the fields out of the _id key
{ "$project": {
"_id": 0,
"Host": "$_id.Host",
"ArtId": "$_id.ArtId",
"count": "$tcount",
"tags": "$ttags"
}}
])
That final bit with $project
is also there because I used "temporary" names for each of the fields in other stages of the aggregation pipeline. This is because there is an optimization in $project
that "copies" the fields from an existing stage in the order they already appeared "before" any "new" fields are added to the document.
Otherwise the output would look like:
{ "count":2 , "tags":[ "tag1", "tag2", "tag3" ], "Host": "abc.com", "ArtId": "123" }
Where the fields are not in the same order as you might think. Trivial really, but it matters to some people, so worth explaining why, and how to handle.
So $unwind
does the work to keep the items separated and not in arrays, and doing the $group
first allows you to get the "count" of the occurrences of the "grouping" key.
The $first
operator used later "keeps" that "count" value, as it just got "duplicated" for every value present in the "tags" array. It's all the same value anyway so it does not matter. Just pick one.