MongoDB Document Re-shaping

2019-01-15 23:24发布

站内文章 / MongoDB

18 0

贼婆χ

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

This question comes out of (as mine usually do) perusing the questions asked on SO and as such, raising another question for myself. So apart from the learning exercise in working towards a solution for a problem, I find that another question pops up, such as this.

The original question as yet remains unaccepted by the OP, and indeed has not been clarified as to what "they" wanted to achieve. But I did give my interpretation, in both the simple and long forms of arriving at a solution.

The process, in the end, has left me wondering that considering the long form of the solution, would there be some new feature to be introduced in the next (Currently expecting 2.6) MongoDB release, using the additional aggregation operators that have been introduced.

So the case is as follows:

Sample Documents

{
    "tracked_item_type" : "Software",
    "tracked_item_name" : "Word",
    "duration" : 9540
}
{
    "tracked_item_type" : "Software",
    "tracked_item_name" : "Excel",
    "duration" : 4000
}
{
    "tracked_item_type" : "Software",
    "tracked_item_name" : "Notepad",
    "duration" : 4000
}
{
    "tracked_item_type" : "Site",
    "tracked_item_name" : "Facebook",
    "duration" : 7920
}
{
    "tracked_item_type" : "Site",
    "tracked_item_name" : "Twitter",
    "duration" : 5555
}
{
    "tracked_item_type" : "Site",
    "tracked_item_name" : "Digital Blasphemy",
    "duration" : 8000
}

Desired Result

The top two results by each type, ordered by the total duration. Even though this is a small sample, duration is considered to be a $sum of many items.

{ 
    "tracked_item_type": "Site",
    "tracked_item_name": "Digital Blasphemy",
    "duration" : 8000
}
{ 
    "tracked_item_type": "Site",
    "tracked_item_name": "Facebook",
    "duration" : 7920
}
{ 
    "tracked_item_type": "Software",
    "tracked_item_name": "Word",
    "duration" : 9540
}
{ 
    "tracked_item_type": "Software",
    "tracked_item_name": "Notepad",
    "duration" : 4000
}

Aggregate Solution

This was my lengthy way to solve the problem

db.collection.aggregate([

    // Group on the types and "sum" of duration
    {"$group": {
        "_id": {
            "tracked_item_type": "$tracked_item_type",
            "tracked_item_name": "$tracked_item_name"
         },
        "duration": {"$sum": "$duration"}
    }},

    // Sort by type and duration descending
    {"$sort": { "_id.tracked_item_type": 1, "duration": -1 }},

    /* The fun part */

    // Re-shape results to "sites" and "software" arrays 
    {"$group": { 
        "_id": null,
        "sites": {"$push":
            {"$cond": [
                {"$eq": ["$_id.tracked_item_type", "Site" ]},
                { "_id": "$_id", "duration": "$duration" },
                null
            ]}
        },
        "software": {"$push":
            {"$cond": [
                {"$eq": ["$_id.tracked_item_type", "Software" ]},
                { "_id": "$_id", "duration": "$duration" },
                null
            ]}
        }
    }},


    // Remove the null values for "software"
    {"$unwind": "$software"},
    {"$match": { "software": {"$ne": null} }},
    {"$group": { 
        "_id": "$_id",
        "software": {"$push": "$software"}, 
        "sites": {"$first": "$sites"} 
    }},

    // Remove the null values for "sites"
    {"$unwind": "$sites"},
    {"$match": { "sites": {"$ne": null} }},
    {"$group": { 
        "_id": "$_id",
        "software": {"$first": "$software"},
        "sites": {"$push": "$sites"} 
    }},


    // Project out software and limit to the *top* 2 results
    {"$unwind": "$software"},
    {"$project": { 
        "_id": 0,
        "_id": { "_id": "$software._id", "duration": "$software.duration" },
        "sites": "$sites"
    }},
    {"$limit" : 2},


    // Project sites, grouping multiple software per key, requires a sort
    // then limit the *top* 2 results
    {"$unwind": "$sites"},
    {"$group": {
        "_id": { "_id": "$sites._id", "duration": "$sites.duration" },
        "software": {"$push": "$_id" }
    }},
    {"$sort": { "_id.duration": -1 }},
    {"$limit": 2}

])

The "Not quite there yet" Output

And the point where aggregation falls short of getting to the final result. At least to my current understanding.

{
    "result" : [
        {
            "_id" : {
                "_id" : {
                    "tracked_item_type" : "Site",
                    "tracked_item_name" : "Digital Blasphemy"
                 },
                 "duration" : 8000
           },
            "software" : [
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Word"
                    },
                    "duration" : 9540
                },

                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Notepad"
                    },
                    "duration" : 4000
                }
            ]
        },
        {
            "_id" : {
                "_id" : {
                    "tracked_item_type" : "Site",
                    "tracked_item_name" : "Facebook"
                },
                "duration" : 7920
            },
            "software" : [
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Word"
                    },
                    "duration" : 9540
                },
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Notepad"
                    },
                    "duration" : 4000
                }
            ]
        }
    ],
    "ok" : 1
}

This all seemed very reasonable (to Me anyway) that the result, while not complete could be post-processed in code in order to massage it into the desired form.

But indeed, it seems an exercise, and a point of intrigue as to whether this could be achieved with the use of any upcoming features for aggregation (or possibly another technique that has eluded me) to get to the desired result form.

So feel free to answer with any, suggestions / pointers as to how this could be achieved.

回答1:

Here is an aggregation that finds the top two by duration in each category (it does break "ties" arbitrarily, which seems to be in line with your sample output):

var pregroup = { "$group" : {
        "_id" : {
            "type" : "$tracked_item_type",
            "name" : "$tracked_item_name"
        },
        "duration" : {
            "$sum" : "$duration"
        }
    }
};
var sort = { "$sort" : { "_id.type" : 1, "duration" : -1 } };
var group1 = { "$group" : {
        "_id" : "$_id.type",
        "num1" : {
            "$first" : {
                "name" : "$_id.name",
                "dur" : "$duration"
            }
        },
        "other" : {
            "$push" : {
                "name" : "$_id.name",
                "dur" : "$duration"
            }
        },
    "all" : {
        "$push" : {
            "name" : "$_id.name",
            "dur" : "$duration"
        }
    }
    }
};
var unwind = { "$unwind" : "$other" };
project = {
    "$project" : {
        "keep" : {
            "$ne" : [
                "$num1.name",
                "$other.name"
            ]
        },
        "num1" : 1,
        "all" : 1,
        "other" : 1
    }
};
var match = { "$match" : { "keep" : true } };
var sort2 = { "$sort" : { "_id" : 1, "other.dur" : -1 } };
var group2 = { "$group" : {
        "_id" : "$_id",
        "numberOne" : {
            "$first" : "$num1"
        },
        "numberTwo" : {
            "$first" : "$other"
        },
    "all" : {
        "$first" : "$all"
    }
    }
};
unwind2 = { "$unwind" : "$all" };
project2 = { "$project" : {
    "_id" : 0,
    "tracked_item_type" : "$_id",
    "tracked_item_name" : {
        "$cond" : [
            {
                "$or" : [
                    {
                        "$eq" : [
                            "$all.name",
                            "$numberOne.name"
                        ]
                    },
                    {
                        "$eq" : [
                            "$all.name",
                            "$numberTwo.name"
                        ]
                    }
                ]
            },
            "$all.name",
            null
        ]
    },
    "duration" : {
        "$cond" : [
            {
                "$or" : [
                    {
                        "$eq" : [
                            "$all.name",
                            "$numberOne.name"
                        ]
                    },
                    {
                        "$eq" : [
                            "$all.name",
                            "$numberTwo.name"
                        ]
                    }
                ]
            },
            "$all.dur",
            null
        ]
    }
}
}
match2 = { "$match" : { "tracked_item_name" : { "$ne" : null } } };

Running this with your sample data:

db.top2.aggregate(pregroup, sort, group1, unwind, project, match, sort2, group2, unwind2, project2, match2).toArray()
[
    {
        "tracked_item_type" : "Software",
        "tracked_item_name" : "Word",
        "duration" : 9540
    },
    {
        "tracked_item_type" : "Software",
        "tracked_item_name" : "Notepad",
        "duration" : 4000
    },
    {
        "tracked_item_type" : "Site",
        "tracked_item_name" : "Digital Blasphemy",
        "duration" : 8000
    },
    {
        "tracked_item_type" : "Site",
        "tracked_item_name" : "Facebook",
        "duration" : 7920
    }
]

This will work with arbitrary number of domains (different tracked item type values) and you don't need to know all their names in advance. However, to generalize it to top three, top four, top five, etc. will add four more stages for each additional top "N" value - not very practical or pretty.

Please vote up this jira ticket to get a more native implementation of "top N" functionality in the aggregation framework.

回答2:

Got my own answer in first!

I wasn't expecting that, but there is an answer that can be found in the implementation of 2.6 which includes some great new Set Operators.

So what I thought about (eventually) was that the problem came down to having two separate lists that need to be one, so how to merge these items so that they all fall under one field. So there is the obvious operator for this, $setUnion.

Here's the new fragment first, which I'll explain in parts:

// So this part just "normalizes" a little so we get one record that essentially has
// two arrays in it
{"$group": { 
    _id: { _id: null, software: "$software"  },
    sites: {$push:"$_id" }
}},

And the resulting document:

{
    "_id" : {
        "_id" : null,
        "software" : [
             {
                 "_id" : {
                     "tracked_item_type" : "Software",
                     "tracked_item_name" : "Word"
                 },
             "duration" : 9540
             },
             {
                 "_id" : {
                     "tracked_item_type" : "Software",
                     "tracked_item_name" : "Notepad"
             },
             "duration" : 4000
         }
        ]
    },
    "sites" : [
        {
            "_id" : {
                "tracked_item_type" : "Site",
                "tracked_item_name" : "Digital Blasphemy"
            },
            "duration" : 8000
        },
        {
            "_id" : {
                "tracked_item_type" : "Site",
                "tracked_item_name" : "Facebook"
            },
            "duration" : 7920
        }
    ]
}

So that document, which is arguably a better form of the result than where I left of before, considering the items are no longer duplicated, essentially has the two lists we want to merge as one. So all there is to do now is use the operator that facilitates this merge:

// Then we just project with a new field, and the "$setUnion" of the two arrays
{"$project": { 
    "_id": 0,
    "records": {"$setUnion": ["$_id.software", "$sites"]} 
}},

And that brings us to here:

{
    "records" : [
        {
            "_id" : {
                "tracked_item_type" : "Site",
                "tracked_item_name" : "Facebook"
            },
            "duration" : 7920
        },
        {
            "_id" : {
                "tracked_item_type" : "Software",
                "tracked_item_name" : "Word"
            },
            "duration" : 9540
        },
        {
            "_id" : {
                "tracked_item_type" : "Site",
                "tracked_item_name" : "Digital Blasphemy"
            },
            "duration" : 8000
        },
        {
            "_id" : {
                "tracked_item_type" : "Software",
                "tracked_item_name" : "Notepad"
            },
            "duration" : 4000
        }
    ]
}

And basically that's it. Now we just have the four items, so with a little "unwinding", projection and a sort, we'll get the exact result that I was looking for.

So here's the whole thing, just for the record:

db.collection.aggregate([

    // Group on the types and "sum" of duration
    {"$group": {
        "_id": {
            "tracked_item_type": "$tracked_item_type",
            "tracked_item_name": "$tracked_item_name"
         },
        "duration": {"$sum": "$duration"}
    }},

    // Sort by type and duration descending
    {"$sort": { "_id.tracked_item_type": 1, "duration": -1 }},

    /* The fun part */

    // Re-shape results to "sites" and "software" arrays 
    {"$group": { 
        "_id": null,
        "sites": {"$push":
            {"$cond": [
                {"$eq": ["$_id.tracked_item_type", "Site" ]},
                { "_id": "$_id", "duration": "$duration" },
                null
            ]}
        },
        "software": {"$push":
            {"$cond": [
                {"$eq": ["$_id.tracked_item_type", "Software" ]},
                { "_id": "$_id", "duration": "$duration" },
                null
            ]}
        }
    }},


    // Remove the null values for "software"
    {"$unwind": "$software"},
    {"$match": { "software": {"$ne": null} }},
    {"$group": { 
        "_id": "$_id",
        "software": {"$push": "$software"}, 
        "sites": {"$first": "$sites"} 
    }},

    // Remove the null values for "sites"
    {"$unwind": "$sites"},
    {"$match": { "sites": {"$ne": null} }},
    {"$group": { 
        "_id": "$_id",
        "software": {"$first": "$software"},
        "sites": {"$push": "$sites"} 
    }},


    // Project out software and limit to the *top* 2 results
    {"$unwind": "$software"},
    {"$project": { 
        "_id": 0,
        "_id": { "_id": "$software._id", "duration": "$software.duration" },
        "sites": "$sites"
    }},
    {"$limit" : 2},


    // Project sites, grouping multiple software per key, requires a sort
    // then limit the *top* 2 results
    {"$unwind": "$sites"},
    {"$group": {
        "_id": { "_id": "$sites._id", "duration": "$sites.duration" },
        "software": {"$push": "$_id" }
    }},
    {"$sort": { "_id.duration": -1 }},
    {"$limit": 2},

    // So this part just "normalizes" a little so we get one record that
    // essentially has two arrays in it
    {"$group": { 
        _id: { _id: null, software: "$software"  },
        sites: {$push:"$_id" }
    }},

    // Then we just project with a new field, and the "$setUnion" of the two arrays
    {"$project": { 
        "_id": 0,
       "records": {"$setUnion": ["$_id.software", "$sites"]} 
    }},

    // Unwind the array to documents
    {"$unwind": "$records"},

    // Shape the final output
    {"$project": { 
        "tracked_item_type": "$records._id.tracked_item_type",
        "tracked_item_name": "$records._id.tracked_item_name",
        "duration": "$records.duration"
    }},

     // Final sort on the result
    {"$sort": { "tracked_item_type": 1, "duration": -1 }} 

])

Obviously there is a drop-off point where the whole approach becomes impractical, as the general premise is derived from being able to $push all of the documents into their own arrays in order that the top results can then be pulled out by eventually calling $limit on those results.

As such, where there would be a large number of results for each "category", it is then likely a more practical approach to process the "category" individually, and then simply limit each of those results to the top two items that are required.

But as an exercise, at least I know now that this can be done. Hope this was all useful to someone.

I Am Still interested to see if anyone has another approach though.