MongoDB to assist with recommendations

2019-07-15 04:27发布

问题:

I have a 3 collection schema as shown below:

  1. User collection has information regarding their friends and the listening count(weight) per artist

    {
        user_id : 1,
        Friends : [3,5,6],
        Artists : [
            {artist_id: 10 , weight : 345},
            {artist_id: 17 , weight : 378}
        ]
    }
    
  2. Artist collection schema has information regarding the name of the artist, the tags given by various users to them.

    {
        artistID : 56,
        name : "Ed Sheeran",
        user_tag : [
           {user_id : 2, tag_id : 6},
           {user_id : 2, tag_id : 5},
           {user_id : 3, tag_id : 7}
        ]
    }
    
  3. Tags collection having information about the various tags.

    {tag_id : 3, tag_value : "HipHop"}
    

I want to provide a user with recommendations for artists by using the below rules:

Rule 1 : Find artists listened by the user's friends but not the user, order them by the sum of friends' listening counts.

Rule 2 : select any tag used by the user, find all artist with this tag who are not in the user's listening list, and order them by the number of unique listener.

Can anybody help me write a query to perform the above.

回答1:

You need to do a couple of things here for your end result, but the first stages are relatively simple. Take the user object you provide:

var user = {
    user_id : 1,
    Friends : [3,5,6],
    Artists : [
        {artist_id: 10 , weight : 345},
        {artist_id: 17 , weight : 378}
    ]
};

Now presuming you already have that data retrieved, then this comes down to finding the same structures for each "friend" and filtering out the array content of "Artists" into a single distinct list. Presumably each "weight" will also be considered in total here.

This is a simlple aggregation operation that will first filter out the artists already in in the list for the given user:

var artists = user.Artists.map(function(artist) { return artist.artist_id });

User.aggregate(
    [ 
        // Find possible friends without all the same artists
        { "$match": {
            "user_id": { "$in": user.Friends },
            "Artists.artist_id": { "$nin": artists }
        }},
        // Pre-filter the artists already in the user list
        { "$project": 
            "Artists": {
                "$setDifference": [
                    { "$map": {
                        "input": "$Artists",
                        "as": "$el",
                        "in": {
                            "$cond": [
                                "$anyElementTrue": {
                                    "$map": {
                                        "input": artists,
                                        "as": "artist",
                                        "in": { "$eq": [ "$$artist", "$el.artist_id" ] }
                                    }
                                },
                                false,
                                "$$el"
                            ]
                        } 
                    }}
                    [false]
                ]
            } 
        }},
        // Unwind the reduced array
        { "$unwind": "$Artists" },
        // Group back by each artist and sum weights
        { "$group": {
            "_id": "$Artists.artist_id",
            "weight": { "$sum": "$Artists.weight" }
        }},
        // Sort the results by weight
        { "$sort": { "weight": -1 } }
    ],
    function(err,results) {
        // more to come here
    }
);

The "pre-filter" is the only really tricky part here. You could just $unwind the array and $match again to filter out the entries you don't want. Even though we want to $unwind the results later in order to combine them, it works out more efficient to remove them from the array "first", so there is less to expand.

So here the $map operator allows inspection of each element of the user "Artists" array and also for comparison against the filtered "user" artists list to just return the wanted details. The $setDifference is used to actually "filter" any results that were not returned as the array content, but rather returned as false.

After that there is just the $unwind to de-normalize the content in the array and the $group to bring together a total per artist. For fun we are using $sort to show that the list is returned in desired order, but that will not be necessary at a later stage.

That is at least part of the way along here as the resulting list should only be other artists not already in the user's own list, and sorted by the summed "weight" from any artists that could possibly appear on multiple friends.

The next part is going to need data from the "artists" collection in order to take the number of listeners into account. Whilst mongoose has a .populate() method, you really don't want this here as you are looking for the "distinct user" counts. This implies another aggregation implementation in order to get those distinct counts for each artist.

Following on from the result list of the previous aggregation operation, you would use the $_id values like this:

// First get just an array of artist id's
var artists = results.map(function(artist) {
    return artist._id;
});

Artist.aggregate(
    [
        // Match artists
        { "$match": {
            "artistID": { "$in": artists }
        }},
        // Project with weight for distinct users
        { "$project": {
            "_id": "$artistID",
            "weight": {
                "$multiply": [
                    { "$size": {
                        "$setUnion": [
                            { "$map": {
                                "input": "$user_tag",
                                "as": "tag",
                                "in": "$$tag.user_id"
                            }},
                            []
                        ]
                    }},
                    10
                ]
            }
        }}
    ],
    function(err,results) {
        // more later
    }
);

Here the trick is done in aggregate with $map to do a similar transform of values which is fed to $setUnion to make them a unique list. Then the $size operator is applied to find out how big that list is. The additional math is to give that number some meaning when applied against the already recorded weights from the previous results.

Of course you need to bring all of this together somehow, as right now there are just two distinct sets of results. The basic process is a "Hash Table", where the unique "artist" id values are used as a key and the "weight" values are combined.

You can do this in a number of ways, but since there is a desire to "sort" the combined results then my prefernce would be something "MongoDBish" since it follows the basic methods you already should be used to.

A handy way to implement this is using nedb, which provides an "in memory" store that uses much of the same type of methods as used to read and write to MongoDB collections.

This also scales well if you needed to use an actual collection for large results, as all the principles remain the same.

  1. First aggregation operation inserts new data to the store

  2. Second aggregation "updates" that data an increments the "weight" field

As a complete function listing, and with some other help of the async library it would look like this:

function GetUserRecommendations(userId,callback) {

    var async = require('async')
        DataStore = require('nedb');

    User.findOne({ "user_id": user_id},function(err,user) {
        if (err) callback(err);

        var artists = user.Artists.map(function(artist) {
            return artist.artist_id;
        });

        async.waterfall(
            [
                function(callback) {
                    var pipeline =  [ 
                        // Find possible friends without all the same artists
                        { "$match": {
                            "user_id": { "$in": user.Friends },
                            "Artists.artist_id": { "$nin": artists }
                        }},
                        // Pre-filter the artists already in the user list
                        { "$project": 
                            "Artists": {
                                "$setDifference": [
                                    { "$map": {
                                        "input": "$Artists",
                                        "as": "$el",
                                        "in": {
                                            "$cond": [
                                                "$anyElementTrue": {
                                                    "$map": {
                                                        "input": artists,
                                                        "as": "artist",
                                                        "in": { "$eq": [ "$$artist", "$el.artist_id" ] }
                                                    }
                                                },
                                                false,
                                                "$$el"
                                            ]
                                        } 
                                    }}
                                    [false]
                                ]
                            } 
                        }},
                        // Unwind the reduced array
                        { "$unwind": "$Artists" },
                        // Group back by each artist and sum weights
                        { "$group": {
                            "_id": "$Artists.artist_id",
                            "weight": { "$sum": "$Artists.weight" }
                        }},
                        // Sort the results by weight
                        { "$sort": { "weight": -1 } }
                    ];

                    User.aggregate(pipeline, function(err,results) {
                        if (err) callback(err);

                        async.each(
                            results,
                            function(result,callback) {
                                result.artist_id = result._id;
                                delete result._id;
                                DataStore.insert(result,callback);
                            },
                            function(err)
                                callback(err,results);
                            }
                        );

                    });
                },
                function(results,callback) {

                    var artists = results.map(function(artist) {
                        return artist.artist_id;  // note that we renamed this
                    });

                    var pipeline = [
                        // Match artists
                        { "$match": {
                            "artistID": { "$in": artists }
                        }},
                        // Project with weight for distinct users
                        { "$project": {
                            "_id": "$artistID",
                            "weight": {
                                "$multiply": [
                                    { "$size": {
                                        "$setUnion": [
                                            { "$map": {
                                                "input": "$user_tag",
                                                "as": "tag",
                                                "in": "$$tag.user_id"
                                            }},
                                            []
                                        ]
                                    }},
                                    10
                                ]
                            }
                        }}
                    ];

                    Artist.aggregate(pipeline,function(err,results) {
                        if (err) callback(err);
                        async.each(
                            results,
                            function(result,callback) {
                                result.artist_id = result._id;
                                delete result._id;
                                DataStore.update(
                                    { "artist_id": result.artist_id },
                                    { "$inc": { "weight": result.weight } },
                                    callback
                                );
                            },
                            function(err) {
                                callback(err);
                            }
                        );
                    });
                }
            ],
            function(err) {
                if (err) callback(err);     // callback with any errors
                // else fetch the combined results and sort to callback
                DataStore.find({}).sort({ "weight": -1 }).exec(callback);
            }
        );

    });

}

So after matching the initial source user object the values are passed into the first aggregate function, which is executing in series and using async.waterfall to pass it's result.

Before that happens though the aggregation results are added to the DataStore with regular .insert() statements, taking care to rename the _id fields as nedb does not like anything other than it's own self generated _id values. Each result is inserted with artist_id and weight properties from the aggregation result.

That list is then passed to the second aggregation operation which is going to return each specified "artist" with a calculated "weight" based on the distinct user size. There are the "updated" with the same .update() statement on the DataStore for each artist and incrementing the "weight" field.

All going well, the final operation is to .find() those results and .sort() them by the combined "weight", and simply return the result to the passed in callback to the function.

So you would use it like this:

GetUserRecommendations(1,function(err,results) {
   // results is the sorted list
});

And it is going to return all of the artists not presently in that user's list but in their friends lists and ordered by the combined weights of the friend listening count plus the score from the number of distinct users of that artist.

This is how you deal with data from two different collections that you need to combine into a single result with various aggregated details. It's multiple queries and a working space, but also part of the MongoDB philosopy that such operations are better performed this way than throwing them at the database to "join" results.