Match at least “N” elements of an array to a list

2019-04-01 11:00发布

问题:

I have the following scenario: One of my mongo collections has documents in the following format:

user: "test",
tracks: [{artist: "A", ...}, {artist: "B", ...}, ..., { artist: "N", ...}]

I would like to extract all the tracks, whose artists are in a given array arr. For that purpose I'm using the following query (which works fine).

collection.find({ tracks: { $elemMatch: { artist: { $in: arr }}}})

However, now I would like to modify the query so that it returns only those documents in the collection that have tracks performed by at least let's say 3 different artists from the arr array. How can I achieve that (apart from filtering the results after they are returned from the db, which is not an option)?

回答1:

Your question has two possibilities to me, but perhaps some explanation to get you started.

First of all I need to explain to you that you misunderstand the intent of $elemMatch and it is misused in this case.

The idea of $elemMatch is to create a "query document" which is actually applied to the elements of the array. The intent is where you have "multiple conditions" on a document within the array in order to match it discretely within the member document, and not within the whole array of the outer document. i.e:

{
   "data": [
       { "a": 1, "b": 3 },
       { "a": 2, "b": 2 }
   ]
}

And the following query will work, even though no actual single element in that array matches, but the whole document does:

db.collection.find({ "data.a": 1, "data.b": 2 })

But to check if an actual element matches both of those conditions, this is where you use $elemMatch:

db.collection.find({ "data": { "a": 1, "b": 2 } })

So no match in that sample, and it will only match where a specific array element had both of those elements.


Now we have $elemMatch explained, here's your simplified query:

db.collection.find({ "tracks.artist": { "$in": arr } })

Much more simple, and it works by looking at all array members by a single field and returning where any element in the document contains at least one of those possible results.

But not what you are asking, so on with your question. If you read through that last statement you should come to realize that $in is actually an $or condition. It's just a shortened form for asking "or" over the same element in the document.

With that in mind, at the heart of what you are asking is for an "and" operation where all "three" values are contained. Assuming that you were only sending "three" items in the test then you could use a form of $and which is in the shortened form of $all:

db.collection.find({ "tracks.artist": { "$all": arr } })

That would only return you the documents that had the element within members of that array matching "all" of the elements specified in the test condition. That may well be what you want, but there is the case where of course you want to specify a list of say, "four or more" artists to test and only want "three" or some lesser number out of that, in which case an $all operator is too terse.

But there is a logical way to solve this, it just takes a little more processing with operators not available to basic queries but that are available to the aggregation framework:

var arr = ["A","B","C","D"];     // List for testing

db.collection.aggregate([
    // Match conditions for documents to narrow down
    { "$match": {
        "tracks.artist": { "$in": arr },
        "tracks.2": { "$exists": true }      // you would construct in code
    }},

    // Test the array conditions
    { "$project": {
        "user": 1,
        "tracks": 1,                         // any fields you want to keep
        "matched": {
            "$gte": [
                 { "$size": {
                     "$setIntersection": [
                         { "$map": {
                             "input": "$tracks",
                             "as": "t",
                             "in": { "$$t.artist" }
                         }},
                         arr
                     ]
                 }},
                 3
             ]
        }
    }},

    // Filter out anything that did not match
    { "$match": { "matched": true } }
])

The first stage implements a standard query $match condition in order to filter the documents to only those that are "likely" to match the conditions. The logical case here is to use $in as before at it will find those documents where at least one of the elements present in your "test" array is present within at least one of the member fields in the documents own array.

The next clause is something you should ideally be building in code as it relates to the "length" of the array. The idea here is where you want at least "three" matches then the array you are testing in the document must have at least "three" elements in order to meet that, so no point in retrieving documents with "two" or less array elements since they can never match "three".

Since all MongoDB queries are essentially just a representation of a data structure, it makes this very easy to build. i.e, for JavaScript:

var matchCount = 3;    // how many matches we want

var match1 = { "$match": { "tracks.artist": { "$in": arr } } };

match1["$match"]["tracks."+ (matchCount-1)] = { "$exits": true };

The logic there is that the "dot notation" form with $exists tests for the presence of an element at the specified index ( n-1 ), and it needs to be there for the array to at least be of that length.

The rest of the narrowing down ideally uses the $setIntersection method in order to return the matched elements between the actual array and the tested array. Since the array in the document does not match the structure for the "test array" it needs to be transformed via the $map operation which is set to only return the "artist" field from each array element.

As the "intersection" of those two arrays is made, it is finally tested for the $size of that resulting list of common elements where the test is applied to see that "at least three" of those elements were found to be in common.

Finally you just "filter out" anything that was not true using a $match condition.


Ideally you are using MongoDB 2.6 or greater in order to have those operators available. For the earlier versions of 2.2.x and 2.4.x, it's still possible, but just a bit more work and processing overhead:

db.collection.aggregate([
    // Match conditions for documents to narrow down
    { "$match": {
        "tracks.artist": { "$in": arr },
        "tracks.2": { "$exists": true }      // you would construct in code
    }},

    // Unwind the document array
    { "$unwind": "$tracks" },

    // Filter the content
    { "$match": { "tracks.artist": { "$in": arr } }},

    // Group for distinct values
    { "$group": {
        "_id": { 
           "_id": "$_id",
           "artist": "$tracks.artist"
        }
    }},

    // Make arrays with length
    { "$group": {
        "_id": "$_id._id",
        "artist": { "$push": "$_id.artist" },
        "length": { "$sum": 1 }
    }},

    // Filter out the sizes
    { "$match": { "length": { "$gte": 3 } }}
])