Filter results by the Last Array Entry Field Value

2019-04-12 06:08发布

问题:

Having this document structure (omitting irrelevant fields for brevity):

[
    {
        "_id" : 0,
        "partn" : [ 
            {
                "date" : ISODate("2015-07-28T00:59:14.963Z"),
                "is_partner" : true
            }, 
            {
                "date" : ISODate("2015-07-28T01:00:32.771Z"),
                "is_partner" : false
            }, 
            {
                "date" : ISODate("2015-07-28T01:15:29.916Z"),
                "is_partner" : true
            }, 
            {
                "date" : ISODate("2015-08-05T13:48:07.035Z"),
                "is_partner" : false
            }, 
            {
                "date" : ISODate("2015-08-05T13:50:56.482Z"),
                "is_partner" : true
            }
        ]
    },
    {
        "_id" : 149,
        "partn" : [ 
            {
                "date" : ISODate("2015-07-30T12:42:18.894Z"),
                "is_partner" : true
            }, 
            {
                "date" : ISODate("2015-07-31T00:01:51.176Z"),
                "is_partner" : false
            }
        ]
    }
]

I need to filter documents where the last (most recent) partn.is_partner is true, is this the best way to do it?

db.somedb
    .aggregate([ 
        // pre-filter only the docs with at least one is_partner === true, is it efficient/needed?
        {$match: {partn: { $elemMatch: { is_partner: true } } } },
        {$unwind: '$partn'},
        // do I need to sort by _id too, here?
        {$sort: {_id: 1, 'partn.date': 1} },
        // then group back fetching the last one by _id
        {$group : {
           _id : '$_id',
           partn: {$last: '$partn'},
        }},
        // and return only those with is_partner === true
        {$match: {'partn.is_partner': true } },
    ])

I get what I need but, being a not-that-experient mongodb developer something feels like overhead in that aggregation. I thought about just fetching the last entry on each .partn array, but the collection must be exported/imported sometimes, if I remember right the sort order can be changed - so aggregating and sorting by date could fail-proof that aspect.

Is this the best (most efficient) way to do it? If not, why?

Thanks. (Btw, this is MongoDB 2.6)

回答1:

Mileage may vary on this and it may well turn out that "currently" the process you are following works out to be "most suited" at least. But we can probably do more efficient.

What you could do now

Provided your arrays are already "sorted" via using the $sort modifier with $push, then you can probably do this:

db.somedb.find(
  { 
    "partn.is_partner": true,
    "$where": function() {
      return this.partn.slice(-1)[0].is_partner == true;
    }
  },
  { "partn": { "$slice": -1 } }
)

So as long as partn,is_partner is "indexed" this is still pretty efficient as that initial query condition can be met using an index. The part that cannot is the $where clause here that uses JavaScript evaluation.

But what that second part in the $where is doing is simply "slicing" the last element from the array and testing it's value of the is_partner property to see if it is true. Only if that condition is also met is the document returned.

There is also the $slice projection operator. This does the same thing in returning the last element from the array. False matches are already filtered, so this is just showing only the last element where true.

Combined with with the index as mentioned, then this should be pretty quick given that the documents have been selected already and the JavaScript condition just filters the rest. Note that without another field with a standard query condition to match, a $where clause cannot use an index. So always try to use "sparingly" with other query conditions in place.

What you can do in the future

Next Up, while not available at the time of writing, but certainly in the near future will be the $slice operator for the aggregation framework. This is currently in the develpment branch, but here is a peek at how it works:

db.somedb.aggregate([
  { "$match": { "partn.is_partner": true } },
  { "$redact": {
    "$cond": {
      "if": { 
        "$anyElementTrue": {
          "$map": {
            "input": { "$slice": ["$partn",-1] },
            "as": "el",
            "in": "$$el.is_partner"
          }
        }
      },
      "then": "$$KEEP",
      "else": "$$PRUNE"
    }
  }},
  { "$project": {
      "partn": { "$slice": [ "$partn",-1 ] }
  }}
])

Combining that $slice within a $redact stage here alows the documents to be filtered with a logical condition, testing the document. In this case the $slice produces a single element array that is sent to $map in order to just extract the single is_partner value ( still as an array ). As this is still a single element array at best, the other test is $anyElementTrue which makes this a singular boolean result, suitable for $cond.

The $redact here decides on that result whether to $$KEEP or $$PRUNE the document from the results. Later we use $slice again in the project to just return the last element of the array after the filtering.

That works out to be pretty much exactly what the JavaScript version does, with the exception that this is using all native coded operators, and therefore should be a bit faster than the JavaScript alternate.

Both forms return your first document as expected:

{
    "_id" : 0,
    "partn" : [
            {
                    "date" : ISODate("2015-07-28T00:59:14.963Z"),
                    "is_partner" : true
            },
            {
                    "date" : ISODate("2015-07-28T01:00:32.771Z"),
                    "is_partner" : false
            },
            {
                    "date" : ISODate("2015-07-28T01:15:29.916Z"),
                    "is_partner" : true
            },
            {
                    "date" : ISODate("2015-08-05T13:48:07.035Z"),
                    "is_partner" : false
            },
            {
                    "date" : ISODate("2015-08-05T13:50:56.482Z"),
                    "is_partner" : true
            }
    ]
}

The big catch here with both is your array must already be sorted so the latest date is first. Without that, then you need the aggregation framework to $sort the array, just as you are doing now.

Not really efficient, so that is why you should "pre-sort" your array and maintain the order on each update.

As a handy trick, this will actually re-order all the array elements in all collection documents in one simple statement:

db.somedb.update(
    {},
    { "$push": { 
        "partn": { "$each": [], "$sort": { "date": 1 } }
    }},
    { "multi": true }
)

So even if you are not "pushing" a new element into an array and just updating a property, you can always apply that basic construct to keep the array ordered how you want it.

Worth considering as it should make things much faster.