Range support over time-bucketed analytics


I read this and tried experimenting with it.

Below is my schema.

SCHEMA:

{
    "s"  : "CB",
    "c"  : "REQ_RCV",
    "e"  : "sms_click",
    "st" : "i",
    "b"  : "2",
    "a"  : "1",
    "u"  : "b1_h1_d1_m1_user_2",
    "c#" : "b1_h1_d1_m1_cr-2",
    "@"  : ISODate("2016-10-01T06:03:00.000Z"), //Indexed
    "@h" : "16100106", //hourly bucket
    "@d" : "161001",                            //Indexed
    "@m" : "1610"
}

And below are the log entries showing the plan summaries for the two queries:

> 2017-01-22T13:43:47.764+0530 I COMMAND  [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
> aggregate: "template3", pipeline: [ { $match: { @: { $gte: new
> Date(1483228800000), $lte: new Date(1483315199000) } } }, { $group: {
> _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0 } } }, { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s", TIME:
> "$_id.HOURLY", count: 1.0 } }, { $sort: { SERVICE: 1.0, BUCKET: 1.0,
> TIME: 1.0 } } ], cursor: {} } planSummary: IXSCAN { @: 1.0 }
> keysExamined:106888 docsExamined:106888 hasSortStage:1
> cursorExhausted:1 numYields:925 nreturned:96 reslen:7095 locks:{
> Global: { acquireCount: { r: 1860 } }, Database: { acquireCount: { r:
> 930 } }, Collection: { acquireCount: { r: 929 } } }
> protocol:op_command **3499ms**


> 2017-01-22T13:44:24.825+0530 I COMMAND  [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
> aggregate: "template3", pipeline: [ { $match: { @d: "170101" } }, {
> $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0
> } } }, { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s",
> TIME: "$_id.HOURLY", count: 1.0 } }, { $sort: { SERVICE: 1.0, BUCKET:
> 1.0, TIME: 1.0 } } ], cursor: {} } planSummary: IXSCAN { @d: 1.0 } keysExamined:106888 docsExamined:106888 hasSortStage:1
> cursorExhausted:1 numYields:865 nreturned:96 reslen:7095 locks:{
> Global: { acquireCount: { r: 1740 } }, Database: { acquireCount: { r:
> 870 } }, Collection: { acquireCount: { r: 869 } } }
> protocol:op_command **1294ms**

Questions:

  1. Though both queries examined the same number of documents, why is there a time difference in the output?
  2. Are the $lte and $gte operators slow only on date ranges, or are they equally slow on numeric comparisons?
  3. Since bucketing gives a faster response, how do I use bucketing for range queries? I can make multiple aggregate calls, one per time bucket, to support range queries, but that increases round-trip time. Any suggestions?
  4. Is it possible to use $and across two $match stages in an aggregation query to support range bucketing? Aggregation accepts multiple $match stages, but the output of the first $match is fed into the second $match; what I want is to union the results of the individual $match stages and pass them to the next pipeline stage.

Possible answer for Q4:

db.template3.aggregate([ 
    {
        $match: {
            $or: [
                {"@d":"170301"},
                {"@d":"170306"}, 
                {"@d":"170202"}, 
                {"@d":"170303"},
                {"@d":"170304"}, 
                {"@d":"170305"}
            ]
        }
    },
    { $project: { _id: 0, "b": 1, "s": 1, "@h": 1 } }, 
    {
        $group: {
            _id: {"b": "$b", "HOURLY": "$@h", "s": "$s" }, 
            count: { $sum: 1 } 
        }
    },  
    { 
        $project: {
            _id: 0, 
            "BUCKET": "$_id.b",
            "SERVICE": "$_id.s",
            "TIME": "$_id.HOURLY", 
            count: 1
        }
    },         
    { $sort: { "SERVICE": 1, "BUCKET": 1, "TIME": 1 } } 
]);

With this approach we can mix bucket granularities (daily and monthly), and each $or clause will still use its own index. See https://docs.mongodb.com/manual/reference/operator/query/or/#or-clauses-and-indexes.
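
Since "@d" is indexed (per the schema above), you can sanity-check that the $or is answered from that index rather than a collection scan:

// Expect the winning plan to show an IXSCAN on { "@d": 1 } with one bound
// per listed day (the server may also rewrite the $or into an $in).
db.template3.find({ $or: [ { "@d": "170301" }, { "@d": "170306" } ] }).explain();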

Sample query:

db.template3.aggregate([
    { $match: { "@h": { $gte: 17020511, $lte: 17030511 }, "st": "i" } },
    { $project: { _id: 0, "@h": 1, "c": 1, "@m": 1 } },
    { $group: { _id: { "HOURLY": "$@h", "c": "$c" }, count: { $sum: 1 } } },
    { $project: { _id: 0, "COUNTER": "$_id.c", "TIME": "$_id.HOURLY", count: 1 } },
    { $sort: { "COUNTER": 1, "TIME": 1 } }
]);

Output:

{ "count" : 2255, "COUNTER" : "REQ_RCVD", "TIME" : 17020511 }
{ "count" : 28888, "COUNTER" : "REQ_RCVD", "TIME" : 17020600 }
{ "count" : 37613, "COUNTER" : "REQ_RCVD", "TIME" : 17020601 }
{ "count" : 6723, "COUNTER" : "REQ_RCVD", "TIME" : 17020602 }
{ "count" : 14057, "COUNTER" : "REQ_RCVD", "TIME" : 17020603 }
{ "count" : 12405, "COUNTER" : "REQ_RCVD", "TIME" : 17020604 }
{ "count" : 2392, "COUNTER" : "REQ_RCVD", "TIME" : 17020611 }
{ "count" : 28784, "COUNTER" : "REQ_RCVD", "TIME" : 17020700 }
{ "count" : 37494, "COUNTER" : "REQ_RCVD", "TIME" : 17020701 }
{ "count" : 6697, "COUNTER" : "REQ_RCVD", "TIME" : 17020702 }
{ "count" : 13930, "COUNTER" : "REQ_RCVD", "TIME" : 17020703 }
{ "count" : 12493, "COUNTER" : "REQ_RCVD", "TIME" : 17020704 }
{ "count" : 2225, "COUNTER" : "REQ_RCVD", "TIME" : 17020711 }
{ "count" : 28821, "COUNTER" : "REQ_RCVD", "TIME" : 17020800 }
{ "count" : 37949, "COUNTER" : "REQ_RCVD", "TIME" : 17020801 }
{ "count" : 6676, "COUNTER" : "REQ_RCVD", "TIME" : 17020802 }
{ "count" : 14039, "COUNTER" : "REQ_RCVD", "TIME" : 17020803 }
{ "count" : 12349, "COUNTER" : "REQ_RCVD", "TIME" : 17020804 }
{ "count" : 2332, "COUNTER" : "REQ_RCVD", "TIME" : 17020811 }
{ "count" : 28379, "COUNTER" : "REQ_RCVD", "TIME" : 17020900 }

OPTIMIZATION

  1. I suspect the extra time is taken by reading non-indexed fields, hence docsExamined: 106888 (see the index sketch after this list).

  2. There is a noticeable improvement when I use $project before $group.

  3. I changed the "@h" data type from String to Integer (NumberInt); I think it will improve things some more.

1 Answer

Let's go through your questions one by one:

Though both queries examined the same number of documents, why is there a time difference in the output?

Looking at performance metrics from a single execution is not a reliable way to measure; take the average over several executions before concluding, because several factors are at play. That said, MongoDB caches the most frequently used documents in memory and keeps them there unless it has to yield memory to other documents. So if a query accesses documents already cached by a previous query, it will be faster.

Also, in MongoDB the aggregation pipeline uses indexes only at the beginning of the pipeline, if at all. For example, the $match and $sort stages can use indexes when they come first. In your case $match is the first pipeline stage, so that's a win.
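
You can confirm which plan the leading $match gets by explaining the pipeline (the explain() helper works on aggregate in recent shell versions; on older servers pass { explain: true } as an option instead):

db.template3.explain("executionStats").aggregate([
    { $match: { "@": { $gte: ISODate("2017-01-01T00:00:00Z"),
                       $lte: ISODate("2017-01-01T23:59:59Z") } } },
    { $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1 } } }
]);
// Look for an IXSCAN on { @: 1 } in the winning plan.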

Are the $lte and $gte operators slow only on date ranges, or are they equally slow on numeric comparisons?

In MongoDB, data is stored as BSON, so dates are basically numbers when they are compared. There is no difference.
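
You can see this in the shell: an ISODate is a wrapper around a 64-bit millisecond count, and that number is what index comparisons operate on:

var d = ISODate("2016-10-01T06:03:00.000Z");
d.getTime();    // 1475301780000, the value actually compared in the index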

Since bucketing gives a faster response, how do I use bucketing for range queries? I can make multiple aggregate calls, one per time bucket, to support range queries, but that increases round-trip time. Any suggestions?

Although I haven't tested it, I really doubt the time_bucket approach will give a faster response. Since created_at always increases, entries are appended to the end of the index anyway, without the time_bucket. Furthermore, an index created on an array is comparatively huge compared to one on a simple date field. Won't that cause problems fitting the index in RAM?

Using a time_bucket makes sense when you apply some function to the date field before matching, which would otherwise make the existing index on the date useless. For example, if you extract only the year from the date field before matching, the index on the date cannot be used.
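
As a sketch (assuming MongoDB 3.6+ for $expr; the second form is what you want):

// Anti-pattern: computing on the field prevents the index on "@" from being used.
db.template3.aggregate([
    { $match: { $expr: { $eq: [ { $year: "$@" }, 2017 ] } } }   // collection scan
]);

// Index-friendly: express the same filter as a range on the indexed date.
db.template3.aggregate([
    { $match: { "@": { $gte: ISODate("2017-01-01T00:00:00Z"),
                       $lt:  ISODate("2018-01-01T00:00:00Z") } } }
]);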

It's always better to cast your parameters to match the datatype in database instead of the other way around.
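
In this schema that means querying the bucket fields with the type they are stored as; BSON equality does not match across types, so a numeric 170101 would find nothing in a string field:

db.template3.find({ "@d": "170101" });    // matches: "@d" is stored as a string
db.template3.find({ "@d": 170101 });      // matches nothing
// After converting "@h" to NumberInt, switch the parameters too:
db.template3.find({ "@h": { $gte: NumberInt(17020511), $lte: NumberInt(17030511) } });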

Is it possible to use $and across two $match stages in an aggregation query to support range bucketing? Aggregation accepts multiple $match stages, but the output of the first $match is fed into the second $match; what I want is to union the results of the individual $match stages and pass them to the next pipeline stage.

Yes, it's possible. For $and, you simply specify all your filters separated by commas in a single $match stage. For $or, use the $or operator.

If you have two consecutive $match stages, MongoDB coalesces them into one. So you don't need to worry about combining the results of multiple $match stages.
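
In other words, these two pipelines are equivalent; the optimizer folds consecutive $match stages into one:

db.template3.aggregate([
    { $match: { "st": "i" } },
    { $match: { "@d": "170101" } }
]);

// ...is rewritten by the optimizer to:
db.template3.aggregate([
    { $match: { "st": "i", "@d": "170101" } }    // implicit $and
]);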

Now your Optimization Points

I felt more time is taken because of reading non-indexed fields, hence docsExamined: 106888.

Yes, covered queries are much faster.

There is some improvement when I use $project before $group.

If $project reduces the size of the documents entering the $group stage, then yes, that's true.

I have changed the "@h" data type from String to Integer (NumberInt); I think it will improve some more.

It's not necessarily true, but it's generally the case. You may check this answer.
