Range support over time-bucketed analytics


I read this and tried experimenting with it.

Below is my schema.

SCHEMA:

{
    "s"  : "CB",
    "c"  : "REQ_RCV",
    "e"  : "sms_click",
    "st" : "i",
    "b"  : "2",
    "a"  : "1",
    "u"  : "b1_h1_d1_m1_user_2",
    "c#" : "b1_h1_d1_m1_cr-2",
    "@"  : ISODate("2016-10-01T06:03:00.000Z"), //Indexed
    "@h" : "16100106", //hourly bucket
    "@d" : "161001",                            //Indexed
    "@m" : "1610"
}

And below are the log entries showing the plan summaries for the two queries:

> 2017-01-22T13:43:47.764+0530 I COMMAND  [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
> aggregate: "template3", pipeline: [ { $match: { @: { $gte: new
> Date(1483228800000), $lte: new Date(1483315199000) } } }, { $group: {
> _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0 } } }, { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s", TIME:
> "$_id.HOURLY", count: 1.0 } }, { $sort: { SERVICE: 1.0, BUCKET: 1.0,
> TIME: 1.0 } } ], cursor: {} } planSummary: IXSCAN { @: 1.0 }
> keysExamined:106888 docsExamined:106888 hasSortStage:1
> cursorExhausted:1 numYields:925 nreturned:96 reslen:7095 locks:{
> Global: { acquireCount: { r: 1860 } }, Database: { acquireCount: { r:
> 930 } }, Collection: { acquireCount: { r: 929 } } }
> protocol:op_command **3499ms**


> 2017-01-22T13:44:24.825+0530 I COMMAND  [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
> aggregate: "template3", pipeline: [ { $match: { @d: "170101" } }, {
> $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0
> } } }, { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s",
> TIME: "$_id.HOURLY", count: 1.0 } }, { $sort: { SERVICE: 1.0, BUCKET:
> 1.0, TIME: 1.0 } } ], cursor: {} } planSummary: IXSCAN { @d: 1.0 } keysExamined:106888 docsExamined:106888 hasSortStage:1
> cursorExhausted:1 numYields:865 nreturned:96 reslen:7095 locks:{
> Global: { acquireCount: { r: 1740 } }, Database: { acquireCount: { r:
> 870 } }, Collection: { acquireCount: { r: 869 } } }
> protocol:op_command **1294ms**

Questions:

  1. Though both queries examined the same number of documents, why is there a time difference in the output?
  2. Are the $lte and $gte operators slow only on date ranges, or are they equally slow on numeric comparisons?
  3. Since bucketing gives a faster response, how do I use bucketing for range queries? I can make multiple aggregate calls, one per time bucket, to support range queries, but that increases round-trip time. Any suggestions?
  4. Is it possible to use $and across two $match stages in an aggregation query to support range bucketing? Aggregation accepts multiple $match stages, but the output of the first $match is fed into the second $match; what I want is to union the results of the individual $match stages and pass them to the next pipeline stage.

Possible answer for Q4:

db.template3.aggregate([ 
    {
        $match: {
            $or: [
                {"@d":"170301"},
                {"@d":"170306"}, 
                {"@d":"170202"}, 
                {"@d":"170303"},
                {"@d":"170304"}, 
                {"@d":"170305"}
            ]
        }
    },
    { $project: { _id: 0, "b": 1, "s": 1, "@h": 1 } }, 
    {
        $group: {
            _id: {"b": "$b", "HOURLY": "$@h", "s": "$s" }, 
            count: { $sum: 1 } 
        }
    },  
    { 
        $project: {
            _id: 0, 
            "BUCKET": "$_id.b",
            "SERVICE": "$_id.s",
            "TIME": "$_id.HOURLY", 
            count: 1
        }
    },         
    { $sort: { "SERVICE": 1, "BUCKET": 1, "TIME": 1 } } 
]);

With this approach we can mix bucket granularities (daily and monthly), and each $or clause will still use its own index. See https://docs.mongodb.com/manual/reference/operator/query/or/#or-clauses-and-indexes.
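
Since "@d" is indexed (per the schema above), you can sanity-check that the $or is answered from that index rather than a collection scan:

// Expect the winning plan to show an IXSCAN on { "@d": 1 } with one bound
// per listed day (the server may also rewrite the $or into an $in).
db.template3.find({ $or: [ { "@d": "170301" }, { "@d": "170306" } ] }).explain();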

Sample query:

db.template3.aggregate([
    { $match: { "@h": { $gte: 17020511, $lte: 17030511 }, "st": "i" } },
    { $project: { _id: 0, "@h": 1, "c": 1, "@m": 1 } },
    { $group: { _id: { "HOURLY": "$@h", "c": "$c" }, count: { $sum: 1 } } },
    { $project: { _id: 0, "COUNTER": "$_id.c", "TIME": "$_id.HOURLY", count: 1 } },
    { $sort: { "COUNTER": 1, "TIME": 1 } }
]);

Output:

{ "count" : 2255, "COUNTER" : "REQ_RCVD", "TIME" : 17020511 }
{ "count" : 28888, "COUNTER" : "REQ_RCVD", "TIME" : 17020600 }
{ "count" : 37613, "COUNTER" : "REQ_RCVD", "TIME" : 17020601 }
{ "count" : 6723, "COUNTER" : "REQ_RCVD", "TIME" : 17020602 }
{ "count" : 14057, "COUNTER" : "REQ_RCVD", "TIME" : 17020603 }
{ "count" : 12405, "COUNTER" : "REQ_RCVD", "TIME" : 17020604 }
{ "count" : 2392, "COUNTER" : "REQ_RCVD", "TIME" : 17020611 }
{ "count" : 28784, "COUNTER" : "REQ_RCVD", "TIME" : 17020700 }
{ "count" : 37494, "COUNTER" : "REQ_RCVD", "TIME" : 17020701 }
{ "count" : 6697, "COUNTER" : "REQ_RCVD", "TIME" : 17020702 }
{ "count" : 13930, "COUNTER" : "REQ_RCVD", "TIME" : 17020703 }
{ "count" : 12493, "COUNTER" : "REQ_RCVD", "TIME" : 17020704 }
{ "count" : 2225, "COUNTER" : "REQ_RCVD", "TIME" : 17020711 }
{ "count" : 28821, "COUNTER" : "REQ_RCVD", "TIME" : 17020800 }
{ "count" : 37949, "COUNTER" : "REQ_RCVD", "TIME" : 17020801 }
{ "count" : 6676, "COUNTER" : "REQ_RCVD", "TIME" : 17020802 }
{ "count" : 14039, "COUNTER" : "REQ_RCVD", "TIME" : 17020803 }
{ "count" : 12349, "COUNTER" : "REQ_RCVD", "TIME" : 17020804 }
{ "count" : 2332, "COUNTER" : "REQ_RCVD", "TIME" : 17020811 }
{ "count" : 28379, "COUNTER" : "REQ_RCVD", "TIME" : 17020900 }

OPTIMIZATION

  1. I suspect the extra time is taken by reading non-indexed fields, hence docsExamined: 106888 (see the index sketch after this list).

  2. There is a noticeable improvement when I use $project before $group.

  3. I changed the "@h" data type from String to Integer (NumberInt); I think it will improve things some more.

1 Answer

Let's go through your questions one by one:

Though both queries examined the same number of documents, why is there a time difference in the output?

Looking at performance metrics from a single execution is not a reliable way to measure; take the average over several executions before concluding, because several factors are at play. That said, MongoDB caches the most frequently used documents in memory and keeps them there unless it has to yield memory to other documents. So if a query accesses documents already cached by a previous query, it will be faster.

Also, in MongoDB the aggregation pipeline uses indexes only at the beginning of the pipeline, if at all. For example, the $match and $sort stages can use indexes when they come first. In your case $match is the first pipeline stage, so that's a win.
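
You can confirm which plan the leading $match gets by explaining the pipeline (the explain() helper works on aggregate in recent shell versions; on older servers pass { explain: true } as an option instead):

db.template3.explain("executionStats").aggregate([
    { $match: { "@": { $gte: ISODate("2017-01-01T00:00:00Z"),
                       $lte: ISODate("2017-01-01T23:59:59Z") } } },
    { $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1 } } }
]);
// Look for an IXSCAN on { @: 1 } in the winning plan.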

Are the $lte and $gte operators slow only on date ranges, or are they equally slow on numeric comparisons?

In MongoDB, data is stored as BSON, so dates are basically numbers when they are compared. There is no difference.
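
You can see this in the shell: an ISODate is a wrapper around a 64-bit millisecond count, and that number is what index comparisons operate on:

var d = ISODate("2016-10-01T06:03:00.000Z");
d.getTime();    // 1475301780000, the value actually compared in the index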

Since bucketing gives a faster response, how do I use bucketing for range queries? I can make multiple aggregate calls, one per time bucket, to support range queries, but that increases round-trip time. Any suggestions?

Although I haven't tested it, I really doubt the time_bucket approach will give a faster response. Since created_at always increases, entries are appended to the end of the index anyway, without the time_bucket. Furthermore, an index created on an array is comparatively huge compared to one on a simple date field. Won't that cause problems fitting the index in RAM?

Using a time_bucket makes sense when you apply some function to the date field before matching, which would otherwise make the existing index on the date useless. For example, if you extract only the year from the date field before matching, the index on the date cannot be used.
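
As a sketch (assuming MongoDB 3.6+ for $expr; the second form is what you want):

// Anti-pattern: computing on the field prevents the index on "@" from being used.
db.template3.aggregate([
    { $match: { $expr: { $eq: [ { $year: "$@" }, 2017 ] } } }   // collection scan
]);

// Index-friendly: express the same filter as a range on the indexed date.
db.template3.aggregate([
    { $match: { "@": { $gte: ISODate("2017-01-01T00:00:00Z"),
                       $lt:  ISODate("2018-01-01T00:00:00Z") } } }
]);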

It's always better to cast your parameters to match the datatype in database instead of the other way around.
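
In this schema that means querying the bucket fields with the type they are stored as; BSON equality does not match across types, so a numeric 170101 would find nothing in a string field:

db.template3.find({ "@d": "170101" });    // matches: "@d" is stored as a string
db.template3.find({ "@d": 170101 });      // matches nothing
// After converting "@h" to NumberInt, switch the parameters too:
db.template3.find({ "@h": { $gte: NumberInt(17020511), $lte: NumberInt(17030511) } });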

Is it possible to use $and across two $match stages in an aggregation query to support range bucketing? Aggregation accepts multiple $match stages, but the output of the first $match is fed into the second $match; what I want is to union the results of the individual $match stages and pass them to the next pipeline stage.

Yes, it's possible. For $and, you simply specify all your filters separated by commas in a single $match stage. For $or, use the $or operator.

If you have two consecutive $match stages, MongoDB coalesces them into one. So you don't need to worry about combining the results of multiple $match stages.
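
In other words, these two pipelines are equivalent; the optimizer folds consecutive $match stages into one:

db.template3.aggregate([
    { $match: { "st": "i" } },
    { $match: { "@d": "170101" } }
]);

// ...is rewritten by the optimizer to:
db.template3.aggregate([
    { $match: { "st": "i", "@d": "170101" } }    // implicit $and
]);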

Now your Optimization Points

I felt more time is taken because of reading non-indexed fields, hence docsExamined: 106888.

Yes, covered queries are much faster.

There is some improvement when I use $project before $group.

If $project reduces the size of the documents entering the $group stage, then yes, that's true.

I have changed the "@h" data type from String to Integer (NumberInt); I think it will improve some more.

It's not necessarily true, but it's generally the case. You may check this answer.
