I read this and tried experimenting with it. Below is my schema.
SCHEMA:
{
"s" : "CB",
"c" : "REQ_RCV",
"e" : "sms_click",
"st" : "i",
"b" : "2",
"a" : "1",
"u" : "b1_h1_d1_m1_user_2",
"c#" : "b1_h1_d1_m1_cr-2",
"@" : ISODate("2016-10-01T06:03:00.000Z"), //Indexed
"@h" : "16100106", //hourly bucket
"@d" : "161001", //Indexed
"@m" : "1610"
}
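For context, the `@h`, `@d`, and `@m` fields are just precomputed truncations of the indexed `@` timestamp. A minimal sketch of how such buckets could be derived (the helper name is mine, not from the original code):

```javascript
// Hypothetical helper deriving the three bucket fields from the timestamp.
function buckets(date) {
  const yy = String(date.getUTCFullYear()).slice(-2);          // "16"
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");  // "10"
  const dd = String(date.getUTCDate()).padStart(2, "0");       // "01"
  const hh = String(date.getUTCHours()).padStart(2, "0");      // "06"
  return {
    "@h": yy + mm + dd + hh,  // hourly bucket,  e.g. "16100106"
    "@d": yy + mm + dd,       // daily bucket,   e.g. "161001"
    "@m": yy + mm,            // monthly bucket, e.g. "1610"
  };
}

buckets(new Date("2016-10-01T06:03:00.000Z"));
// → { "@h": "16100106", "@d": "161001", "@m": "1610" }
```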
And below is the explain plan:
> 2017-01-22T13:43:47.764+0530 I COMMAND [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
> aggregate: "template3", pipeline: [ { $match: { @: { $gte: new
> Date(1483228800000), $lte: new Date(1483315199000) } } }, { $group: {
> _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0 } } }, { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s", TIME:
> "$_id.HOURLY", count: 1.0 } }, { $sort: { SERVICE: 1.0, BUCKET: 1.0,
> TIME: 1.0 } } ], cursor: {} } planSummary: IXSCAN { @: 1.0 }
> keysExamined:106888 docsExamined:106888 hasSortStage:1
> cursorExhausted:1 numYields:925 nreturned:96 reslen:7095 locks:{
> Global: { acquireCount: { r: 1860 } }, Database: { acquireCount: { r:
> 930 } }, Collection: { acquireCount: { r: 929 } } }
> protocol:op_command **3499ms**
> 2017-01-22T13:44:24.825+0530 I COMMAND [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
> aggregate: "template3", pipeline: [ { $match: { @d: "170101" } }, {
> $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0
> } } }, { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s",
> TIME: "$_id.HOURLY", count: 1.0 } }, { $sort: { SERVICE: 1.0, BUCKET:
> 1.0, TIME: 1.0 } } ], cursor: {} } planSummary: IXSCAN { @d: 1.0 } keysExamined:106888 docsExamined:106888 hasSortStage:1
> cursorExhausted:1 numYields:865 nreturned:96 reslen:7095 locks:{
> Global: { acquireCount: { r: 1740 } }, Database: { acquireCount: { r:
> 870 } }, Collection: { acquireCount: { r: 869 } } }
> protocol:op_command **1294ms**
Questions:
- Though both queries examined the same number of documents, why is there a time difference in the output?
- Are the `$lte` and `$gte` operators slow only on date ranges, or are they slow on numeric comparisons as well?
- Since bucketing gives a faster response, how do I use bucketing for range queries? I could make one aggregate call per time bucket to cover a range, but that adds round-trip time. Any suggestions?
- Is it possible to use `$and` to combine two `$match` stages in an aggregate query to support range bucketing? Currently an aggregate accepts multiple `$match` stages, but the output of the first `$match` is fed into the second; what I want instead is to add/union the results of the individual `$match` stages and pass them to the next pipeline stage.
Possible answer for Q4:
db.template3.aggregate([
{
$match: {
$or: [
{"@d":"170301"},
{"@d":"170306"},
{"@d":"170202"},
{"@d":"170303"},
{"@d":"170304"},
{"@d":"170305"}
]
}
},
{ $project: { _id: 0, "b": 1, "s": 1, "@h": 1 } },
{
$group: {
_id: {"b": "$b", "HOURLY": "$@h", "s": "$s" },
count: { $sum: 1 }
}
},
{
$project: {
_id: 0,
"BUCKET": "$_id.b",
"SERVICE": "$_id.s",
"TIME": "$_id.HOURLY",
count: 1
}
},
{ $sort: { "SERVICE": 1, "BUCKET": 1, "TIME": 1 } }
]);
With this approach we can mix bucket granularities (daily and monthly), and each `$or` clause can still use its own index. See https://docs.mongodb.com/manual/reference/operator/query/or/#or-clauses-and-indexes.
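Rather than listing each daily bucket by hand, the `$or` clause list for a contiguous range of days can be generated. A sketch (the helper name is my own):

```javascript
// Hypothetical helper: build the {"@d": ...} clauses for a contiguous
// range of days, instead of writing each bucket out by hand.
function dailyBucketClauses(start, end) {
  const clauses = [];
  const day = new Date(start);
  while (day <= end) {
    const yy = String(day.getUTCFullYear()).slice(-2);
    const mm = String(day.getUTCMonth() + 1).padStart(2, "0");
    const dd = String(day.getUTCDate()).padStart(2, "0");
    clauses.push({ "@d": yy + mm + dd });  // e.g. {"@d": "170301"}
    day.setUTCDate(day.getUTCDate() + 1);  // advance one day
  }
  return clauses;
}

// The result plugs straight into the $match stage:
// db.template3.aggregate([{ $match: { $or: dailyBucketClauses(a, b) } }, ...])
```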
Sample query:
db.template3.aggregate([
{$match:{"@h":{$gte : 17020511, $lte : 17030511}, "st":"i"}},
{$project : {"_id":0, "@h":1,"c":1, "@m":1}},
{$group:{_id:{ "HOURLY":"$@h", "c":"$c"}, count:{$sum:1}}},
{$project : {_id:0, "COUNTER":"$_id.c","TIME":"$_id.HOURLY", count:1}},
{$sort:{"COUNTER":1,"TIME":1}}
]);
Output:
{ "count" : 2255, "COUNTER" : "REQ_RCVD", "TIME" : 17020511 }
{ "count" : 28888, "COUNTER" : "REQ_RCVD", "TIME" : 17020600 }
{ "count" : 37613, "COUNTER" : "REQ_RCVD", "TIME" : 17020601 }
{ "count" : 6723, "COUNTER" : "REQ_RCVD", "TIME" : 17020602 }
{ "count" : 14057, "COUNTER" : "REQ_RCVD", "TIME" : 17020603 }
{ "count" : 12405, "COUNTER" : "REQ_RCVD", "TIME" : 17020604 }
{ "count" : 2392, "COUNTER" : "REQ_RCVD", "TIME" : 17020611 }
{ "count" : 28784, "COUNTER" : "REQ_RCVD", "TIME" : 17020700 }
{ "count" : 37494, "COUNTER" : "REQ_RCVD", "TIME" : 17020701 }
{ "count" : 6697, "COUNTER" : "REQ_RCVD", "TIME" : 17020702 }
{ "count" : 13930, "COUNTER" : "REQ_RCVD", "TIME" : 17020703 }
{ "count" : 12493, "COUNTER" : "REQ_RCVD", "TIME" : 17020704 }
{ "count" : 2225, "COUNTER" : "REQ_RCVD", "TIME" : 17020711 }
{ "count" : 28821, "COUNTER" : "REQ_RCVD", "TIME" : 17020800 }
{ "count" : 37949, "COUNTER" : "REQ_RCVD", "TIME" : 17020801 }
{ "count" : 6676, "COUNTER" : "REQ_RCVD", "TIME" : 17020802 }
{ "count" : 14039, "COUNTER" : "REQ_RCVD", "TIME" : 17020803 }
{ "count" : 12349, "COUNTER" : "REQ_RCVD", "TIME" : 17020804 }
{ "count" : 2332, "COUNTER" : "REQ_RCVD", "TIME" : 17020811 }
{ "count" : 28379, "COUNTER" : "REQ_RCVD", "TIME" : 17020900 }
OPTIMIZATION

I suspect the extra time is spent reading non-indexed fields, hence `docsExamined: 106888`. There is a noticeable improvement when I use `$project` before `$group`. I have also changed the `@h` data type from `String` to `Integer` (`NumberInt`), which I expect will improve things a bit more.
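One consequence of that type change, given BSON's type-bracketed comparisons: a numeric `$gte`/`$lte` will no longer match the old string values, so the query parameters must be converted as well. For example:

```javascript
// Bucket strings like "17020511" convert losslessly to integers, and the
// yymmddhh layout keeps chronological order under numeric comparison.
const lo = parseInt("17020511", 10);
const hi = parseInt("17030511", 10);

// After the type change, the range match uses numbers rather than strings:
// db.template3.aggregate([{ $match: { "@h": { $gte: lo, $lte: hi } } }, ...])
```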
Let's take your questions one by one:

Judging performance from a single execution is not how it works; you should average over several runs before drawing conclusions, because several factors are at play. That said, MongoDB caches the most frequently used documents in memory and keeps them there unless it must yield memory to other documents. So if a query accesses documents already cached by a previous query, it will be faster.
Also, a MongoDB aggregation can use indexes only at the beginning of the pipeline, if at all; for example, the `$match` and `$sort` stages can use indexes. In your case `$match` is the first pipeline stage, so that's a win.

In MongoDB, data is stored as BSON, so dates are basically numbers when they are compared. So there is no difference.
Although I haven't tested it, I really doubt the time_bucket approach will give a faster response. Since `created_at` always increases, new index entries are appended at the end even without the time_bucket. Furthermore, an index created on an array is comparatively much larger than one on a simple date field; won't that cause problems fitting the index in RAM?

Using a time_bucket makes sense when you apply some function to the date field before matching. If you extract only the year from the date field before matching, it makes the existing index on the date useless.
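To keep the date index usable, the function can be moved to the query-parameter side instead: express "year 2017" as a range over `@` rather than extracting the year from each document. A hypothetical sketch:

```javascript
// Index-friendly alternative to extracting the year before matching:
// turn "year N" into a half-open range on the indexed date field.
function yearRange(year) {
  return {
    $gte: new Date(Date.UTC(year, 0, 1)),      // first instant of the year
    $lt:  new Date(Date.UTC(year + 1, 0, 1)),  // first instant of the next
  };
}

// db.template3.aggregate([{ $match: { "@": yearRange(2017) } }, ...])
```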
It's always better to cast your query parameters to match the datatype in the database, rather than the other way around.
Yes, it's possible. For `$and`, you just specify all your filters separated by commas in a single `$match` stage; for `$or`, use the `$or` operator.

If you have two `$match` stages one after another, MongoDB combines them into one, so you don't need to worry about adding the results of multiple match stages.

Now your optimization points:
Yes, covered queries are much faster.

If the size of the documents is reduced in the `$group` stage by using `$project` first, then yes, that's true.

It's not necessarily true, but it's generally the case. You may check this answer.