Based on the aggregation pipeline docs,
"any single aggregation operation consumes more than 10 percent of system RAM, the operation will produce an error."
- http://docs.mongodb.org/manual/core/aggregation-pipeline-limits/
Is there any way of increasing this limit? I have also set allowDiskUse: true (so the error is no longer an issue), but would like to use more RAM to improve performance.
Background:
I am running a large aggregation job in MongoDB over about 100 million entries. It is basically a massive call to $group to merge the entries based on a key.
I am using the dev release of MongoDB, v2.6.0-rc2 (3/21/2014).
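For reference, the call looks roughly like this; the collection name (entries), the key field, and the accumulators are placeholders for my actual schema:

db.entries.aggregate([
    { "$group": {
        "_id": "$key",                      // the key the entries are merged on
        "count": { "$sum": 1 },             // example accumulator
        "values": { "$push": "$value" }     // example accumulator
    }}
], { "allowDiskUse": true })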
Well, no, there is no setting, and if you really think about it there is good reason for this. If you first consider what aggregate is doing and what MongoDB does in general, it should become clear.
This is what "should" be at the "head" of any sensible aggregation pipeline:
db.collection.aggregate([
    { "$match": { /* Something here */ } },
    // ...the rest of the pipeline
])
And these are the reasons:
It makes good sense to try to reduce the working set that you are operating on in any operation.
This is also the only time you get the opportunity to use an index to aid in searching the selection, which is always better than a collection scan (see the index sketch after this list).
Even though there is a built-in "optimizer" that looks for such things as "projections" limiting the "selected" fields, the best scrutineer of working set size is to only work on the valid records. Later-stage matches are not "optimized" in this way (see point 1).
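To make that concrete, here is a minimal sketch; the status field and its value are hypothetical, and the point is simply that the leading $match can use an index while later stages cannot:

// Hypothetical field; an index here lets the leading $match avoid a collection scan
db.collection.ensureIndex({ "status": 1 })

db.collection.aggregate([
    { "$match": { "status": "active" } },               // index-assisted, shrinks the working set
    { "$group": { "_id": "$key", "count": { "$sum": 1 } } }
])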
The next thing to consider is the general behavior of MongoDB. What the server process wants to do is "consume" as much of the available machine memory as it can in order to hold the "working set" data (collections and/or indexes), so it can "work" on that data in the most efficient way.
So it really is in the "best interests" of the database engine to "spend" most of its memory allocation in this way. That way, both your "aggregate" job and all of the other concurrent processes have access to the "working data" in the memory space.
It is therefore "not optimal" for MongoDB to "steal" this memory allocation away from the other concurrent operations just to service your running aggregation operation.
In the "programming to hardware requirements" terms, well you are aware that future releases allow the aggregation pipeline to implement "disk use" in order to allow larger processing. You can always implement SSD's or other fast storage technologies. And of course "10%" of RAM is subjective to the amount of RAM that is installed in a system. So you can always increase that.
The upshot of this is that MongoDB's actual job is being a "concurrent datastore", and it does that well. What it is not is a specific "aggregation job-runner", and it should not be treated as such.
So either "break-up" your workloads, or increase your hardware spec, or simply switch the large "task running" activity to something that does focus on the running job such as a Hadoop-style "mapReduce", and leave MongoDB to it's job of serving the data.
Or of course, change your design to simply "pre-aggregate" the required data somewhere "on write".
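As a rough sketch of that idea (the summary collection and the key/value field names are made up), an upsert with $inc on every write keeps the running totals current, so the big $group never has to run at read time:

// Hypothetical pre-aggregation on write; doc is the incoming document being written
db.summary.update(
    { "_id": doc.key },
    { "$inc": { "count": 1, "total": doc.value } },
    { "upsert": true }
)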
As the saying goes, "Horses for courses", or use your tools for what they were designed for.