I am new to MongoDB aggregation and wondered whether there is a way to calculate the median using the MongoDB aggregation framework?
Cheers,
Lewis
The median is somewhat tricky to compute in the general case, because it involves sorting the whole data set or using a recursion whose depth is also proportional to the data set size. That may be why many databases don't have a median operator out of the box (MySQL doesn't have one, either).
The simplest way to compute the median would be with these two statements (assuming the attribute on which we want to compute the median is called a and we want it over all documents in the collection, coll):
// For an even count this returns the lower of the two middle documents
count = db.coll.count();
db.coll.find().sort( {"a":1} ).skip(Math.floor((count - 1) / 2)).limit(1);
This is equivalent to what people suggest for MySQL.
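When the count is even, the median is usually taken as the average of the two middle values, which a single skip/limit(1) cannot return. A minimal sketch of how the skip/limit window could be computed for both cases follows; the helper names medianWindow and medianFromWindow are made up for illustration and are not part of the MongoDB shell.

```javascript
// Compute the skip/limit window covering the middle of `count` sorted documents.
// Odd count: one document. Even count: the two middle documents.
function medianWindow(count) {
  if (!Number.isInteger(count) || count <= 0) {
    throw new RangeError("count must be a positive integer");
  }
  const skip = Math.floor((count - 1) / 2);
  const limit = count % 2 === 0 ? 2 : 1;
  return { skip: skip, limit: limit };
}

// Average the one or two values that fall inside the window.
function medianFromWindow(values) {
  const sum = values.reduce(function (acc, v) { return acc + v; }, 0);
  return sum / values.length;
}

// Possible use in the shell (assumes the collection `coll` and field `a` from above):
//   const w = medianWindow(db.coll.count());
//   const docs = db.coll.find().sort({ a: 1 }).skip(w.skip).limit(w.limit).toArray();
//   medianFromWindow(docs.map(function (d) { return d.a; }));
```

This still costs a count plus a sorted scan, so an index on the field is worth having.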
It's possible to do it in one shot with the aggregation framework.
Sort => push the sorted values into an array => get the size of the array => divide the size by two => take the integer part of the division (left side of the median) => add 1 to the left side (right side) => get the array elements at the left side and the right side => average the two elements
Here is a sample using Spring's Java MongoTemplate:
The model is a list of books, each with the login of its author ("owner"); the objective is to get the median number of books per user:
// group, sort, project and newAggregation are static helpers on Spring Data MongoDB's Aggregation class
import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;
import org.springframework.data.domain.Sort.Direction;
import org.springframework.data.mongodb.core.aggregation.*;

// Count the books per owner, sort the counts, then collect them into one array
GroupOperation countByBookOwner = group("owner").count().as("nbBooks");
SortOperation sortByCount = sort(Direction.ASC, "nbBooks");
GroupOperation putInArray = group().push("nbBooks").as("nbBooksArray");
// Derive the two middle positions from the array size
ProjectionOperation getSizeOfArray = project("nbBooksArray").and("nbBooksArray").size().as("size");
ProjectionOperation divideSizeByTwo = project("nbBooksArray").and("size").divide(2).as("middleFloat");
ProjectionOperation getIntValueOfDivisionForBornLeft = project("middleFloat", "nbBooksArray")
        .and("middleFloat").project("trunc").as("beginMiddle");
ProjectionOperation add1ToBornLeftToGetBornRight = project("beginMiddle", "middleFloat", "nbBooksArray")
        .and("beginMiddle").project("add", 1).as("endMiddle");
// Pick the two elements and average them
ProjectionOperation arrayElementAt = project("beginMiddle", "endMiddle", "middleFloat", "nbBooksArray")
        .and("nbBooksArray").project("arrayElemAt", "$beginMiddle").as("beginValue")
        .and("nbBooksArray").project("arrayElemAt", "$endMiddle").as("endValue");
ProjectionOperation averageForMedian = project("beginMiddle", "endMiddle", "middleFloat", "nbBooksArray",
        "beginValue", "endValue").and("beginValue").project("avg", "$endValue").as("median");
Aggregation aggregation = newAggregation(countByBookOwner, sortByCount, putInArray, getSizeOfArray,
        divideSizeByTwo, getIntValueOfDivisionForBornLeft, add1ToBornLeftToGetBornRight, arrayElementAt,
        averageForMedian);
long time = System.currentTimeMillis();
// MedianContainer is the caller's own result class with a "median" property
AggregationResults<MedianContainer> groupResults = mongoTemplate.aggregate(aggregation, "book",
        MedianContainer.class);
And here is the resulting aggregation:
{
  "aggregate": "book",
  "pipeline": [
    {
      "$group": {
        "_id": "$owner",
        "nbBooks": { "$sum": 1 }
      }
    },
    {
      "$sort": { "nbBooks": 1 }
    },
    {
      "$group": {
        "_id": null,
        "nbBooksArray": { "$push": "$nbBooks" }
      }
    },
    {
      "$project": {
        "nbBooksArray": 1,
        "size": { "$size": ["$nbBooksArray"] }
      }
    },
    {
      "$project": {
        "nbBooksArray": 1,
        "middleFloat": { "$divide": ["$size", 2] }
      }
    },
    {
      "$project": {
        "middleFloat": 1,
        "nbBooksArray": 1,
        "beginMiddle": { "$trunc": ["$middleFloat"] }
      }
    },
    {
      "$project": {
        "beginMiddle": 1,
        "middleFloat": 1,
        "nbBooksArray": 1,
        "endMiddle": { "$add": ["$beginMiddle", 1] }
      }
    },
    {
      "$project": {
        "beginMiddle": 1,
        "endMiddle": 1,
        "middleFloat": 1,
        "nbBooksArray": 1,
        "beginValue": { "$arrayElemAt": ["$nbBooksArray", "$beginMiddle"] },
        "endValue": { "$arrayElemAt": ["$nbBooksArray", "$endMiddle"] }
      }
    },
    {
      "$project": {
        "beginMiddle": 1,
        "endMiddle": 1,
        "middleFloat": 1,
        "nbBooksArray": 1,
        "beginValue": 1,
        "endValue": 1,
        "median": { "$avg": ["$beginValue", "$endValue"] }
      }
    }
  ]
}
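One detail worth double-checking in the pipeline above: $arrayElemAt uses zero-based indexes, so trunc(size / 2) and trunc(size / 2) + 1 appear to land one position to the right of the middle, and the second index runs past the end of a single-element array. The following is a minimal sketch, not the author's code, of a shorter variant that uses trunc((size - 1) / 2) and trunc(size / 2) instead. The names medianIndexes, medianPipeline, lowerIndex and upperIndex are made up for illustration, and the helper function only mirrors the index arithmetic so it can be checked without a server.

```javascript
// Zero-based positions of the middle element(s) in a sorted array of length n.
// For odd n both positions coincide; for even n they are the two middle slots.
function medianIndexes(n) {
  return {
    lower: Math.trunc((n - 1) / 2),
    upper: Math.trunc(n / 2),
  };
}

// The same arithmetic expressed as aggregation stages for the "book" collection.
// Assumes MongoDB 3.2 or later ($trunc, $arrayElemAt, $avg inside $project).
const medianPipeline = [
  { $group: { _id: "$owner", nbBooks: { $sum: 1 } } },
  { $sort: { nbBooks: 1 } },
  { $group: { _id: null, nbBooksArray: { $push: "$nbBooks" } } },
  {
    $project: {
      nbBooksArray: 1,
      lowerIndex: {
        $trunc: { $divide: [{ $subtract: [{ $size: "$nbBooksArray" }, 1] }, 2] },
      },
      upperIndex: { $trunc: { $divide: [{ $size: "$nbBooksArray" }, 2] } },
    },
  },
  {
    $project: {
      median: {
        $avg: [
          { $arrayElemAt: ["$nbBooksArray", "$lowerIndex"] },
          { $arrayElemAt: ["$nbBooksArray", "$upperIndex"] },
        ],
      },
    },
  },
];

// Possible use in the shell: db.book.aggregate(medianPipeline)
```

Averaging the same element with itself for odd sizes keeps the pipeline free of a parity branch.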
The aggregation framework doesn't support median out of the box (a built-in $median operator only arrived in MongoDB 7.0), so on older versions you will have to write something on your own.
I would recommend doing this at the application level. Retrieve all your documents with a normal find(), sort the result set (either on the database with the cursor's .sort() function or in the application, your decision), and then take the element at index size / 2.
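As a rough sketch of the application-level approach, assuming the values have already been fetched into a plain array of numbers: the function name medianOf is made up for illustration. Note that JavaScript's default sort compares strings, so a numeric comparator is needed, and that index size / 2 has to be rounded down, which picks the upper of the two middle values when the size is even.

```javascript
// Sort a copy of the values numerically and return the element at index size / 2.
function medianOf(values) {
  if (values.length === 0) {
    throw new RangeError("cannot take the median of an empty array");
  }
  const sorted = values.slice().sort(function (x, y) { return x - y; });
  return sorted[Math.floor(sorted.length / 2)];
}

// Possible use in the shell (collection `coll` and field `a` are placeholders):
//   medianOf(db.coll.find({}, { a: 1 }).toArray().map(function (d) { return d.a; }));
```

If the cursor is already sorted on the database side, the in-application sort can be dropped and only the index lookup remains.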
If you really want to do it at the database level, you could do it with map-reduce. The map function would emit the key and an array with a single value: the value you want the median of. The reduce function would just concatenate the arrays it receives, so each key ends up with an array of all its values. The finalize function would then compute the median of that array, again by sorting the array and taking the element at index size / 2.
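The three functions described above could look roughly like this. It is a sketch under assumptions: owner and a are placeholder field names, the arrays are wrapped in an object because, as far as I know, reduce is not allowed to return a bare array, and simulateMapReduce is only a local stand-in for the server so the functions can be tried without MongoDB.

```javascript
// Map: emit the key plus a one-element array, wrapped in an object.
function mapFn() {
  emit(this.owner, { values: [this.a] });
}

// Reduce: concatenate the partial arrays. Reduce can be re-run on its own
// output, so it returns the same shape that map emits.
function reduceFn(key, partials) {
  let values = [];
  partials.forEach(function (p) { values = values.concat(p.values); });
  return { values: values };
}

// Finalize: sort numerically and take the element at index size / 2 (rounded down).
function finalizeFn(key, reduced) {
  const sorted = reduced.values.slice().sort(function (x, y) { return x - y; });
  return sorted[Math.floor(sorted.length / 2)];
}

// Local stand-in for the server: runs map over plain objects, reduces keys that
// received more than one emit, then finalizes every key.
function simulateMapReduce(docs) {
  const emitted = new Map();
  globalThis.emit = function (key, value) {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  docs.forEach(function (doc) { mapFn.call(doc); });
  delete globalThis.emit;
  const result = {};
  emitted.forEach(function (values, key) {
    const reduced = values.length > 1 ? reduceFn(key, values) : values[0];
    result[key] = finalizeFn(key, reduced);
  });
  return result;
}

// Possible use in the shell:
//   db.coll.mapReduce(mapFn, reduceFn, { out: { inline: 1 }, finalize: finalizeFn });
```

Keep in mind that every value for a key ends up in one document, so this only works while that array stays under the document size limit, and that map-reduce has been deprecated since MongoDB 5.0.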