I have an algorithm that calculates the percentile(85)
with Apache Commons
of a series of values (12 values), for a later evaluation with a threshold to make a decision. The result is similar to the one given by Excel, but not equal, and sometimes this is critical for my application because with excel the result doesn't pass the threshold and with Apache Commons Math in Java
it does, so I get different outputs.
Here it is an example: Internet traffic (Mbps) every 2 hours
32,7076813360000000 41,2580429776000000 45,4453940200000000 48,8044409456000000 46,7462847936000000 49,8028100056000000 54,3719451144000000 41,9708134600000000 29,4371963240000000 22,4667255616000000 20,0388452248000000 28,7807757104000000
After dividing by 1000 Mb (the capacity of the cable) I calculate the percentil(85) of the Occupation:
Excel: 0,049153870117
Apache Commons Math: 0.05003126676104001
I have found that it is possible to change the implementation of the percentile (it does not exist an official one) with setPercentileImpl()
, but I couldn't find any example of how to do this, or the Excel algorithm (which is the one I was told to achieve).
Any help about this will be welcomed.
Thank you.
The difference is subtle and due to assumptions. It is easiest to explain with the 3 element case. Suppose you have three elements(N=3) a=x[0] < b=x[1] < c=x[2]
. Both the Apache and the Excel method say that element b is the 50th percentile (the median). However they differ for a
and c
.
The Apache method (and the method referenced by the NIST page) say a
is the 25th percentile and c
is the 75% percentile, because it divides the space up into N+1 blocks, that is, into quarters.
The Excel method says that a
is the 0th percentile and c
the 100th percentile, as the space is divided into N-1 blocks, that is, in half.
Because of this, if you want the Excel method and you don't want to code it yourself, you could just remove the smallest and the largest element from your array, and call the Apache method - it should give you exactly the same result except at percentiles beyond the end points.
If you want to code it yourself, an easy way is given below. Be aware of these issues:
- this sorts the array (so changes it)
- this takes O(N log(N)) time due to the sorting. The Apache method uses a fast selection algorithm so takes O(N) time (google "quickselect" if you want to know more)
Code (not tested or even compiled, but should give you an idea).
// warning - modifies data
double excelPercentile(double [] data, double percentile) { array
Arrays.sort(data);
double index = percentile*(data.length-1);
int lower = (int)Math.floor(index);
if(lower<0) { // should never happen, but be defensive
return data[0];
}
if(lower>=data.length-1) { // only in 100 percentile case, but be defensive
return data[data.length-1);
}
double fraction = index-lower;
// linear interpolation
double result=data[lower] + fraction*(data[lower+1]-data[lower]);
return result;
}
The solution was creating a class PercentileExcel which is almost a copy of percentile from commons method except for a small change on how to caculate the position:
pos=(1+p*(n-1))/100;
Then you need to add this line to the code in order to use the new class for percentile:
setPercentileImpl(PercentileExcel);
There is no unique definition of a percentile computed from a data set. See the Wikipedia page for the most common definitions in use.
Class org.apache.commons.math3.stat.descriptive.rank.Percentile
already supports Excel style interpolation, you just need to enable it with EstimationType.R_7
public class PercentileExcel extends Percentile {
public PercentileExcel() throws MathIllegalArgumentException {
super(50.0,
EstimationType.R_7, // use excel style interpolation
NaNStrategy.REMOVED,
new KthSelector(new MedianOf3PivotingStrategy()));
}
}