I am looking for a java library / implementation which supports the calculation of the inverse cumulative distribution function for the beta distribution (aka estimation of quantiles) with reasonable precision.
Of course I have tried apache commons math, but in version 3 there still seem to be some issues with the precision. Below the problem which lead to this question is described extensively.
Suppose I want to calculate the credible interval of a beta distribution with a lot of trials. In apache commons math ...
final int trials = 161750;
final int successes = 10007;
final double alpha = 0.05d;
// the supplied precision is the default precision according to the source code
BetaDistribution betaDist = new BetaDistribution(successes + 1, trials - successes + 1, 1e-9);
System.out.println("2.5 percentile :" + betaDist.inverseCumulativeProbability(alpha / 2d));
System.out.println("mean: " + betaDist.getNumericalMean());
System.out.println("median: " + betaDist.inverseCumulativeProbability(0.5));
System.out.println("97.5 percentile :" + betaDist.inverseCumulativeProbability(1 - alpha / 2d));
which delivers
2.5 percentile :0.062030402074808505
mean: 0.06187249616697166
median: 0.062030258659508855
97.5 percentile :0.06305170793994147
The issues is that the 2.5 percentile and median are the same meanwhile both greater than the mean.
In comparison, the R-package binom delivers
binom.confint(10007+1,161750+2,methods=c("agresti-coull","exact","wilson"))
method x n mean lower upper
1 agresti-coull 10008 161752 0.0618725 0.06070873 0.06305707
2 exact 10008 161752 0.0618725 0.06070317 0.06305756
3 wilson 10008 161752 0.0618725 0.06070877 0.06305703
and the R-package stats
qbeta(c(0.025,0.975),10007+1,161750-10007+1)
[1] 0.06070355 0.06305171
To second the results from R, here is what Wolfram Alpha told me
- InverseBetaRegularized[0.025,10007+1,161750-10007+1] => 0.06070354631...
- InverseBetaRegularized[0.975,10007+1,161750-10007+1] => 0.06305170794...
Final note on the requirements:
- I need to run a lot of these calculations. Hence any solution should not take longer than 1s (which is still a lot compared to the 41ms of (albeit wrong) apache commons math).
- I am aware that one can use R within java. For reasons I won't elaborate here, this is the last option if anything else (pure java) fails.
Update 21.08.12
It seems that the issue has been fixed or at least improved in 3.1-SNAPSHOT of apache-commons-math. For the usecase above
2.5 percentile :0.06070354581340706
mean: 0.06187249616697166
median: 0.06187069085946604
97.5 percentile :0.06305170793994147
Update 23.02.13
While at first glance this question and it's responses may be too localized, I think that it very well illustrates that some numerical problems cannot be solved (efficiently) with a what-first-comes-to-mind-hacker-approach. So I hope it remains open.
Most probably, this problem cannot be solved generically, since if the graph of the cumulative distribution function is very flat (which it usually will be towards the tails of the distribution), a very high precision on the vertical axis is needed to reach a reasonable precision on the horizontal axis.
Hence it will always be better to use a function calculating the quantiles directly than deriving the quantiles from the cumulative distribution function.
If you do not worry about precision, you can, of course, solve the equation q = F (x) numerically. Since F is increasing, that's not difficult:
Remark: It is not clear to me why precision should be a problem particularly for the beta distribution, since the beta distribution lives on the interval [0;1] and the graph is rather steep towards the ends of the interval.
Second remark: Your calculation of the upper quantile is wrong; it should read
Third edit: Algorithm corrected.
The issue has been fixed in apache commons math 3.1.1
The testcase above delivered
which matches the results from the r-package stats. Extensive application of 3.1-SNAPSHOT + x versions also did not cause any problems.
I have found and tried the library JSci (Version 1.2 27.07.2010)
Code snippet:
which returned
Internally a root-finding-approach is used as suggest by JohnB. One can extend ProbabilityDistribution#inverse to request more precision. Unfortunately, even with tons of iterations (100k) and a requested precision of 10^-10 the algorithm still returns
Now: Whose code is less wrong ? R or JSci ? I'd favor the one with the bigger user base ...