I've inherited some old Stata code (Stata11) that uses the xtile
function to categorize observations in a vector by their quantiles (in this case, just the standard 5 quintiles, 20%, 40%, 60%, 80%, 100%).
I'm trying to replicate a piece of the code in Python and I am using the SciPy.stats.mstats function mquantiles()
for the computation.
As near as I can tell from Stata documentation and searching online, the Stata xtile
method tries to invert the empirical CDF of the data, and uses the equal-weighted average of all observations for which the CDF is flat to make the cutpoint. This seems like a very poor way to categorize quantiles, but it is what it is and I am sure there are cases where this is the right thing to do.
My question is how to make mquantiles()
produce the same sort of breaking convention. I noticed that this function has two parameters, alphap
and betap
(the documentation calls them alpha
and beta
but you need the extra 'p' to get it to work, at least I do... I get an error if I just use 'alpha' and 'beta' with Python 2.7.1 and SciPy 0.10.0). But even in the SciPy docs, I can't see whether there's a combination of these parameters that produces the mean over flat CDF ranges.
I see what looks like the option to compute as the median or mode of this range, but not mean (it's also not clear if these SciPy median/mode options with alpha and beta are computed as the median/mode of the observations or of the range that would produce the flat CDF value.)
Any help disambiguating these different options and finding some documentation that helps me recreate the Stata convention in Python would be great. Please refrain from answers that just say "write your own quantile function." Firstly, that doesn't help me understand the conventions of either Stata or SciPy, and secondly, given these numerical libraries, writing my own quantile function should be a last resort. I can certainly do it, but it would be bad all around if I need to.