Getting SciPy quantiles to match Stata xtile funct

2019-05-28 21:47发布

问题:

I've inherited some old Stata code (Stata11) that uses the xtile function to categorize observations in a vector by their quantiles (in this case, just the standard 5 quintiles, 20%, 40%, 60%, 80%, 100%).

I'm trying to replicate a piece of the code in Python and I am using the SciPy.stats.mstats function mquantiles() for the computation.

As near as I can tell from Stata documentation and searching online, the Stata xtile method tries to invert the empirical CDF of the data, and uses the equal-weighted average of all observations for which the CDF is flat to make the cutpoint. This seems like a very poor way to categorize quantiles, but it is what it is and I am sure there are cases where this is the right thing to do.

My question is how to make mquantiles() produce the same sort of breaking convention. I noticed that this function has two parameters, alphap and betap (the documentation calls them alpha and beta but you need the extra 'p' to get it to work, at least I do... I get an error if I just use 'alpha' and 'beta' with Python 2.7.1 and SciPy 0.10.0). But even in the SciPy docs, I can't see whether there's a combination of these parameters that produces the mean over flat CDF ranges.

I see what looks like the option to compute as the median or mode of this range, but not mean (it's also not clear if these SciPy median/mode options with alpha and beta are computed as the median/mode of the observations or of the range that would produce the flat CDF value.)

Any help disambiguating these different options and finding some documentation that helps me recreate the Stata convention in Python would be great. Please refrain from answers that just say "write your own quantile function." Firstly, that doesn't help me understand the conventions of either Stata or SciPy, and secondly, given these numerical libraries, writing my own quantile function should be a last resort. I can certainly do it, but it would be bad all around if I need to.

回答1:

The scipy.stats.mquantiles documentation was poor and wrong in places, fixed now so that might be helpful... http://docs.scipy.org/scipy/docs/scipy.stats.mstats_basic.mquantiles/. That process started when you pointed out the alpha/beta, alphap/betap discrepancy. Thank you.

The implementation of mquantiles follow R.

The biggest difference comes from that R has 9 discrete types, where because scipy.stats.mquantiles calculates 'm' from 'alphap' and 'betap', scipy has a continuous range of "types" (for lack of a better word).

I admit that I do not understand all of the ins and outs of the statistics involved so I settled on a brute force evaluation. I found an xtile example at http://www.biostat.sdu.dk/~biostat/StataReferenceManual/StataRef.pdf and was able to match the results with alphap=0.5, and betap=0.5 (piecewise linear). Not definitive nor exhaustive, but all I have right now.

In [1]: import scipy.stats as st

In [9]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.5],alphap=0.5,betap=.5)
Out[9]: array([ 61.5])

In [10]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.33,0.66],alphap=0.5,betap=.5)
Out[10]: array([ 38.84,  81.72])

In [11]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.25,0.5,0.75],alphap=0.5,betap=.5)
Out[11]: array([ 23. ,  61.5,  99. ])

The last is a little problematic since two of the division points are exactly on values in the data set. Stata/xtile (at least in the examples that I found) does not give the split points for the quantiles but gives the quantiles themselves. Given the sorted data set [17,23,56,67,99,123], Stata/xtile gave the categorization as [1,1,2,3,3,4] which means that for scipy.stat.mquantiles to match the upper bound of a quantile is greater than or equal to all values in that quantile.