Definitions of quantiles in R

2019-01-19 00:44发布

问题:

Main question: Suppose you have a discrete, finite data set $d$. Then the command summary(d) returns the Min, 1st quartile, Median, mean, 3rd quartile, and max. My question is: what formula does R use to compute the 1st quartile?

Background: My data set was: d=c(1,2,3,3,4,9). summary(d) returns 2.25 as the first quartile. Now, one way to compute the first quartile is to choose a value q1 such that 25% of the data set is less than of equal to q1. Clearly, this is not what R is using. So, I was wondering, what formula does R use to compute the first quartile?

Google searches on this topic have left even even more puzzled, and I couldn't find a formula that R uses. Typing help(summary) in R wasn't helpful to me too.

回答1:

General discussion:

There are many different possibilities for sample quantile functions; we want them to have various properties (including being simple to understand and explain!), and depending on which properties we want most, we might prefer different definitions.

As a result, the wide variety of packages between them use many different definitions.

The paper by Hyndman and Fan [1] gives six desirable properties for a sample quantile function, lists nine existing definitions for the quantile function, and mentions which (of a number of common) packages use which definitions. Its Introduction says (sorry, the mathematics in this quote doesn't render properly any more, since it was moved to SO):

the sample quantiles that are used in statistical packages are all based on one or two order statistics, and can be written as

\hat{Q}_i(p) = (1 - γ) X_{(j)} + γ X_{(j+1)}\,,
where \frac{j-m}{n}\leq p< \frac{j-m+1}{n} \quad (1)

for some m\in \mathbb{R} and 0\leq\gamma\leq 1.

Which is to say, in general, the sample quantiles can be written as some kind of weighted average of two adjacent order statistics (though it may be that there's only weight on one of them).

In R:

In particular, R offers all nine definitions mentioned in Hyndman & Fan (with $7$ as the default). From Hyndman & Fan we see:

Definition 7. Gumbel (1939) also considered the modal position $p_k = \text{mode}\,F(X_{(k)}) = (k-l)/(n-1)$. One nice property is that the vertices of $Q_7(p)$ divide the range into $n-1$ intervals, and exactly $100p\%$ of the intervals lie to the left of $Q_7(p$) and $100(1-p)\%$ of the intervals lie to the right of $Q_7(p)$.

What does this mean? Consider n=9. Then for (k-1)/(n-1) = 0.25, you need k = 1+(9-1)/4 = 3. That is, the lower quartile is the 3rd observation of 9.

We can see that in R:

quantile(1:9)
  0%  25%  50%  75% 100% 
   1    3    5    7    9 

For its behavior when n is not of the form 4k+1, the easiest thing to do is try it:

> quantile(1:10)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 
> quantile(1:11)
  0%  25%  50%  75% 100% 
 1.0  3.5  6.0  8.5 11.0 
> quantile(1:12)
   0%   25%   50%   75%  100% 
 1.00  3.75  6.50  9.25 12.00 

When k isn't integer, it's taking a weighted average of the adjacent order statistics, in proportion to the fraction it lies between them (that is, it does linear interpolation).

The nice thing is that on average you get 3 times as many observations above the first quartile as you get below. So for 9 observations, for example, you get 6 above and 2 below the third observation, which divides them into the ratio 3:1.

What's happening with your sample data

You have d=c(1,2,3,3,4,9), so n is 6. You need (k-1)/(n-1) to be 0.25, so k = 1 + 5/4 = 2.25. That is, it takes 25% of the way between the second and third observation (which coincidentally are themselves 2 and 3), so the lower quartile is 2+0.25*(3-2) = 2.25.

Under the hood: some R details:

When you call summary on a data frame, this results in summary.data.frame being applied to the data frame (i.e. the relevant summary for the class you called it on). Its existence is mentioned in the help on summary.

The summary.data.frame function (ultimately -- via summary.default applied to each column) calls quantile to compute quartiles (you won't see this in the help, unfortunately, since ?summary.data.frame simply takes you to the summary help and that doesn't give you any details on what happens when summary is applied to a numeric vector -- this is one of those really bad spots in the help).

So ?quantile (or help(quantile)) describes what R does.

Here are two things it says (based directly off Hyndman & Fan). First, it gives general information:

All sample quantiles are defined as weighted averages of consecutive order statistics. Sample quantiles of type i are defined by:

Q[i](p) = (1 - γ) x[j] + γ x[j+1],

where 1 ≤ i ≤ 9, (j-m)/n ≤ p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, the value of γ is a function of j = floor(np + m) and g = np + m - j, and m is a constant determined by the sample quantile type.

Second, there's specific information about method 7:

Type 7
m = 1-p

. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.

Hopefully the explanation I gave earlier helps to make more sense of what this is saying. The help on quantile pretty much just quotes Hyndman & Fan as far as definitions go, and its behavior is pretty simple.


Reference:

[1]: Rob J. Hyndman and Yanan Fan (1996),
"Sample Quantiles in Statistical Packages,"
The American Statistician, Vol. 50, No. 4. (Nov.), pp. 361-365

Also see the discussion here.



标签: r quantile