Main question: Suppose you have a discrete, finite data set $d$. Then the command `summary(d)` returns the min, 1st quartile, median, mean, 3rd quartile, and max. My question is: what formula does R use to compute the 1st quartile?
Background: My data set was `d = c(1, 2, 3, 3, 4, 9)`, and `summary(d)` returns `2.25` as the first quartile. Now, one way to compute the first quartile is to choose a value $q_1$ such that 25% of the data set is less than or equal to $q_1$. Clearly, this is not what R is using. So I was wondering: what formula does R use to compute the first quartile?
Google searches on this topic have left me even more puzzled, and I couldn't find a formula that R uses. Typing `help(summary)` in R wasn't helpful to me either.
General discussion:
There are many different possibilities for sample quantile functions; we want them to have various properties (including being simple to understand and explain!), and depending on which properties we want most, we might prefer different definitions.
As a result, different packages use a wide variety of definitions between them.
The paper by Hyndman and Fan [1] gives six desirable properties for a sample quantile function, lists nine existing definitions of the sample quantile function, and notes which (of a number of common) packages use which definitions. The upshot of its Introduction is that, in general, the sample quantiles can all be written as some kind of weighted average of two adjacent order statistics (though it may be that all the weight is on one of them).
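In symbols (this is the usual way of writing Hyndman & Fan's general form, paraphrased rather than quoted): writing $x_{(j)}$ for the $j$-th smallest of the $n$ observations, every one of the nine definitions can be expressed as

$$\hat{Q}(p) = (1-\gamma)\,x_{(j)} + \gamma\,x_{(j+1)}, \qquad 0 \le \gamma \le 1,$$

where the type determines how $j$ and $\gamma$ are computed from $n$ and $p$.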
In R:
In particular, R offers all nine definitions mentioned in Hyndman & Fan, with type $7$ as the default.
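For type $7$, the defining choice (restating Hyndman & Fan's definition rather than quoting it) is to place the $k$-th of the $n$ order statistics at probability level

$$p_k = \frac{k-1}{n-1}, \qquad k = 1, \dots, n,$$

with quantiles at intermediate probabilities obtained by linear interpolation between the two neighbouring order statistics.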
What does this mean? Consider `n = 9`. Then for $(k-1)/(n-1) = 0.25$, you need $k = 1 + (9-1)/4 = 3$. That is, the lower quartile is the 3rd observation of 9. We can see that in R, as in the sketch below.
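A minimal check (the values in the comments are what the default `quantile` should return; worth re-running to confirm):

```r
x <- 1:9            # n = 9, so k = 1 + (9 - 1)/4 = 3 exactly
quantile(x, 0.25)   # 3: the 3rd observation
summary(x)          # 1st Qu. is likewise 3
```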
For its behavior when `n` is not of the form $4k+1$, the easiest thing to do is try it (see the sketch just below). When `k` isn't an integer, it takes a weighted average of the adjacent order statistics, in proportion to how far between them it lies (that is, it does linear interpolation). The nice thing is that on average you get 3 times as many observations above the first quartile as you get below it. So for 9 observations, for example, you get 6 above and 2 below the third observation, which divides them into the ratio 3:1.
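For instance, with $n = 10$ you get $k = 1 + (10-1)/4 = 3.25$, so the first quartile should land a quarter of the way from the 3rd to the 4th order statistic. A quick sketch (expected output in the comments):

```r
quantile(1:10, 0.25)   # 3.25: k = 3.25, a quarter of the way from x[3] = 3 to x[4] = 4
quantile(1:11, 0.25)   # 3.5 : k = 3.5, halfway between x[3] = 3 and x[4] = 4
```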
What's happening with your sample data
You have `d = c(1, 2, 3, 3, 4, 9)`, so `n` is 6. You need $(k-1)/(n-1)$ to be $0.25$, so $k = 1 + 5/4 = 2.25$. That is, the required quantile sits 25% of the way between the second and third observations (which happen to be 2 and 3), so the lower quartile is $2 + 0.25 \times (3-2) = 2.25$.
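A quick check of that arithmetic in R (expected output in the comments):

```r
d <- c(1, 2, 3, 3, 4, 9)
sort(d)[2] + 0.25 * (sort(d)[3] - sort(d)[2])   # 2.25, the interpolation done by hand
quantile(d, 0.25)                               # 2.25
summary(d)                                      # 1st Qu. reported as 2.25
```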
Under the hood: some R details:
When you call `summary` on a data frame, this results in `summary.data.frame` being applied to the data frame (i.e. the relevant `summary` method for the class you called it on). Its existence is mentioned in the help on `summary`.

The `summary.data.frame` function (ultimately, via `summary.default` applied to each column) calls `quantile` to compute quartiles. You won't see this in the help, unfortunately, since `?summary.data.frame` simply takes you to the `summary` help, and that doesn't give you any details on what happens when `summary` is applied to a numeric vector; this is one of those really bad spots in the help.
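If you want to trace that dispatch yourself, a couple of standard introspection calls will do it (a sketch; the exact source you see will depend on your R version):

```r
methods(summary)                    # lists summary.data.frame, summary.default, ...
getS3method("summary", "default")   # print its code and look for the call to quantile()
```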
So `?quantile` (or `help(quantile)`) describes what R does. It says two things, both taken pretty directly from Hyndman & Fan: first, the general statement that all of the sample quantile types are weighted averages of consecutive order statistics; and second, the specific definition of type 7, the $(k-1)/(n-1)$ rule worked through above.
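To tie the pieces together, here is a small hand-rolled version of the type-7 rule that you can compare against `quantile` and `summary` (the helper name `type7_q` is mine, not part of R):

```r
# type-7 sample quantile: put the k-th of n order statistics at
# p_k = (k - 1)/(n - 1) and interpolate linearly in between
type7_q <- function(x, p) {
  x  <- sort(x)
  n  <- length(x)
  k  <- 1 + (n - 1) * p        # possibly fractional index
  lo <- floor(k)
  hi <- ceiling(k)
  x[lo] + (k - lo) * (x[hi] - x[lo])
}

d <- c(1, 2, 3, 3, 4, 9)
type7_q(d, 0.25)              # 2.25
quantile(d, 0.25, type = 7)   # 2.25 (type 7 is the default anyway)
summary(d)["1st Qu."]         # 2.25 as well
```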
Hopefully the explanation I gave earlier helps to make more sense of what this is saying. The help on `quantile` pretty much just quotes Hyndman & Fan as far as definitions go, and its behavior is pretty simple.

Reference:
[1]: Rob J. Hyndman and Yanan Fan (1996), "Sample Quantiles in Statistical Packages," The American Statistician, Vol. 50, No. 4 (Nov.), pp. 361-365.
Also see the discussion here.