A simple question about a simple, seemingly innocent function: summary.
Until I saw results for Min and Max that were outside the range of my data, I was unaware that summary has a digits argument to specify the precision of the output. My question is about how to address this in a clean, universal manner.
Here is an example of the issue:
set.seed(0)
vals <- 1 + 10 * 1:50000
df <- cbind(rnorm(10000),sample(vals, 10000), runif(10000))
Applying summary and range, we get the following output. Notice the discrepancy between the range values and the Min and Max:
> apply(df, 2, summary)
             [,1]   [,2]      [,3]
Min.    -3.703000     11 6.791e-05
1st Qu. -0.668500 122800 2.498e-01
Median   0.009778 248000 5.014e-01
Mean     0.010450 248800 5.001e-01
3rd Qu.  0.688800 374000 7.502e-01
Max.     3.568000 499900 9.999e-01
> apply(df, 2, range)
          [,1]   [,2]         [,3]
[1,] -3.703236     11 6.790622e-05
[2,]  3.568101 499931 9.998686e-01
Seeing erroneous ranges in summary is a little disconcerting, so I looked at the digits option, but this is simply the standard notation for formatting output. Also note: every single quantile other than Min shows a value that does not exist in the data set (this is why I put a 1 + in the definition for vals), nor would one see these quantiles in most standard quantile calculations, even allowing for differences in midpoint selection. (When I saw this in the original data, I wondered how I had lost a value of 1 from everything!)
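To make the discrepancy concrete, here is a minimal sketch using the second column's range values copied from the output above. It assumes the displayed figures come from rounding to 4 significant digits, which is an inference from the output rather than something confirmed at this point in the post.

```r
# Range values copied from the apply(df, 2, range) output above.
actual_min <- 11
actual_max <- 499931

# Assumption: the summary output is rounded to 4 significant digits.
shown_min <- signif(actual_min, 4)
shown_max <- signif(actual_max, 4)

shown_min  # 11: small enough to survive the rounding
shown_max  # 499900: matches the Max. shown by summary, not the data

# Every element of vals has the form 1 + 10 * k, so a value in the data
# must leave a remainder of 0 here. The rounded Max. does not.
(shown_max - 1) %% 10 == 0  # FALSE
```

Under that assumption, the rounded Max is both below the true maximum and not a member of vals at all.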
There is a difference between explicable computational behavior (i.e. formatting and precision) and statistically motivated expectations (i.e. that values identified as quantiles actually lie within the range of the dataset). Since we can't change the expectations, we need to change the behavior of the code, or at least improve it.
The question: Is there some more appropriate way to set the output to be sure of the range, other than setting it to a large value, e.g. digits = 16? Is 16 even the most appropriate universal default? Using 16 digits seems to be the best guarantee of precision for double floats, though it seems the output will not actually have 16 digits (the output still seems to be truncated to 8 or 9 digits).
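One way to be sure of the range without picking a digits value at all is to bypass summary and compute the same six statistics directly. The following is only a sketch of that idea: exact_summary is a hypothetical helper name (not part of base R), and it relies on the default quantile type, which may or may not be the one you want.

```r
# Hypothetical helper: the same six statistics that summary() reports,
# computed directly so that no rounding is applied to the values.
exact_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c("Min."    = min(x),
    "1st Qu." = q[1],
    "Median"  = q[2],
    "Mean"    = mean(x),
    "3rd Qu." = q[3],
    "Max."    = max(x))
}

# Same data as in the example above.
set.seed(0)
vals <- 1 + 10 * 1:50000
df <- cbind(rnorm(10000), sample(vals, 10000), runif(10000))

res <- apply(df, 2, exact_summary)

# Min. and Max. now agree exactly with range().
all(res[c("Min.", "Max."), ] == apply(df, 2, range))
```

The values in res are stored at full double precision; how many digits are displayed when res is printed is still governed by the global digits option.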
Update 1: As @BrianDiggs has noted, via the links, the behavior is documented, but unexpected. To clarify my issue, relative to the answers on the link provided by Brian (excepting the answer by Brian himself): it's not that the behavior is undocumented, but it's flatly wrong to denote as Min and Max values which are not Min and Max. A documented function that gives incorrect output in its default settings needs to be used with non-default settings (or should not be used). (Maybe one could argue whether "Min" and "Max" should be renamed as "Approximate Min" and "Approximate Max", but let's not go there.)
Update 2: As @Dwin has noted, summary() takes as its default max(3, getOption("digits") - 3). I'd previously erred in saying the default was 3. What's interesting about this is that this implies two ways to set the behavior of the output. If we use both, the behavior gets weird:
> options(digits = 20)
> apply(df, 2, summary, digits = 10)
                             [,1]                  [,2]                      [,3]
Min.    -3.7032358429999998605808     11.00000000000000 6.7906221370000004927e-05
1st Qu. -0.6684710537000000396546 122798.50000000000000 2.4977348059999998631e-01
Median   0.0097783099960000001427 247971.00000000000000 5.0137970539999998643e-01
Mean     0.0104475229200000005458 248776.38699999998789 5.0011818200000002221e-01
3rd Qu.  0.6887842181000000119084 374031.00000000000000 7.5024240300000000214e-01
Max.     3.5681007909999999938577 499931.00000000000000 9.9986864070000003313e-01
Notice that this now has 20 digits of output, even though the argument passed specifies 10 digits of precision. If we set the global option for digits to be some "sane" value like 16, we still end up with issues if we provide summary with an argument of 10.
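One plausible reading of this output is that two separate steps are involved: the value is first rounded to the number of significant digits passed to summary, and the rounded value is then printed according to the global digits option. The sketch below imitates those two steps with signif and format on a hypothetical value; it is an illustration of that reading, not a trace of what summary does internally, and the details may differ between R versions.

```r
x <- 1 / 3  # hypothetical value, not taken from the data above

# Step 1 (assumed): round to 10 significant digits, as digits = 10 would.
rounded <- signif(x, 10)

# Step 2: display the rounded value at two different print precisions.
short <- format(rounded, digits = 10)
long  <- format(rounded, digits = 20)

short
long   # shows the binary representation error of the rounded value

# The default from Update 2, evaluated for the factory-default
# options(digits = 7).
max(3, 7 - 3)  # 4
```

If this reading is right, the long tails of digits in the output above are just the floating-point representation of the 10-digit rounded values, exposed by printing at 20 digits.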
I believe the documentation is incomplete, and Brian Diggs has pointed out other issues with it in his thoughtful answer in the link to R-help.
Despite these wrinkles, the question remains open, but maybe it can't be answered. I suspect that the best result is simply to leave the global digits option as-is (though I am a little disturbed by the implications of the above behavior) and instead pass a value of 16 to summary. It isn't immediately obvious where the output precision is specified, but this interaction of 4 values - the global option (and the global option - 3), the passed value, and a hard-coded value of 12 in summary.data.frame - looks like (have meRcy on my soul for saying this) a hack.
Update 3: I'm accepting DWin's answer - it led me to understand how this sausage is made. Seeing what is going on, I don't think there's a way to do what I ask without rewriting summary.