How should I handle very small p-values in R?
I am expecting very small p-values, such as:
1.00E-80
and I need to transform them with -log10:
-log10(1.00E-80)
-log10(0) is Inf, but here the Inf is an artifact of rounding/underflow rather than a true infinity. It seems that beyond 1.00E-308, R yields 0:
1/10^308
[1] 1e-308
1/10^309
[1] 0
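For reference, the relevant limits on my machine (a quick base-R check; exact values may vary by platform):
.Machine$double.xmin   # ~2.225074e-308: smallest positive normalized double
10^309                 # Inf (overflow), so 1/10^309 is 1/Inf, i.e. 0
-log10(0)              # Inf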
Is the precision of the p-values reported by the lm function limited by this same cutoff, 1e-308? Or is some cutoff simply unavoidable, so that I should pick my own -- say 1e-100 -- and replace 0 with <1e-100?
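The kind of replacement I have in mind would look something like this (a sketch only; p is a hypothetical vector of p-values, and the 1e-100 cutoff is arbitrary):
p <- c(1e-5, 1e-80, 1e-200, 0)
ifelse(p < 1e-100, "<1e-100", format(p, trim = TRUE))
## values below the cutoff print as "<1e-100"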
There are a variety of possible answers -- which one is most useful depends on the context:
- R is indeed incapable under ordinary circumstances of storing floating-point values closer to zero than .Machine$double.xmin, which varies by platform but is typically (as you discovered) on the order of 1e-308. If you really need to work with numbers this small and can't find a way to work on the log scale directly, search Stack Overflow or the R wiki for methods of dealing with arbitrary/extended-precision values (but you should probably try to work on the log scale -- it will be much less of a hassle).
- in many circumstances R actually computes p-values on the (natural) log scale internally, and can, if requested, return the log values rather than exponentiating them before giving the answer. For example, dnorm(-100,log=TRUE) gives -5000.919. You can convert directly to the log10 scale (without exponentiating and then using log10) by dividing by log(10): dnorm(-100,log=TRUE)/log(10) is about -2171.9, i.e. a value of roughly 1e-2172, which would be far too small to represent in floating point. For the p*** (cumulative distribution function) functions, use log.p=TRUE rather than log=TRUE; a runnable sketch of this appears after this list. (This particular point depends heavily on your particular context. Even if you are not using built-in R functions, you may be able to find a way to extract results on the log scale.)
- in some cases R presents p-value results as being <2.2e-16 even when a more precise value is known. For example,
(t1 <- t.test(rnorm(10,100),rnorm(10,80)))
prints
....
t = 56.2902, df = 17.904, p-value < 2.2e-16
but you can still extract the precise p-value from the result:
> t1$p.value
[1] 1.856174e-18
(in many cases this behaviour is controlled by the format.pval() function; a short demonstration follows this list)
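As a runnable sketch of the second point (an illustration added here, with an arbitrary z = 100): a two-sided normal-theory p-value on the log10 scale, computed without ever forming the unrepresentable number itself:
z <- 100
## double the one-tailed value by adding log(2) on the natural-log scale,
## then rescale to log10 by dividing by log(10)
(log(2) + pnorm(abs(z), lower.tail = FALSE, log.p = TRUE))/log(10)
## roughly -2173.6, i.e. a p-value near 1e-2174, which underflows to 0
And a sketch of the third point, showing how format.pval() produces the truncated display (outputs shown are approximate; the default eps is .Machine$double.eps):
format.pval(1.856174e-18)           # "< 2.2204e-16" with the default eps
format.pval(1.856174e-18, eps = 0)  # "1.8562e-18": the full value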
An illustration of how all this would work with lm:
d <- data.frame(x=rep(1:5,each=10))
set.seed(101)
d$y <- rnorm(50,mean=d$x,sd=0.0001)
lm1 <- lm(y~x,data=d)
summary(lm1)
prints the p-value of the slope as <2.2e-16, but if we use coef(summary(lm1)) (which does not apply the p-value formatting), we can see that the value is 9.690173e-203.
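If you just want that one number, you can index the coefficient table directly ("Pr(>|t|)" is the standard column heading in coef(summary()) output):
coef(summary(lm1))["x","Pr(>|t|)"]
## [1] 9.690173e-203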
A more extreme case:
set.seed(101); d$y <- rnorm(50,mean=d$x,sd=1e-7)
lm2 <- lm(y~x,data=d)
coef(summary(lm2))
shows that the p-value has actually underflowed to zero. However, we can still get an answer on the log scale:
tval <- coef(summary(lm2))["x","t value"]
log10(2) + pt(abs(tval),df=48,lower.tail=FALSE,log.p=TRUE)/log(10)
gives roughly -346.0. (Note that the doubling for a two-sided test has to be done by adding log10(2) on the log scale, not by multiplying the logged value by 2.) You can check this approach with the previous example, where the p-value doesn't underflow, and see that you match the value shown by coef(summary(lm1)).
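If this is needed repeatedly, it can be wrapped in a small helper (a sketch, not part of base R; log10_pvals is a made-up name, and it assumes the standard coef(summary()) column names):
## log10 p-values for every coefficient of an lm fit, computed on the log scale
log10_pvals <- function(fit) {
  tab <- coef(summary(fit))
  log10(2) + pt(abs(tab[, "t value"]), df = df.residual(fit),
                lower.tail = FALSE, log.p = TRUE) / log(10)
}
log10_pvals(lm2)  # the "x" entry should again be roughly -346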
Small numbers are generally hard to deal with.
This lower limit in R is caused by the use of double-precision floating point. From ?double:
All R platforms are required to work with values conforming to the IEC 60559 (also known as IEEE 754) standard. This basically works with a precision of 53 bits, and represents to that precision a range of absolute values from about 2e-308 to 2e+308.
http://en.wikipedia.org/wiki/Double_precision_floating-point_format
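These limits can be inspected directly (a quick check; the values shown are the usual IEEE 754 ones, but they can in principle differ by platform):
.Machine$double.xmin  # ~2.225074e-308, smallest positive normalized double
.Machine$double.xmax  # ~1.797693e+308, largest finite double
.Machine$double.eps   # ~2.220446e-16, the 53-bit relative precision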
You may find the Rmpfr package helpful here, as it allows you to create multiple-precision numbers.
install.packages("Rmpfr")
require(Rmpfr)
## note: 1/10^309 would already be 0 in double precision before mpfr() saw it
## (10^309 overflows to Inf), so build the small number inside mpfr() instead:
log(1/mpfr(10, precBits = 500)^309)   # about -711.4988, i.e. log(1e-309)
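And on the -log10 scale that the question asks for (same setup as above):
-log(1/mpfr(10, precBits = 500)^309)/log(10)   # 309, exactly the exponent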