Mysterious behaviour of seq and == operator. A pre

Posted 2019-06-24 12:29

Question:

I've come across a somewhat weird (or just unexpected?) behaviour of the function seq. When creating a simple sequence, some values cannot be matched correctly with the == operator. See this minimal example:

my.seq <- seq(0, 0.4, len = 5)
table(my.seq)                  # ok! returns  0 0.1 0.2 0.3 0.4 
                               #              1   1   1   1   1 

which(my.seq == 0.2)           # ok! returns  3
which(my.seq == 0.3)           # !!! returns  integer(0)

When creating my sequence manually, it seems to work, though:

my.seq2 <- c(0.00, 0.10, 0.20, 0.30, 0.40)

which(my.seq2 == 0.3)           # ok! returns  4

Do you have any explanation for that? I solved the issue by using which(round(my.seq, 2) == 0.3) but I would be interested in what's causing the problem.

Thanks in advance for your comments.

Answer 1:

Computers cannot represent most decimal fractions exactly in binary floating point. Spreadsheets, which are how most people deal with numbers on computers, tend to hide this, and that has led to many problems.

Never match against exact floating point values. There are functions in R to deal with this (e.g. all.equal) but I prefer the following.

Say you have an unknown floating point variable A and you want to see if it is equal to 0.5.

abs(A - 0.5) < tol

Set tol to how close A needs to be to 0.5. For example, tol <- 0.0001 might be fine for you.
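Applied to the sequence from the question, the tolerance check might look like the sketch below. The value of tol is an assumption picked for illustration; choose it to suit the scale of your data. The all.equal variant is shown only for comparison.

```r
# Sequence from the question (length.out spelled out instead of len)
my.seq <- seq(0, 0.4, length.out = 5)

# Assumed tolerance, chosen only for illustration
tol <- 1e-8

# Exact comparison finds nothing, as reported in the question
idx_exact <- which(my.seq == 0.3)

# Tolerance comparison finds the fourth element
idx_tol <- which(abs(my.seq - 0.3) < tol)

# all.equal returns TRUE or a description of the difference,
# so wrap it in isTRUE() before using it as a condition
same <- isTRUE(all.equal(my.seq[4], 0.3))

print(idx_exact)
print(idx_tol)
print(same)
```

Note that all.equal compares whole objects, so it suits a single yes/no test, while the abs(...) < tol form works element by element inside which().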

If your values should be integers, just round them. Or, if you know the number of decimal places you want to test to, round to that many places.
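A minimal sketch of the rounding approach, again using the sequence from the question. Rounding both sides to the same number of places is a precaution added here, not something the answer requires; the variable names are made up for the example.

```r
my.seq <- seq(0, 0.4, length.out = 5)

# Round both the data and the target to 2 decimal places before comparing
idx_round <- which(round(my.seq, 2) == round(0.3, 2))

# Integer-like value produced by floating point arithmetic
x <- 0.1 * 3 * 10
is_three <- round(x) == 3

print(idx_round)
print(is_three)
```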



Answer 2:

Computers have a tough time storing exact values.

> options(digits=22)
> seq(0, .4, len = 5)
[1] 0.0000000000000000000000 0.1000000000000000055511 0.2000000000000000111022
[4] 0.3000000000000000444089 0.4000000000000000222045
> .4
[1] 0.4000000000000000222045
> c(0, .1, .2, .3, .4)
[1] 0.0000000000000000000000 0.1000000000000000055511 0.2000000000000000111022
[4] 0.2999999999999999888978 0.4000000000000000222045

Since we're using a binary floating point representation, we can't represent the values of interest exactly. It looks as though, because the stored value for .4 is slightly higher than 0.4, the value that seq computes for .3 ends up slightly higher than the value you get by typing .3 itself. I'm sure somebody else will provide a better explanation for this, but hopefully this sheds some light on the issue.
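One way to see where the difference comes from is to redo the arithmetic by hand. The sketch below assumes that seq builds its interior elements as from + k * step, with step = (to - from) / (length.out - 1); that is an assumption about its internals, not something shown in the output above.

```r
# Assumed reconstruction of how the fourth element is computed
step <- (0.4 - 0) / (5 - 1)
computed <- 0 + 3 * step

# The literal as typed by hand
typed <- 0.3

# Print both with enough digits to see the difference
print(sprintf("%.22f", c(computed, typed)))

computed == typed   # FALSE
computed > typed    # TRUE
```

The multiplication carries the small representation error of the step along with it, while the literal 0.3 is simply rounded to the nearest representable value, so the two land on different neighbouring doubles.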



Answer 3:

This is R FAQ 7.31 ("Why doesn't R think these numbers are equal?"), which also has a link to a longer discussion of the problem in general.



Tags: r precision seq