Merging dataframes with all.equal on numeric(float

Posted 2019-09-15 05:04

Question:

I have two data frames that I want to merge on a numeric column; however, I am having trouble with floating point accuracy. Example:

> df1 <- data.frame(number = 0.1 + seq(0.01,0.1,0.01), letters = letters[1:10])
> df2 <- data.frame(number = seq(0.11,0.2,0.01), LETTERS = LETTERS[1:10])
> (merged <- merge(df1, df2, by = "number", all = TRUE))
   number letters LETTERS
1    0.11       a       A
2    0.12    <NA>       B
3    0.12       b    <NA>
4    0.13       c       C
5    0.14       d       D
6    0.15    <NA>       E
7    0.15       e    <NA>
8    0.16       f       F
9    0.17       g       G
10   0.18       h       H
11   0.19       i       I
12   0.20       j       J

Some of the values (0.12 and 0.15) don't match up due to floating point accuracy issues, as discussed in this post. The solution for finding equality there was to use the all.equal function, which compares within a tolerance; however, I don't believe there is a way to do this within the merge function.
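
A minimal illustration of the mismatch, using the same two key columns as in the example above: exact comparison with == fails for some elements, while all.equal, which compares within a tolerance, treats the two columns as equal. The names x and y are only for illustration.

```r
x <- 0.1 + seq(0.01, 0.1, 0.01)  # key column of df1
y <- seq(0.11, 0.2, 0.01)        # key column of df2

# Exact comparison: some elements differ in their last bits
print(x == y)

# Comparison within a tolerance: the two vectors are treated as equal
print(isTRUE(all.equal(x, y)))

# Printing 17 significant digits exposes what the default display hides
print(format(x[x != y], digits = 17))
print(format(y[x != y], digits = 17))
```

merge matches keys exactly, which is why the rows whose keys differ in the last bits end up unmatched.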

Currently I get around it by converting one of the number columns to character and then back to numeric after the merge, but this is a little clunky; does anyone have a better solution for this problem?

> df1c <- df1
> df1c[["number"]] <- as.character(df1c[["number"]])
> merged2 <- merge(df1c, df2, by = "number", all = TRUE)
> merged2[["number"]] <- as.numeric(merged2[["number"]])
> merged2
   number letters LETTERS
1    0.11       a       A
2    0.12       b       B
3    0.13       c       C
4    0.14       d       D
5    0.15       e       E
6    0.16       f       F
7    0.17       g       G
8    0.18       h       H
9    0.19       i       I
10   0.20       j       J

EDIT: A little more about the data

  • I wanted to keep my question general to make it more applicable to other people's problems, but it seems I may need to be more specific to get an answer.
  • It is likely that all of the issues with merging will be due to floating point inaccuracy, but it may be a little hard to be sure. The data comes in as a series of time series values, a start time, and a frequency. These are then turned into a time series (ts) object and a number of functions are called to extract features from the time series (one of which is the time value), which is returned as a data frame. Meanwhile another set of functions is being called to get other features from the time series as targets. There are also potentially other series getting features generated to complement the original series. These values then have to be reunited using the time value.
  • Can't store as POSIXct: Each of these processes (feature extraction, target computation, merging) has to be able to occur independently and be stored in a CSV type format so it can be passed to other platforms. Storing as a POSIXct value would be difficult since the series aren't necessarily stored in calendar times.

Answer 1:

Round to the level of precision that will allow the number to be equal.

> df1$number <- round(df1$number, 2)
> df2$number <- round(df2$number, 2)
> (merged <- merge(df1, df2, by = "number", all = TRUE))
   number letters LETTERS
1    0.11       a       A
2    0.12       b       B
3    0.13       c       C
4    0.14       d       D
5    0.15       e       E
6    0.16       f       F
7    0.17       g       G
8    0.18       h       H
9    0.19       i       I
10   0.20       j       J

If you need to choose the level of precision programmatically then you should tell us more about the data and whether we can perhaps assume that any mismatch will always be due to floating point inaccuracy. If so, then rounding to 10 decimal places should be fine. The all.equal function uses a default tolerance of sqrt(.Machine$double.eps), which is about 1.5e-8 and is applied to the relative difference, so in practice it is roughly comparable to agreement in the first 7 or 8 significant digits.
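
As a rough sketch of the rounding idea applied programmatically: the helper merge_rounded below is hypothetical (it is not part of base R), and its default of 10 digits rests on the assumption stated above, namely that the keys differ only by floating point noise that is far smaller than the spacing between genuine values.

```r
# Default tolerance used by all.equal() for numeric comparison
tol <- sqrt(.Machine$double.eps)
print(tol)  # about 1.49e-08

df1 <- data.frame(number = 0.1 + seq(0.01, 0.1, 0.01), letters = letters[1:10])
df2 <- data.frame(number = seq(0.11, 0.2, 0.01), LETTERS = LETTERS[1:10])

# Hypothetical helper: round the key column in both data frames, then merge.
# Assumes `by` names a single numeric column present in both x and y.
merge_rounded <- function(x, y, by, digits = 10, ...) {
  x[[by]] <- round(x[[by]], digits)
  y[[by]] <- round(y[[by]], digits)
  merge(x, y, by = by, ...)
}

merged <- merge_rounded(df1, df2, by = "number", all = TRUE)
print(merged)
```

Rounding changes the stored key values slightly, so if the original values matter they would need to be kept in a separate column before merging.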