R: Find if value is within a certain percentage of

Posted 2019-07-19 10:30

Question:

I have a dataframe of values and for each value in the dataframe I want to determine if it is within say 10% of any other value in its row. I want to do this generically as I do not know how many columns I will have nor the names of the columns.

Some values are NA. If all the other values in a row are NA, I want to return TRUE for the one remaining value. For the values that are themselves NA, I want to return FALSE. The values are all non-negative, so they can be 0.

For example, say I have the following dataframe:

dataDF <- data.frame(
                     a = c(100, 250,  NA, 700,   0),
                     b = c(105, 300, 280,  NA,   0),
                     c = c(200, 400, 280,  NA,   0)
                     )

In the first row we have a = 100, b = 105 and c = 200. a and b are within 10% of each other, so both would be TRUE. c is not within 10% of either a or b, so it would be FALSE.

In the second row no values are within 10% of each other, so all would be FALSE.

In the third row b and c are equal, so both are TRUE. a is NA, so it is FALSE.

In the fourth row we only have a value for a, so it is returned as TRUE. b and c are NA, so they are FALSE.

In the final row all values are the same, so we would have TRUE for all.

So my output would be

data.frame(
           a = c( TRUE, FALSE, FALSE,  TRUE, TRUE),
           b = c( TRUE, FALSE,  TRUE, FALSE, TRUE),
           c = c(FALSE, FALSE,  TRUE, FALSE, TRUE)
          )

How I calculate the percentage difference doesn't really matter, but the way I was going to do it is to divide the absolute difference by the average of the two values, so that I get the same result whichever of the two values I start from.

So for example to calculate the percentage difference between 100 and 105 it would be:

abs(100 - 105)/((100 + 105)/2) = 5/102.5 = 0.0488
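
The calculation above could be wrapped in a small helper, roughly as sketched here. The name `pct_diff` and the convention that two zeros count as "no difference" are assumptions made for illustration (the formula alone gives 0/0 for a pair of zeros), not something stated in the post.

```r
# Symmetric relative difference: absolute difference over the mean of the pair.
# `pct_diff` is a hypothetical helper name used only for this sketch.
pct_diff <- function(x, y) {
  d <- abs(x - y) / ((x + y) / 2)
  # 0/0 gives NaN when both values are 0; assume identical zeros differ by 0.
  d[!is.na(x) & !is.na(y) & x == 0 & y == 0] <- 0
  d
}

pct_diff(100, 105)  # about 0.0488, matching the worked example
pct_diff(105, 100)  # same value: the measure is symmetric
pct_diff(0, 0)      # 0 under the convention assumed above
```

Any NA input simply propagates to an NA result, so the NA rules from the question still have to be handled separately.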

Any ideas on the quickest and neatest way of doing this would be appreciated.

Thanks

Answer 1:

Define a function and apply it to each row of your data.frame:

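fun <- function(vec)
{
  n <- length(vec)

  # A row with no values at all: nothing can be TRUE
  if (all(is.na(vec)))
    return(rep(FALSE, n))

  noNA <- vec[!is.na(vec)]

  # Only one distinct non-NA value (a lone value, or all values equal):
  # TRUE for every non-NA entry, FALSE for the NAs
  if (length(unique(noNA)) == 1)
    return(!is.na(vec))

  res <- rep(FALSE, n)

  # TRUE if the value is within 10% of at least one other value in the row.
  # The tolerance is 10% of the *other* value; NA comparisons are dropped by na.rm.
  for (i in seq_len(n))
    if (any(abs(vec[i] - vec[-i]) <= vec[-i] * 0.1, na.rm = TRUE))
      res[i] <- TRUE

  res
}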

output <- data.frame(t(apply(dataDF, 1, fun)))
names(output) <- names(dataDF)
output

This gives the desired result:

#      a     b     c
#1  TRUE  TRUE FALSE
#2 FALSE FALSE FALSE
#3 FALSE  TRUE  TRUE
#4  TRUE FALSE FALSE
#5  TRUE  TRUE  TRUE
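
One thing to be aware of: the test in `fun` measures the difference against 10% of the other value, so it is not symmetric. For the pair 100 and 111, for instance, 100 passes (11 <= 11.1) but 111 does not (11 <= 10). If the symmetric measure from the question is preferred, a variant along the following lines might work. This is only a sketch: `within_tol` and its `tol` argument are hypothetical names, and the inequality is rearranged to `|x - y| <= tol * (x + y) / 2` to avoid dividing by zero.

```r
# Hypothetical alternative to `fun`: uses the symmetric measure from the question
# (absolute difference relative to the mean of the pair) with a 10% tolerance.
within_tol <- function(vec, tol = 0.1) {
  ok <- !is.na(vec)
  # A lone non-NA value is TRUE by the question's rule; an all-NA row is all FALSE.
  if (sum(ok) <= 1) return(unname(ok))
  # Pairwise test, rearranged to avoid 0/0: |x - y| <= tol * (x + y) / 2.
  close <- outer(vec, vec, function(x, y) abs(x - y) <= tol * (x + y) / 2)
  diag(close) <- FALSE          # a value does not count as its own neighbour
  close[is.na(close)] <- FALSE  # comparisons involving NA never match
  unname(rowSums(close) > 0)
}

dataDF <- data.frame(
  a = c(100, 250,  NA, 700, 0),
  b = c(105, 300, 280,  NA, 0),
  c = c(200, 400, 280,  NA, 0)
)

result <- as.data.frame(t(apply(dataDF, 1, within_tol)))
names(result) <- names(dataDF)
result
```

On the example data this should produce the same table as above. The two versions only disagree for pairs that sit close to the 10% boundary.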