dplyr mutate column with nearest value in external

2019-08-24 10:47发布

问题:

I'm trying to mutate a column and populate it with exact matches from a list if those occur, and if not, the closest match possible.

My data frame looks like this:

index <- seq(1, 10, 1)
blockID <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
df <- as.data.frame(cbind(index, blockID))

   index blockID
1      1     100
2      2     120
3      3     132
4      4     133
5      5     201
6      6     207
7      7     210
8      8     238
9      9     240
10    10     256

I want to mutate a new column that checks whether blockID is in a list. If yes, it should just keep the value of blockID. If not, It should return the nearest value in blocklist:

blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256) 

so the additional column should contain

100 (match), 
120 (match), 
130 (no match for 132--nearest value is 130), 
130 (no match for 133--nearest value is 130), 
201, 
205 (no match for 207--nearest value is 205), 
210, 
238, 
240, 
256 

Here's what I've tried:

df2 <- df %>% mutate(blockmatch = ifelse(blockID %in% blocklist, blockID, ifelse(match.closest(blockID, blocklist, tolerance = Inf), "missing")))

I just put in "missing" to complete the ifelse() statements, but it shouldn't actually be returned anywhere since the preceding cases will be fulfilled for every value of blockID. However, the resulting df2 just has "missing" in all the cells where it should have substituted the nearest number. I know there are base R alternatives to match.closest but I'm not sure that's the problem. Any ideas?

回答1:

You don't need if..else. Your rule can simplified by saying that we always get the blocklist element with least absolute difference when compared to blockID. If values match then absolute difference is 0 (which will always be the least).

With that here's a simple base R solution -

df$blockmatch <- sapply(df$blockID, function(x) blocklist[order(abs(x - blocklist))][1])

   index blockID blockmatch
1      1     100        100
2      2     120        120
3      3     132        130
4      4     133        130
5      5     201        201
6      6     207        205
7      7     210        210
8      8     238        238
9      9     240        240
10    10     256        256

Here are a couple of ways with dplyr -

df %>% 
  rowwise() %>% 
  mutate(
    blockmatch = blocklist[order(abs(blockID - blocklist))][1]
  )

df %>% 
  mutate(
    blockmatch = sapply(blockID, function(x) blocklist[order(abs(x - blocklist))][1])
  )

Thanks to @Onyambu, here's a faster way -

df$blockmatch <- blocklist[max.col(-abs(sapply(blocklist, '-', df$blockID)))]


标签: r dplyr match