R replacing columns by lookup to dictionary

2019-07-13 02:30发布

问题:

In this question I need to be able to lookup a value from a dataframe's column not only based on one attribute, but based on more attributes and range comparing against a dictionary. (Yes, this is actually a continuation of a story in R conditional replace more columns by lookup )

It should be easy question for R-known ppl, because I provide working solution for basic indexing, that needs to be upgraded, possibly easily ... but it is very hard for me, because Iam in a process of learning R.

From where to start:

When I do want to replace missing values from columns testcolnames from (big) table df1 according to column default of (small) dictionary testdefs (row selected by making testdefs$LABMET_ID equal to column name from testcolnames), I use this code:

testcolnames=c("80","116") #...result of regexp on colnames(df1), originally much longer

df1[,testcolnames] <- lapply(testcolnames, function(x) { tmpcol<-df1[,x];
  tmpcol[is.na(tmpcol)] <- testdefs$default[match(x, testdefs$LABMET_ID)];
  tmpcol  }) 

To where to go:

Now - I need to upgrade this solution. The table testdefs will have (example below) multiple rows of the same LABMET_ID differing only by new two columns called lower and upper ... which need to be the bounds for variable df1$rngvalue when selecting which value to replace.

In another words - to upgrade this solution to not only select the row from testdefs (where testdefs$LABMET_ID equals the column name), but to select from these rows such a row, where df1$rngvalue is in the bounds of testdefs$lower and testdefs$upper (if none such exists, take the range closest - either the lowest or the highest, if the dictionary doesnt have LABMET_ID, we can leave NA in the original data).

An example:

testdefs

"LABMET_ID","lower","upper","default","notuse","notuse2"
30,0,54750,25,80,2            #..."many columns we dont care about"
46,0,54750,1.45,3.5,0.2
80,0,54750,0.03,0.1,0.01
116,0,30,0.09,0.5,0.01
116,31,365,0.135,0.7,0.01
116,366,5475,0.11,0.7,0.01
116,5476,54750,0.105,0.7,0.02

df1:

"rngvalue","80","116"
36,NA,NA
600000,NA,NA
367,5,NA
90,NA,6

to be transformed into:

"rngvalue","80","116"
36,0.03,0.135                   #col80 is always replaced by 0.03
600000,0.03,0.105               #col116 needs to be decided on range, this value is bigger than everything in dictionary so take the last one
367,5,0.11                      #5 not replaced, but second column nicely looks up to 0.11
90,0.03,6                       #6 not replaced

回答1:

Since the intervals don't have gaps, you can use findInterval. I would change the lookup table to a list containing the break points and defaults for each value using dlply from plyr.

## Transform lookup table to a list with breaks for intervals
library(plyr)
lookup <- dlply(testdefs, .(LABMET_ID), function(x)
    list(breaks=c(rbind(x$lower, x$upper), x$upper[length(x$upper)])[c(T,F)],
         default=x$default))

So, the lookups now look like

lookup[["116"]]
# $breaks
# [1]     0    31   366  5476 54750
# 
# $default
# [1] 0.090 0.135 0.110 0.105

Then, you can do the lookup with the following

testcolnames=c("80","116")

df1[,testcolnames] <- lapply(testcolnames, function(x) {
    tmpcol <- df1[,x]
    defaults <- with(lookup[[x]], {
        default[pmax(pmin(length(breaks)-1, findInterval(df1$rngvalue, breaks)), 1)]
    })
    tmpcol[is.na(tmpcol)] <- defaults[is.na(tmpcol)]
    tmpcol
})

#   rngvalue   80   116
# 1       36 0.03 0.135
# 2   600000 0.03 0.105
# 3      367 5.00 0.110
# 4       90 0.03 6.000

The findInterval returns values below and above the number of breaks if the rngvalue is outside of the range. That is the reason for the pmin and pmax in the code above.