R conditional replace more columns by lookup

2019-05-22 17:50发布

Lets say we do have lots of data columns (with names mycols and also some unnamed ones that should not be processed in this case) in dataframe df1 and a column subj which is also an index to another dataframe df2 with columns repl and subj (in this second dataframe is subj unique) and much other nonimportant columns (their only role in this is, that we cannot suppose that there are just 2 columns).

I would like to replace a subset of columns ( df1[,mycols] ) in such a way, that if there is an NA ( df1[,mycols][is.na(df1[,mycols])] ) <- replace by a value of column df2$repl where the row in df2 has df2$subj = df1$subj.

EDIT: example data (I dont know the command to write it as dataframe assignment):

mycols = c("a","b")
df1:
subj a  b  c
1    NA NA 1
1    2  3  5
2    0  NA 2
3    8  8  8
df2:
subj repl notinterested
1     5    1000
2     6    0
3     40   10
result:
df1-transformed-to:
subj a  b  c
1    5  5  1      #the 2 fives appeared by lookup
1    2  3  5
2    0  6  2     #the 6 appeared
3    8  8  8

I came up with the following code:

df1[,mycols][is.na(df1[,mycols])] <- df2[match( df1$subj, df2$subj),"repl"] 

But the problem is (I think), that the right side is not the same size as the left side - I think it might work for one column in "mycols", but I want to do the same operation with all mycols (If NA, look to table df2 and replace - the replacing value is the same in the scope of the row).

(Also I need to enumerate the columns by names mycols explicitely everythime, because there might be another columns)

As a miniquestion as bonus about programming style - what is, in R, a good and a fast way to write this operation? If it would be a procedural language, we could transform

df1[,mycols][is.na(df1[,mycols])]

into an approach I consider more nice and more readable:

function(x){ *x[is.na(*x)] }
function(& df1[,mycols]) 

and being sure, that nothing gets unnecessarily copied from place to place.

3条回答
放荡不羁爱自由
2楼-- · 2019-05-22 18:18

Using your code, we need to replicate the 'repl' column to make the two subset datasets equal and then assign the values as you did

 val <- df2$repl[match(df1$subj, df2$subj)][row(df1[mycols])][is.na(df1[mycols])]
 df1[mycols][is.na(df1[mycols])] <- val
 df1
 #  subj a b c
 #1    1 5 5 1
 #2    1 2 3 5
 #3    2 0 6 2
 #4    3 8 8 8

Another option using data.table

 library(data.table)#v1.9.5+
 DT <- setDT(df1, key='subj')[df2[c('subj', 'repl')]]
 for(j in mycols){
   i1 <- which(is.na(DT[[j]]))
   set(DT, i=i1, j=j, value= DT[['repl']][i1])
   }
 DT[,repl:= NULL]
 #   subj a b c
 #1:    1 5 5 1
 #2:    1 2 3 5
 #3:    2 0 6 2
 #4:    3 8 8 8

Or with dplyr

 library(dplyr)
 left_join(df1, df2, by='subj') %>%
        mutate_each_(funs(ifelse(is.na(.),repl,.)), mycols) %>% 
        select(a:c)
 #  a b c
 #1 5 5 1
 #2 2 3 5
 #3 0 6 2
 #4 8 8 8

data

 df1 <-  structure(list(subj = c(1L, 1L, 2L, 3L), a = c(NA, 2L, 0L, 8L 
 ), b = c(NA, 3L, NA, 8L), c = c(1L, 5L, 2L, 8L)), .Names = c("subj", 
 "a", "b", "c"), class = "data.frame", row.names = c(NA, -4L))

 df2 <- structure(list(subj = 1:3, repl = c(5L, 6L, 40L),
 notinterested = c(1000L, 
 0L, 10L)), .Names = c("subj", "repl", "notinterested"), 
 class = "data.frame", row.names = c(NA, -3L))
查看更多
啃猪蹄的小仙女
3楼-- · 2019-05-22 18:30

Here's a possible solution using ifelse():

mycols <- c('a','b');
df1 <- data.frame(subj=c(1,1,2,3), a=c(NA,2,0,8), b=c(NA,3,NA,8), c=c(1,5,2,8) );
df2 <- data.frame(subj=c(1,2,3), repl=c(5,6,40), notinterested=c(1000,0,10) );
df1[mycols] <- ifelse(is.na(df1[mycols]),matrix(df2[match(df1$subj,df2$subj),'repl'],nrow(df1),length(mycols)),as.matrix(df1[mycols]));
df1;
##   subj a b c
## 1    1 5 5 1
## 2    1 2 3 5
## 3    2 0 6 2
## 4    3 8 8 8
查看更多
老娘就宠你
4楼-- · 2019-05-22 18:42

One way of doing this with base R:

mycols = c("a","b")
df1 <- read.table(text="subj a  b  c
1    NA NA 1
1    2  3  5
2    0  NA 2
3    8  8  8", header = TRUE)
df2 <- read.table(text="subj repl notinterested
1     5    1000
2     6    0
3     40   10", header = TRUE)
df1[mycols] <- lapply(df1[mycols], function(x) {
  x[is.na(x)] <- df2$repl[match(df1$subj[is.na(x)], df2$subj)]; x})
查看更多
登录 后发表回答