merge two dataframe based on matching two exchanga

2020-07-11 06:51发布

I have two dataframe in R.

dataframe 1

A B C D E F G
1 2 a a a a a
2 3 b b b c c
4 1 e e f f e

dataframe 2

X Y Z
1 2 g
2 1 h
3 4 i
1 4 j

I want to match dataframe1's column A and B with dataframe2's column X and Y. It is NOT a pairwise comparsions, i.e. row 1 (A=1 B=2) are considered to be same as row 1 (X=1, Y=2) and row 2 (X=2, Y=1) of dataframe 2.

When matching can be found, I would like to add columns C, D, E, F of dataframe1 back to the matched row of dataframe2, as follows: with no matching as na.

Final dataframe

X Y Z C  D  E  F  G
1 2 g a  a  a  a  a 
2 1 h a  a  a  a  a
3 4 i na na na na na
1 4 j e  e  f  f  e

I can only know how to do matching for single column, however, how to do matching for two exchangable columns and merging two dataframes based on the matching results is difficult for me. Pls kindly help to offer smart way of doing this.

For the ease of discussion (thanks for the comments by Vincent and DWin (my previous quesiton) that I should test the quote.) There are the quota for loading dataframe 1 and 2 to R.

df1 <- data.frame(A = c(1,2,4), B=c(2,3,1), C=c('a','b','e'), 
                                D=c('a','b','e'), E=c('a','b','f'), 
                                F=c('a','c','f'), G=c('a','c', 'e'))

df2  <- data.frame(X = c(1,2,3,1), Y=c(2,1,4,4), Z=letters[7:10])

5条回答
可以哭但决不认输i
2楼-- · 2020-07-11 07:31

Here's another possible solution in base R. This solution cbind()s new key columns (K1 and K2) to both data.frames using the vectorized pmin() and pmax() functions to derive the canonical order of the key columns, and merges on those:

merge(cbind(df2,K1=pmin(df2$X,df2$Y),K2=pmax(df2$X,df2$Y)),cbind(df1,K1=pmin(df1$A,df1$B),K2=pmax(df1$A,df1$B)),all.x=T)[,-c(1:2,6:7)];
##   X Y Z    C    D    E    F    G
## 1 1 2 g    a    a    a    a    a
## 2 2 1 h    a    a    a    a    a
## 3 1 4 j    e    e    f    f    e
## 4 3 4 i <NA> <NA> <NA> <NA> <NA>

Note that the use of pmin() and pmax() is only possible for this problem because you only have two key columns; if you had more, then you'd have to use some kind of apply+sort solution to achieve the canonical key order for merging, similar to what @Andrie does in his helper function, which would work for any number of key columns, but would be less performant.

查看更多
别忘想泡老子
3楼-- · 2020-07-11 07:32

The following works, but no doubt can be improved.

I first create a little helper function that performs a row-wise sort on A and B (and renames it to V1 and V2).

replace_index <- function(dat){
  x <- as.data.frame(t(sapply(seq_len(nrow(dat)), 
    function(i)sort(unlist(dat[i, 1:2])))))
  names(x) <- paste("V", seq_len(ncol(x)), sep="")
  data.frame(x, dat[, -(1:2), drop=FALSE])
} 

replace_index(df1)

  V1 V2 C D E F G
1  1  2 a a a a a
2  2  3 b b b c c
3  1  4 e e f f e

This means you can use a straight-forward merge to combine the data.

merge(replace_index(df1), replace_index(df2), all.y=TRUE)

  V1 V2    C    D    E    F    G Z
1  1  2    a    a    a    a    a g
2  1  2    a    a    a    a    a h
3  1  4    e    e    f    f    e j
4  3  4 <NA> <NA> <NA> <NA> <NA> i
查看更多
对你真心纯属浪费
4楼-- · 2020-07-11 07:43

One approach would be to create an id key for matching that is order invariant.

# create id key to match
require(plyr)
df1 = adply(df1, 1, transform, id = paste(min(A, B),  "-", max(A, B)))
df2 = adply(df2, 1, transform, id = paste(min(X, Y),  "-", max(X, Y)))

# combine data frames using `match`
cbind(df2, df1[match(df2$id, df1$id),3:7])

This produces the output

X Y Z    id    C    D    E    F    G
1   1 2 g 1 - 2    a    a    a    a    a
1.1 2 1 h 1 - 2    a    a    a    a    a
NA  3 4 i 3 - 4 <NA> <NA> <NA> <NA> <NA>
3   1 4 j 1 - 4    e    e    f    f    e
查看更多
爷、活的狠高调
5楼-- · 2020-07-11 07:43

You could also join the tables both ways (X == A and Y == B, then X == B and Y == A) and rbind them. This will produce duplicate pairs where one way yielded a match and the other yielded NA, so you would then reduce duplicates by slicing only a single row for each X-Y combination, the one without NA if one exists.

library(dplyr)
m <- left_join(df2,df1,by = c("X" = "A","Y" = "B"))
n <- left_join(df2,df1,by = c("Y" = "A","X" = "B"))

rbind(m,n) %>%
  group_by(X,Y) %>%
  arrange(C,D,E,F,G) %>% # sort to put NA rows on bottom of pairs
  slice(1) # take top row from combination

Produces:

Source: local data frame [4 x 8]
Groups: X, Y

  X Y Z  C  D  E  F  G
1 1 2 g  a  a  a  a  a
2 1 4 j  e  e  f  f  e
3 2 1 h  a  a  a  a  a
4 3 4 i NA NA NA NA NA
查看更多
\"骚年 ilove
6楼-- · 2020-07-11 07:54

This is slightly clunky, and has some potential collision and order issues but works with your example

df1a <- df1; df1a$A <- df1$B; df1a$B <- df1$A #reverse A and B
merge(df2, rbind(df1,df1a), by.x=c("X","Y"), by.y=c("A","B"), all.x=TRUE)

to produce

  X Y Z    C    D    E    F    G
1 1 2 g    a    a    a    a    a
2 1 4 j    e    e    f    f    e
3 2 1 h    a    a    a    a    a
4 3 4 i <NA> <NA> <NA> <NA> <NA>
查看更多
登录 后发表回答