Joining two datasets using fuzzy logic in R

2019-03-31 23:01发布

问题:

I’m trying to do a fuzzy logic join in R between two datasets:

  • first dataset has the name of a location and a column called config
  • second dataset has the name of a location and two additional attributes that need to be summarized before they are joined to the first dataset.

I would like to use the name column to join between the two datasets. However the name column may have additional or leading characters in either dataset or have one word contained inside of a larger word. So for example if we looked at these two datasets, I'd like the name OPAL to join to the OPALAS, and SAUSALITO Y to join to SAUSALITO.

Dataset1:    
     Name           Config
     ALTO D         BB
     CONTRA         ST
     EIGHT A        DD
     OPALAS         BB
     SAUSALITO Y    AA
     SOLANO J       ST

Dataset2:    
    Name       Age     Rank
    ALTO D     50      2
    ALTO D     20      6
    CONTRA     10      10
    CONTRA     15      15
    EIGHTH     18      21
    OPAL       19      4
    SAUSALITO  2       12
    SOLANO     34      43

Code for Summarizing of Dataset2

Data2a <-summaryBy(Age ~ Name,FUN=c(mean), data=Data2,na.rm=TRUE)
Data2b<-summaryBy(Rank ~ Name,FUN=c(sum), data=Data2,na.rm=TRUE)
Data2<-data.frame(Data2a$Name, Data2a$Age.mean, Data2b$Rank.sum)

Desired Outcome:
    Name        Config  Age   Rank
    ALTO D      BB      35    8
    CONTRA      ST      12.5  25
    EIGHT A     DD      18    21
    OPALAS      BB      19    4
    SAUSALITO Y AA      12    5
    SOLANO J    ST      34    43

回答1:

I was able to join the two datasets, using the fuzzylogic package:

library(fuzzyjoin)
stringdist_inner_join(Dataset1, Data2, by ="Name", distance_col = NULL)