R - Merging two data files based on partial matchi

2019-06-05 03:57发布

Here is my previous question reposted with R format.

I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.

For example, I have the following two data files:

File name: Employee Data (df1 in R)

       Full.Name Date.Started Orders
1  ANGELA MUIR     6/15/14 25     44
2  EILEEN COWIE      6/15/14      40
3  LAURA CUMMING     10/6/14      43
4    ELENA POPA       1/21/15     37
5 KAREN MACEWAN       3/15/99     39

File name: Assessment data (df2 in R)

           Candidate Leading.Factor SI.D SI.I
1         Angie muir              I   -3   12
2      Caroline Burn              S   -5   -3
3  Eileen Mary Cowie              S   -5    5
4         Elena Pope              C   -4    7
5     Henry LeFeuvre              C   -5   -1
6      Jennifer Ford              S   -3   -2
7       Karen McEwan              I   -4   10
8      Laura Cumming              S    0    6
9       Mandip Johal              C   -2    2
10   Mubarak Hussain              D    6   -1

I want to merge them based on names (Full Name in df1 and Candidate in df2) ignoring middle name (eg Eilen Cowie = Eileen Mary Cowie), extra spaces (Laura Cumming = Laura Cumming); misspells (e.g. Elena Popa = Elena Pope) etc.

The ideal output would look like this:

              Name      Full.Name         Candidate Date.Started Orders Leading.Factor SI.D SI.I
1    ANGELA MUIR    ANGELA MUIR          Angie muir   6/15/14 25     44              I   -3   12
2    EILEEN COWIE   EILEEN COWIE  Eileen Mary Cowie     6/15/14      40              S   -5    5
3    LAURA CUMMING  LAURA CUMMING     Laura Cumming     10/6/14      43              S    0    6
4      ELENA POPA     ELENA POPA         Elena Pope      1/21/15     37              C   -4    7
5   KAREN MACEWAN  KAREN MACEWAN       Karen McEwan      3/15/99     39              I   -4   10
6    Caroline Burn            N/A     Caroline Burn          N/A    N/A              S   -5   -3
7   Henry LeFeuvre            N/A    Henry LeFeuvre          N/A    N/A              C   -5   -1
8    Jennifer Ford            N/A     Jennifer Ford          N/A    N/A              S   -3   -2
9     Mandip Johal            N/A      Mandip Johal          N/A    N/A              C   -2    2
10 Mubarak Hussain            N/A   Mubarak Hussain          N/A    N/A              D    6   -1

Any suggestions would be greatly appreciated!

1条回答
闹够了就滚
2楼-- · 2019-06-05 04:57

For a first pass, I would suggest a two-stage process.

First, clean your strings. Normalize the casing, strip out extra spaces, strip out any unwanted characters. A function I use for a fairly aggressive cleaning is below:

stringCleaning <- function(x) {
#   x <- stringr::str_trim(x)
#   x <- tolower(x)
#   x <- gsub("\\s+", " ", x)
#   x <- gsub("[^[:space:]A-z0-9]", "", x)
  stringr::str_trim(tolower(gsub("\\s+", " ", gsub("[^[:space:]A-z0-9]", "", x))))
}

This converts strings to lowercase, strips out any non-alphanumeric or string characters, strips out extra spaces, and removes spaces on either side of the string.

Two, use Levenshtein (or edit) distances to find your closest matches. The stringdist package includes a simple distance calculator to help you.

stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother'))
min(stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother')))
which.min(stringdist::stringdist('your mother', c("bellow", "your mom", 'yourmother')))

You can use this function to find the most appropriate match in another dataframe.

df1 <- data.frame(name = c("Jena Stars", "Gina Starz"))
df2 <- data.frame(name = c("gina starz", "Jena starz  "))

df1$clean <- stringCleaning(df1$name)
df2$clean <- stringCleaning(df2$name)

df1$check <- df2$name[sapply(df1$clean, function(x) {
  which.min(stringdist::stringdist(x, df2$clean))
  })]
df1

A small example, but I hope it's helpful.

查看更多
登录 后发表回答