Create a unique ID by fuzzy matching of names (via

2019-02-06 20:28发布

问题:

Using R, I am trying match on people's names in a dataset structured by year and city. Due to some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy match names.

A sample chunk of the dataset is structured as follows:

df <- data.frame(matrix( c("1200013","1200013","1200013","1200013","1200013","1200013","1200013","1200013",                             "1996","1996","1996","1996","2000","2000","2004","2004","AGUSTINHO FORTUNATO FILHO","ANTONIO PEREIRA NETO","FERNANDO JOSE DA COSTA","PAULO CEZAR FERREIRA DE ARAUJO","PAULO CESAR FERREIRA DE ARAUJO","SEBASTIAO BOCALOM RODRIGUES","JOAO DE ALMEIDA","PAULO CESAR FERREIRA DE ARAUJO"), ncol=3,dimnames=list(seq(1:8),c("citycode","year","candidate")) ))

The neat version:

  citycode year                      candidate
1  1200013 1996      AGUSTINHO FORTUNATO FILHO
2  1200013 1996           ANTONIO PEREIRA NETO
3  1200013 1996         FERNANDO JOSE DA COSTA
4  1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5  1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6  1200013 2000    SEBASTIAO BOCALOM RODRIGUES
7  1200013 2004                JOAO DE ALMEIDA
8  1200013 2004 PAULO CESAR FERREIRA DE ARAUJO

I'd like to check in each city separately, whether there are candidates appearing in several years. E.g. in the example,

PAULO CEZAR FERREIRA DE ARAUJO

PAULO CESAR FERREIRA DE ARAUJO

appears twice (with a spelling mistake). Each candidate across the entire data set should be assigned a unique numeric candidate ID. The dataset is fairly large (5500 cities, approx. 100K entries) so a somewhat efficient coding would be helpful. Any suggestions as to how to implement this?

EDIT: Here is my attempt (with help from the comments thus far) that is very slow (inefficient) in achieving the task at hand. Any suggestions as to improvements to this?

f <- function(x) {matches <- lapply(levels(x), agrep, x=levels(x),fixed=TRUE, value=FALSE)
                  levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
                  x
                }

temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)

EDIT 2: Now running at good speed. Problem was the comparison to many factors at every step (Thanks for pointing that out, Blue Magister). Reducing the comparison to only the candidates in one group (i.e. a city) runs the command in 5 seconds for 80,000 lines - a speed I can live with.

df$candidate <- as.character(df$candidate)

f <- function(x) {x <- as.factor(x)
                  matches <- lapply(levels(x), agrep, x=levels(x),fixed=TRUE, value=FALSE)
                  levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
                  as.character(x)
                }

temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)

回答1:

Here's my shot at it. It's probably not very efficient, but I think it will get the job done. I assume that df$candidates is of class factor.

#fuzzy matches candidate names to other candidate names
#compares each pair of names only once
##by looking at names that have a greater index
matches <- unlist(lapply(1:(length(levels(df[["candidate"]]))-1),
    function(x) {max(x,x + agrep(
        pattern=levels(df[["candidate"]])[x], 
        x=levels(df[["candidate"]])[-seq_len(x)]
    ))}
))
#assigns new levels (omits the last level because that doesn't change)
levels(df[["candidate"]])[-length(levels(df[["candidate"]]))] <- 
    levels(df[["candidate"]])[matches]


回答2:

Ok, given that the focus is on the efficiency, I'd suggest the following.

First, note that in order of efficiency from first principles we could predict that exact matching will be much faster than grep which will be faster than fuzzy grep. So exact match, then fuzzy grep for the remaining observations.

Second, vectorize and avoid loops. The apply commands aren't necessarily faster, so stick to native vectorization if you can. All the grep commands are natively vectorized, but it's going to be hard to avoid a *ply or loop to compare each element to the vector of others to match to.

Third, use outside information to narrow the problem down. Do fuzzy matching on names only within each city or state, which will dramatically reduce the number of comparisons which must be made, for instance.

You can combine the first and third principles: You might even try exact matching on the first character of each string, then fuzzy matching within that.