Using R, I am trying match on people's names in a dataset structured by year and city. Due to some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy match names.
A sample chunk of the dataset is structured as follows:
df <- data.frame(matrix( c("1200013","1200013","1200013","1200013","1200013","1200013","1200013","1200013", "1996","1996","1996","1996","2000","2000","2004","2004","AGUSTINHO FORTUNATO FILHO","ANTONIO PEREIRA NETO","FERNANDO JOSE DA COSTA","PAULO CEZAR FERREIRA DE ARAUJO","PAULO CESAR FERREIRA DE ARAUJO","SEBASTIAO BOCALOM RODRIGUES","JOAO DE ALMEIDA","PAULO CESAR FERREIRA DE ARAUJO"), ncol=3,dimnames=list(seq(1:8),c("citycode","year","candidate")) ))
The neat version:
citycode year candidate
1 1200013 1996 AGUSTINHO FORTUNATO FILHO
2 1200013 1996 ANTONIO PEREIRA NETO
3 1200013 1996 FERNANDO JOSE DA COSTA
4 1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5 1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6 1200013 2000 SEBASTIAO BOCALOM RODRIGUES
7 1200013 2004 JOAO DE ALMEIDA
8 1200013 2004 PAULO CESAR FERREIRA DE ARAUJO
I'd like to check in each city separately, whether there are candidates appearing in several years. E.g. in the example,
PAULO CEZAR FERREIRA DE ARAUJO
PAULO CESAR FERREIRA DE ARAUJO
appears twice (with a spelling mistake). Each candidate across the entire data set should be assigned a unique numeric candidate ID. The dataset is fairly large (5500 cities, approx. 100K entries) so a somewhat efficient coding would be helpful. Any suggestions as to how to implement this?
EDIT: Here is my attempt (with help from the comments thus far) that is very slow (inefficient) in achieving the task at hand. Any suggestions as to improvements to this?
f <- function(x) {matches <- lapply(levels(x), agrep, x=levels(x),fixed=TRUE, value=FALSE)
levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
x
}
temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)
EDIT 2: Now running at good speed. Problem was the comparison to many factors at every step (Thanks for pointing that out, Blue Magister). Reducing the comparison to only the candidates in one group (i.e. a city) runs the command in 5 seconds for 80,000 lines - a speed I can live with.
df$candidate <- as.character(df$candidate)
f <- function(x) {x <- as.factor(x)
matches <- lapply(levels(x), agrep, x=levels(x),fixed=TRUE, value=FALSE)
levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
as.character(x)
}
temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)
Ok, given that the focus is on the efficiency, I'd suggest the following.
First, note that in order of efficiency from first principles we could predict that exact matching will be much faster than grep which will be faster than fuzzy grep. So exact match, then fuzzy grep for the remaining observations.
Second, vectorize and avoid loops. The
apply
commands aren't necessarily faster, so stick to native vectorization if you can. All thegrep
commands are natively vectorized, but it's going to be hard to avoid a*ply
or loop to compare each element to the vector of others to match to.Third, use outside information to narrow the problem down. Do fuzzy matching on names only within each city or state, which will dramatically reduce the number of comparisons which must be made, for instance.
You can combine the first and third principles: You might even try exact matching on the first character of each string, then fuzzy matching within that.
Here's my shot at it. It's probably not very efficient, but I think it will get the job done. I assume that
df$candidates
is of class factor.