I have a data frame with 5 million company names, many of which refer to the same company spelled in different ways or with misspellings. I use the company name "Amminex" as an example here and try to stringdist it against the 5 million company names:
Companylist <- data.frame(Companies=c('AMMINEX'))
Then I build a data frame from my big list of company names (already read into Biglist from a file):
Biglist <- data.frame(name = Biglist[, 1])
I pair AMMINEX with each of the 5 million companies:
Matches <- expand.grid(Companylist$Companies, Biglist$name)
Change the column names:
names(Matches) <- c("Companies","CompaniesList")
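On a toy candidate list (made-up stand-in names, not my real data), the expand.grid step produces one row per query/candidate pair:

```r
# Toy illustration of the pairing step: with a single query name,
# expand.grid simply yields one row per candidate in the big list.
queries    <- c("AMMINEX")
candidates <- c("AMMINEX AS", "AMINEX", "ACME")   # made-up stand-ins

pairs <- expand.grid(Companies = queries, CompaniesList = candidates,
                     stringsAsFactors = FALSE)
nrow(pairs)  # 3: one row per candidate
```

With 5 million candidates and one query this is already 5 million rows, so the pairing step itself is memory-heavy.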
I load the stringdist package and use stringdist with the cosine method:
library(stringdist)
Matches$dist <- stringdist(Matches$Companies, Matches$CompaniesList, method = "cosine")
I remove all distances above 0.2 to get rid of bad matches:
Matches_trimmed <- Matches[Matches$dist <= 0.2, ]
I sort by the distance column so the best matches appear at the top:
Matches_trimmed <- Matches_trimmed[order(Matches_trimmed$dist), ]
As you can see, the results are not very satisfactory: the first row is a good match, but then a bunch of bad matches appear before the good "AMMINEX AS" matches finally show up at the bottom.
This doesn't really work out for me. Is there any way to improve this fuzzy matching, or is there a different method that would give better results? Perhaps a method that takes into account the order in which the letters appear in the strings?
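To make that last point concrete, something like the Jaro-Winkler method ("jw") in stringdist seems to penalize reordered letters in a way that the cosine method does not (just a quick check on my example, not a method I have validated):

```r
library(stringdist)

# "jw" is sensitive to character order: the anagram that cosine treats
# as a perfect match now scores much worse than the genuine variant.
stringdist("AMMINEX", "XENIMMA",    method = "jw")  # clearly nonzero
stringdist("AMMINEX", "AMMINEX AS", method = "jw")  # small: close match
```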