I have a data frame with 5 million company names, many of which refer to the same company spelled in different ways or with misspellings. I use the company name "Amminex" as an example here and try to stringdist it against the 5 million company names:
Companylist <- data.frame(Companies=c('AMMINEX'))
Then I build a data frame from my big list of company names (already read into Biglist from a file):
Biglist <- data.frame(name = Biglist[, 1])
I pair AMMINEX with each of the 5 million companies:
Matches <- expand.grid(Companylist$Companies, Biglist$name)
Change the column names:
names(Matches) <- c("Companies","CompaniesList")
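On a toy candidate list (made-up stand-in names, not my real data), the expand.grid step produces one row per query/candidate pair:

```r
# Toy illustration of the pairing step: with a single query name,
# expand.grid simply yields one row per candidate in the big list.
queries    <- c("AMMINEX")
candidates <- c("AMMINEX AS", "AMINEX", "ACME")   # made-up stand-ins

pairs <- expand.grid(Companies = queries, CompaniesList = candidates,
                     stringsAsFactors = FALSE)
nrow(pairs)  # 3: one row per candidate
```

With 5 million candidates and one query this is already 5 million rows, so the pairing step itself is memory-heavy.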
I load the stringdist package and use stringdist with the cosine method:
library(stringdist)
Matches$dist <- stringdist(Matches$Companies, Matches$CompaniesList, method = "cosine")
I remove all distances above 0.2 to get rid of bad matches:
Matches_trimmed <- Matches[Matches$dist <= 0.2, ]
I sort by the distance column so the best matches appear at the top:
Matches_trimmed <- Matches_trimmed[order(Matches_trimmed$dist), ]
As you can see, the results are not very satisfactory: the first row is a good match, but then a bunch of bad matches appear before the good "AMMINEX AS" matches finally show up at the bottom.
This doesn't really work out for me. Is there any way to improve this fuzzy matching, or is there a different method that would give better results? Perhaps a method that takes into account the order in which the letters appear in the strings?
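To make that last point concrete, something like the Jaro-Winkler method ("jw") in stringdist seems to penalize reordered letters in a way that the cosine method does not (just a quick check on my example, not a method I have validated):

```r
library(stringdist)

# "jw" is sensitive to character order: the anagram that cosine treats
# as a perfect match now scores much worse than the genuine variant.
stringdist("AMMINEX", "XENIMMA",    method = "jw")  # clearly nonzero
stringdist("AMMINEX", "AMMINEX AS", method = "jw")  # small: close match
```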