Find matching groups of strings in R

Published 2019-08-08 19:23

I have a vector of about 8000 strings. Each element in the vector is a company name.

My Objective

My objective is to cluster these company names into groups, so that each cluster contains a group of company names that are similar to each other (For example: ROYAL DUTCH SHELL, SHELL USA, BMCC SHELL etc... will belong to the same group/cluster, as they are all Shell-based companies i.e. they have the word 'Shell' in their names).

When dealing with a vector of this size, it seems to be taking forever to find groups of similar company names using the clustering technique I've employed. However on smaller vectors, this method works well.

Let me demonstrate my approach using an example vector of company names, which is much smaller than the original one. With a small vector of strings, this approach works very well.

The vector looks something like this:

string=c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
     "GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")

My Attempt

In order to tackle this problem, I used a hierarchical clustering method.

# load packages
pacman::p_load(stringdist, dplyr, tm, gplots)

But some prep work first

# Function to clean strings: lowercase, remove punctuation, collapse and trim whitespace
str_clean <- function(strings) {
  require(dplyr)
  require(tm)
  strings %>% tolower() %>% removePunctuation() %>% stripWhitespace() %>% 
    trimws()
}

# Clean company names
clean_names = str_clean(string)

n = length(clean_names)

Now to calculate the distances between words, to be used for clustering

# Distance methods
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3)  #size of q-gram

dist.methods <- list()

# create an n x n distance matrix over every pair of names, for each method
for (m in 1:length(methods)) {
  dist <- matrix(NA, ncol = n, nrow = n)  # initialize empty matrix
  for (i in 1:n) {
    for (j in 1:n) {
      dist[i, j] <- stringdist(clean_names[i], clean_names[j],
                               method = methods[m], q = q[m])
    }
  }
  dist.methods[[m]] <- dist
}
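For intuition about what each metric returns, here is a quick spot check on two of the cleaned names (illustrative only; the q-gram size is used only by the cosine method, which is why q is 0 for the other two):

```r
library(stringdist)

# Pairwise distances between two cleaned names under each method;
# q is ignored by "lcs" and "osa" and only used by "cosine"
stringdist("royal dutch shell", "shell usa", method = "lcs")
stringdist("royal dutch shell", "shell usa", method = "osa")
stringdist("royal dutch shell", "shell usa", method = "cosine", q = 3)
```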

Once the distance calculations are done, I choose a method and set an appropriate cut-off

# hierarchical clustering on the cosine distance matrix (method 3),
# cutting the tree at height 0.99
clusters <- hclust(as.dist(dist.methods[[3]]))
plot(clusters)
df <- data.frame(Companies = clean_names, Cluster = cutree(clusters, h = 0.99))

The resulting dataframe has all the company names categorized into clusters, just like I wanted.

df <- df %>% group_by(Cluster)
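To sanity-check the grouping, one can pull out a single cluster, e.g. everything that landed in the same cluster as the bare name "shell" (a sketch, assuming the `df` built above with `Companies` and `Cluster` columns):

```r
library(dplyr)

# All companies that fall into the same cluster as the cleaned name "shell"
shell_cluster <- df$Cluster[df$Companies == "shell"]
df %>% filter(Cluster == shell_cluster)
```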

However, like I mentioned, when I use my original vector of 8000 company names, the distance calculations take too long and I cannot proceed.

My Question

Is there a work-around for this method, when I am working with a larger vector of strings?

Maybe for larger vectors, clustering is not the right solution to this problem? In which case, what else could I do to achieve my result?

Any help would be greatly appreciated.

1 Answer

霸刀☆藐视天下
2019-08-08 19:34

Get rid of the two inner for loops; that is what is slowing you down. Use stringdistmatrix instead: your vector is long but the individual strings are short, so it is fast, as the benchmarks at the bottom show.

library(stringdist)

strings <- c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
         "GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")
stringsBig <- rep(strings, 500)    
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3)  #size of q-gram    
dist.methods <- list()

# create distance matrix for every pair of listing, for each method
for (m in 1:length(methods)) {
  dist.methods[[m]] <- stringdistmatrix(stringsBig, method = methods[[m]], q = q[[m]])
}
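To plug this back into the clustering step: stringdistmatrix() called on a single vector already returns a dist object, so it can go straight into hclust() with no as.dist() conversion. A minimal sketch, reusing the cut-off height from the question:

```r
library(stringdist)

# stringdistmatrix() on one vector returns a lower-triangular "dist" object
d <- stringdistmatrix(strings, method = "cosine", q = 3)
clusters <- hclust(d)
df <- data.frame(Companies = strings, Cluster = cutree(clusters, h = 0.99))
```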

microbenchmark::microbenchmark(stringdistmatrix(strings),
                               for (i in 1:length(strings)) {
                                 for (j in 1:length(strings)) {
                                   stringdist(strings[i], strings[j])
                                 }
                               }, times = 100)

# Unit: microseconds
# expr                          min         lq       mean     median        uq       max neval cld
# stringdistmatrix(strings) 105.212   131.2805   241.9271   251.2235   279.634  2909.624   100  a 
# for loop                36147.878 38165.8480 40411.9772 39527.5500 42170.895 54151.457   100   b

microbenchmark::microbenchmark(stringdistmatrix(stringsBig), times=10)
# Unit: seconds
# expr    min       lq    mean   median       uq      max neval
# stringdistmatrix(stringsBig) 1.5324 1.585354 1.66592 1.655901 1.691157 1.825333    10
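As a rough feasibility check for the original 8000-name vector: the dist object holds n(n-1)/2 distances, which is still comfortably in-memory at that scale (back-of-the-envelope arithmetic, assuming 8-byte doubles):

```r
# Size of the pairwise-distance object at the question's scale
n <- 8000
n_pairs <- n * (n - 1) / 2   # 31,996,000 pairwise distances
n_pairs * 8 / 1024^2         # roughly 244 MB stored as doubles
```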