I have a vector of about 8000 strings. Each element in the vector is a company name.
My Objective
My objective is to cluster these company names into groups, so that each cluster contains names that are similar to each other (for example, ROYAL DUTCH SHELL, SHELL USA, BMCC SHELL, etc. should all belong to the same group/cluster, since they are all Shell-based companies, i.e. they share the word 'Shell' in their names).
With a vector of this size, the distance calculations for the clustering technique I've employed take forever. On smaller vectors, however, the same method works well.

Let me demonstrate my approach on an example vector of company names that is much smaller than the original one. It looks something like this:
string=c("ROYAL DUTCH SHELL","Kremlin Prestige","Bateaux Mouches","Red Square Kremlin Inc","SHELL USA","KLM NEDERLAND","KLM GROUP","SHELL AUSTRALIA","BP TANGUH","LEROY MERLIN","SHELL AZERBAIJAN","BMCC SHELL",
"GAS PLANT BERLIN","SHELL AQUA MARINA","AUCHAN LEROY","GROUPE ROYAL TANGUH","klm hostel","SHELL","TANGUH TOWN","KPMG")
My Attempt
In order to tackle this problem, I used a hierarchical clustering method.
# load packages
pacman::p_load(stringdist, dplyr, tm, gplots)
But first, some prep work:
# Function to clean strings: lowercase, strip punctuation and extra whitespace
str_clean <- function(strings) {
  require(dplyr)
  require(tm)
  strings %>%
    tolower() %>%
    removePunctuation() %>%
    stripWhitespace() %>%
    trimws()
}
# Clean company names
clean_names = str_clean(string)
n = length(clean_names)
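As a sanity check, this is what the cleaning step does to a name (shown here as a self-contained base-R equivalent of `str_clean`, so it runs without `tm` or `dplyr`):

```r
# base-R sketch of the same normalization:
# lowercase, drop punctuation, squeeze whitespace, trim ends
clean_one <- function(s) {
  s <- tolower(s)
  s <- gsub("[[:punct:]]", "", s)
  s <- gsub("\\s+", " ", s)
  trimws(s)
}

clean_one("ROYAL  DUTCH-SHELL ")  # "royal dutchshell"
```

Note that removing punctuation joins hyphenated words rather than splitting them, which matters later when distances are computed on the cleaned strings.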
Now I calculate the pairwise distances between the cleaned names, to be used for clustering.
# Distance methods
methods <- c("lcs", "osa", "cosine")
q <- c(0, 0, 3)  # q-gram size; only used by the cosine method, ignored by lcs/osa
dist.methods <- list()
# create an n x n distance matrix for every pair of names, for each method
for (m in 1:length(methods)) {
  dist <- matrix(NA, ncol = n, nrow = n)  # initialize empty matrix
  for (i in 1:n) {
    for (j in 1:n) {
      dist[i, j] <- stringdist(clean_names[i], clean_names[j],
                               method = methods[m], q = q[m])
    }
  }
  dist.methods[[m]] <- dist
}
Once the distance calculations are done, I choose a method and set an appropriate cut-off
# hierarchical clustering on the cosine distances, cutting the tree at height 0.99
clusters <- hclust(as.dist(dist.methods[[3]]))
plot(clusters)
df <- data.frame(Companies = clean_names, Cluster = cutree(clusters, h = 0.99))
The resulting dataframe has all the company names categorized into clusters, just like I wanted.
df=df %>% group_by(Cluster)
However, as I mentioned, when I use my original vector of 8000 company names, the distance calculations take too long and I cannot proceed.
My Question
Is there a work-around for this method, when I am working with a larger vector of strings?
Maybe for larger vectors, clustering is not the right solution to this problem? In which case, what else could I do to achieve my result?
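For what it's worth, one direction I have been considering (a rough sketch only, untested on the full data) is to block the names by shared tokens first, so that pairwise distances only need to be computed within each block rather than across all 8000 names:

```r
# hypothetical blocking step: index names by the tokens they contain,
# so each distance matrix is computed per block instead of over all n names
names_clean <- c("royal dutch shell", "shell usa", "klm nederland",
                 "klm group", "kpmg")

tokens <- strsplit(names_clean, " ")

# map each token to the indices of the names containing it
block_index <- split(
  rep(seq_along(names_clean), lengths(tokens)),
  unlist(tokens)
)

block_index[["shell"]]  # 1 2 -> only these names would be compared to each other
```

I am not sure whether this is sound in general (names with no shared token would never be compared), so I would still welcome a better approach.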
Any help would be greatly appreciated.