I've been playing with the below script:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import textract
import os

folder_to_scan = '/media/sf_Documents/clustering'
dict_of_docs = {}

# Gets all the files to scan with textract
for root, sub, files in os.walk(folder_to_scan):
    for file in files:
        full_path = os.path.join(root, file)
        print(f'Processing {file}')
        try:
            text = textract.process(full_path)
            dict_of_docs[file] = text
        except Exception as e:
            print(e)

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dict_of_docs.values())

true_k = 3
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
It scans a folder of images that are scanned documents, extracts the text, then clusters the text. I know for a fact there are 3 different types of documents, so I set true_k to 3. But what if I had a folder of unknown documents where there could be anything from 1 to hundreds of different document types?
This is a slippery area because it is very difficult to measure how "good" your clustering is without any ground-truth labels. In order to make an automatic selection, you need a metric that compares how KMeans performs for different values of n_clusters. A popular choice is the silhouette score; you can find more details about it in the scikit-learn documentation. Note that the silhouette score is only defined for n_clusters >= 2, which might be a limitation for you given your problem description, unfortunately.

This is how you would use it on a dummy data set (you can then adapt it to your code; it is just to have a reproducible example):
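A minimal sketch of that example, assuming the Iris dataset from sklearn.datasets as the dummy data and a search range of 2 to 9 clusters (both of these are assumptions, not part of your script):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data  # dummy data set: 150 samples, 4 features

best_n_clusters = None
best_score = -1  # silhouette scores lie in [-1, 1]

# The silhouette score is only defined for 2 <= n_clusters <= n_samples - 1
for n_clusters in range(2, 10):
    labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f'n_clusters = {n_clusters}: silhouette score = {score:.3f}')
    if score > best_score:
        best_score = score
        best_n_clusters = n_clusters

print(f'best_n_clusters = {best_n_clusters}')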
Running this prints the silhouette score for each candidate number of clusters; the highest score is obtained for two clusters, and thus you will have best_n_clusters = 2 (NB: in reality, Iris has three classes...).
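To adapt this to your script, you could run the same loop over your TF-IDF matrix instead of setting true_k by hand. A rough sketch, where the search range of 2 to 19 is an arbitrary guess and X is assumed to be the matrix returned by TfidfVectorizer in your code:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_n_clusters, best_score = None, -1
# upper bound is a guess; it must stay below the number of documents
for n_clusters in range(2, 20):
    model = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    score = silhouette_score(X, labels)  # accepts the sparse TF-IDF matrix
    if score > best_score:
        best_score, best_n_clusters = score, n_clusters

# refit the final model with the selected number of clusters
model = KMeans(n_clusters=best_n_clusters, init='k-means++', max_iter=100, n_init=10, random_state=42)
model.fit(X)

For text data you may also want to pass metric='cosine' to silhouette_score, since cosine distance often suits TF-IDF vectors better than the default Euclidean distance.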