I'm clustering data with the K-means method, and I would like to measure the performance of my clustering.
I'm not an expert but I am eager to learn more about clustering.
Here is my code:
import pandas as pd
from sklearn import datasets
#loading the dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data)
#K-Means
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df) #K-means training
y_pred = k_means.predict(df)
#We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Species']
#we merge this dataframe with df
prediction = pd.concat([df,pred], axis = 1)
#We store the clusters
clus0 = prediction.loc[prediction.Species == 0]
clus1 = prediction.loc[prediction.Species == 1]
clus2 = prediction.loc[prediction.Species == 2]
#we drop the label column so that it does not distort the distances
k_list = [clus.drop(columns='Species').values for clus in (clus0, clus1, clus2)]
Now that I have my KMeans model and my three clusters stored, I'm trying to use the Dunn Index to measure the performance of my clustering (a higher index is better).
For that purpose I import the jqm_cvi package (available here):
from jqmcvi import base
base.dunn(k_list)
My question is: does any internal clustering evaluation metric already exist in scikit-learn (apart from silhouette_score)? Or in another well-known library?
Thank you for your time
Clustering is normally considered an unsupervised method, so it is difficult to establish a good performance metric (as also suggested in the previous comments).
Nevertheless, much useful information can be extracted from these algorithms (e.g. k-means). The problem is how to assign a meaning to each cluster, and thus measure the "performance" of your algorithm. In many cases, a good way to proceed is to visualize your clusters. Of course, if your data have high-dimensional features, as is often the case, visualization is not that easy. Let me suggest two ways to go, using k-means and another clustering algorithm.
K-means: in this case, you can reduce the dimensionality of your data using, for example, PCA. With such an algorithm, you can plot the data in 2D and then visualize your clusters. However, what you see in this plot is a 2D projection of your data, so it may not be very accurate, but it can still give you an idea of how your clusters are distributed.
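As a rough illustration of the PCA idea, here is a minimal sketch on the Iris data. The helper name `plot_clusters_2d` and the fixed `random_state` are my own choices for the example, not part of any library, and the clustering is still done in the original feature space; PCA is used only for display.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

# Cluster in the original 4-dimensional feature space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project onto the first two principal components, for display only.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()


def plot_clusters_2d(coords, labels, explained):
    """Scatter the 2D projection, coloured by cluster label (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for the plot

    plt.scatter(coords[:, 0], coords[:, 1], c=labels)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title(f"K-means clusters (2 components keep {explained:.0%} of the variance)")
    plt.show()
```

Calling `plot_clusters_2d(coords, labels, explained)` draws the scatter plot. The share of variance kept by the two components gives a hint of how much the 2D picture can be trusted.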
Self-organizing map (SOM): this is a clustering algorithm based on neural networks that creates a discretized representation of the input space of the training samples, called a map, and is therefore also a method for dimensionality reduction. You can find a very nice Python package called somoclu, which implements this algorithm and offers an easy way to visualize the result. This algorithm is also very good for clustering because it does not require choosing the number of clusters a priori (in k-means you need to choose k; here you do not).
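As you said, scikit-learn's choice of internal measures is limited: it provides the Silhouette Coefficient and the Calinski-Harabasz Index (newer releases also add the Davies-Bouldin Index). For the Dunn index you may use either this or this link.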
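To make this concrete, here is a small sketch that computes the internal measures scikit-learn ships in `sklearn.metrics` (the names `calinski_harabasz_score` and `davies_bouldin_score` assume scikit-learn 0.20 or later). Since scikit-learn has no Dunn index, `dunn_index` below is a hand-written helper following the usual definition (smallest distance between clusters divided by the largest cluster diameter); it is my own illustration and not the implementation of any package mentioned here.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def dunn_index(X, labels):
    """Smallest between-cluster distance / largest within-cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Diameter: largest pairwise distance inside a cluster (0 for singletons).
    max_diameter = max(pdist(c).max() if len(c) > 1 else 0.0 for c in clusters)
    # Separation: smallest distance between points of two different clusters.
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter


X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)          # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)    # higher is better
db = davies_bouldin_score(X, labels)       # lower is better
dunn = dunn_index(X, labels)               # higher is better

print(f"silhouette={sil:.3f}  calinski_harabasz={ch:.1f}  "
      f"davies_bouldin={db:.3f}  dunn={dunn:.3f}")
```

Note that this Dunn helper compares all pairs of points, so it is only practical for small datasets.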
Apart from the Silhouette Score, the Elbow Criterion can be used to evaluate K-Means clustering. It is not available as a function/method in scikit-learn; to apply it, we need to calculate the SSE for a range of k values.
The idea of the Elbow Criterion method is to choose the k (number of clusters) at which the SSE decreases abruptly. The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid.
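The SSE definition can be checked directly. The sketch below (Iris data, with `k=3` and `random_state=0` chosen only for the example) computes the sum of squared distances by hand and compares it with the `inertia_` attribute of a fitted `KMeans`, which holds the same quantity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroid assigned to each sample, then the squared Euclidean distance to it.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
sse = ((X - assigned_centroids) ** 2).sum()

print(sse, kmeans.inertia_)  # the two values should agree up to rounding
```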
Calculate the Sum of Squared Errors (SSE) for each value of k, where k is the number of clusters, and plot the line graph. SSE tends to decrease toward 0 as we increase k (SSE = 0 when k equals the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster).
So the goal is to choose a small value of k that still has a low SSE; the elbow usually marks the point where increasing k starts to give diminishing returns.
Iris dataset example:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris['feature_names'])
data = X[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]
sse = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    # The labels (kmeans.labels_) are deliberately not written back into `data`:
    # an extra "clusters" column would be used as a feature in the next fit.
    sse[k] = kmeans.inertia_ # Inertia: sum of squared distances of samples to their closest cluster center
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
If the line graph looks like an arm, the "elbow" of the arm (the point where the curve bends sharply) is the optimal value of k (number of clusters). For the line graph of this example, the elbow is at k = 3, so the optimal number of clusters is 3.
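The elbow can also be read from the numbers rather than from the plot. The following is only a sketch: it uses all four Iris features and a fixed `random_state`, and the "relative drop" measure is my own illustration of diminishing returns, not a standard scikit-learn function.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

sse = {}
for k in range(1, 10):
    sse[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

# Share of the previous SSE removed by adding one more cluster.
relative_drop = {k: 1 - sse[k] / sse[k - 1] for k in range(2, 10)}

for k, drop in relative_drop.items():
    print(f"k={k}: SSE={sse[k]:.1f}, drop vs k-1: {drop:.0%}")
```

Large drops followed by consistently small ones indicate where the elbow lies.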
Note: the Elbow Criterion is heuristic in nature and may not work for your data set. Follow your intuition according to the dataset and the problem you are trying to solve.
Hope it helps!