I'm performing a cluster analysis on categorical data, so I'm using the k-modes approach.
My data comes from a preference survey: How do you like hair and eyes?
Each respondent picks one answer from a fixed (multiple-choice) set of 4 possibilities.
I therefore get the dummies, apply k-modes, attach the clusters back to the initial df and then plot them in 2D with PCA.
My code looks like:
import numpy as np
import pandas as pd
from kmodes import kmodes
df_dummy = pd.get_dummies(df)
# Convert to a NumPy array; the row index is left out so it is not treated as a feature
x = df_dummy.values
km = kmodes.KModes(n_clusters=3, init='Huang', n_init=5, verbose=0)
clusters = km.fit_predict(x)
df_dummy['clusters'] = clusters
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA(2)
# Reduce the dummy columns (the first 12 columns of df_dummy) to two dimensions with PCA
plot_columns = pca.fit_transform(df_dummy.iloc[:, 0:12])
# Plot based on the two dimensions, and shade by cluster label
plt.scatter(x=plot_columns[:,1], y=plot_columns[:,0], c=df_dummy["clusters"], s=30)
plt.show()
and I can visualize:
Now my problem is: can I somehow reveal the distinctive features of each cluster? I.e., what are the main characteristics (maybe blond hair and blue eyes) of the group of green dots in the scatter plot?
I understand that the clustering has happened, but I can't find a way to interpret what the clusters actually mean.
Should I play with the .labels_ object?
Take a look at km.cluster_centroids_. This will give the mode of each variable for each cluster.
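The same kind of per-cluster summary can also be computed from the labels alone, which may be easier to read because it uses the original column names and answers. The snippet below is a minimal sketch under stated assumptions: the toy `df` and the `clusters` array are made up for illustration and stand in for the original survey data and for the output of `km.fit_predict` (or `km.labels_`).

```python
import numpy as np
import pandas as pd

# Toy survey data (hypothetical), standing in for the original `df`.
df = pd.DataFrame({
    "hair": ["blond", "blond", "brown", "brown", "black", "black"],
    "eyes": ["blue", "blue", "brown", "brown", "green", "green"],
})

# Hypothetical cluster labels, standing in for km.labels_ / fit_predict output.
clusters = np.array([0, 0, 1, 1, 2, 2])

# Most frequent answer (mode) of each original column within each cluster.
cluster_modes = (
    df.assign(cluster=clusters)
      .groupby("cluster")
      .agg(lambda s: s.mode().iloc[0])
)
print(cluster_modes)

# Share of each hair answer within each cluster, to see how dominant the mode is.
hair_share = pd.crosstab(clusters, df["hair"], normalize="index")
print(hair_share)
```

If k-modes was fitted on the dummy columns, something like pd.DataFrame(km.cluster_centroids_, columns=df_dummy.columns) should label the centroids in the same way, provided the array passed to the model has exactly those columns in that order.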