I am running K-means clustering on some 400K observations with 12 variables. Initially, as soon as I run the cell with the KMeans code, a message pops up after about 2 minutes saying the kernel was interrupted and will restart, and then the kernel seems to hang as if it had died and the code won't run anymore.
So I tried with 125K observations and the same number of variables, but I still got the same message.
What does that message mean? Does it mean the IPython notebook is not able to run k-means on 125K observations and kills the kernel?
How can I solve this? I really need to get this done today. :(
Please advise.
Code I used:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100)
kmeans.fit(Data_sampled.iloc[:, 1:])  # .iloc instead of the removed .ix indexer
cluster_labels = kmeans.labels_
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(Data_sampled.iloc[:, 1:], cluster_labels)
From some investigation, this likely has nothing to do with IPython Notebook / Jupyter. It seems to be an issue with sklearn, which traces back to an issue with numpy. See the related sklearn GitHub issues here and here, and the underlying numpy issue here.
Ultimately, calculating the silhouette score requires computing a very large pairwise distance matrix, and for large numbers of rows that matrix takes up too much memory on your system: with 125K rows, a dense float64 distance matrix alone is roughly 125,000 × 125,000 × 8 bytes, i.e. about 125 GB. For instance, look at memory pressure on my system (OS X, 8 GB RAM) during two runs of a similar calculation: the first spike is a silhouette score calculation with 10k records; the second, longer plateau is with 40k records.
Per the related SO answer here, your kernel process is probably getting killed by the OS because it is taking too much memory.
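If you want to sanity-check this before running anything, here is a minimal back-of-the-envelope sketch (pure arithmetic, assuming a dense float64 distance matrix, which is what the pairwise computation produces here):

import numpy as np

def distance_matrix_gb(n_rows, dtype=np.float64):
    # A full pairwise distance matrix holds n_rows * n_rows entries.
    return n_rows * n_rows * np.dtype(dtype).itemsize / 1e9

for n in (10_000, 40_000, 125_000, 400_000):
    print(f"{n:>7,} rows -> ~{distance_matrix_gb(n):,.1f} GB")

On an 8 GB machine even the 40k case (~12.8 GB) exceeds physical RAM, which matches the plateau described above; 125K rows needs ~125 GB, and 400K rows ~1,280 GB.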
Ultimately, this is going to require some fixes in the underlying codebase for sklearn and/or numpy. Some options that you can try in the interim:
- Run the calculation on a machine with considerably more memory.
- Score a random subsample of your data rather than the full dataset; silhouette_score takes a sample_size argument for exactly this (see the sketch below).
Or, if you're smarter than me and have some free time, consider trying to contribute a fix to sklearn and/or numpy :)
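A minimal sketch of the subsampling option, assuming Data_sampled is the DataFrame from the question and that sample_size=10000 is an arbitrary starting point you should tune:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = Data_sampled.iloc[:, 1:]

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100)
cluster_labels = kmeans.fit_predict(features)

# With sample_size set, silhouette_score computes the score on a random
# subset of rows instead of building the full n x n distance matrix;
# random_state makes the subsample reproducible.
silhouette_avg = silhouette_score(features, cluster_labels,
                                  sample_size=10000, random_state=10)
print(silhouette_avg)

Because the score is computed on a subset, it is an estimate of the full-data silhouette score and will vary between runs unless random_state is fixed.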