IPython Notebook kernel dies while running K-means

Asked 2019-06-21 16:13

I am running K-means clustering on about 400K observations with 12 variables. As soon as I run the cell with the KMeans code, a message pops up after about 2 minutes saying the kernel was interrupted and will restart, and then it hangs for ages as if the kernel had died and the code won't run anymore.

So I tried with 125K observations and the same number of variables, but I still got the same message.

What does that mean? Does it mean IPython Notebook is not able to run KMeans on 125K observations and kills the kernel?

How do I solve this? It is pretty important for me to get this done by today. :(

Please advise.

Code I used:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Initialize the clusterer with an n_clusters value and a random generator
# seed of 10 for reproducibility.
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100,
                random_state=10)
kmeans.fit(Data_sampled.iloc[:, 1:])   # .ix is deprecated; .iloc selects every column but the first
cluster_labels = kmeans.labels_

# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters.
silhouette_avg = silhouette_score(Data_sampled.iloc[:, 1:], cluster_labels)

1 Answer
男人必须洒脱
Answered 2019-06-21 16:34

From some investigation, this likely has nothing to do with IPython Notebook / Jupyter. It appears to be an issue with sklearn that traces back to an issue with numpy. See the related sklearn GitHub issues here and here, and the underlying numpy issue here.

Ultimately, calculating the silhouette score requires computing a very large pairwise distance matrix, and for large numbers of rows that matrix takes up too much memory on your system. For instance, look at the memory pressure on my system (OS X, 8 GB RAM) during two runs of a similar calculation: the first spike is a silhouette score calculation with 10k records, and the longer plateau that follows is with 40k records:

[Image: memory pressure graph for the two runs]
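
To see why this blows up, here is a rough back-of-the-envelope sketch (my own illustrative numbers, not measurements from the runs above) of how large a dense float64 n x n distance matrix gets as the row count grows:

    # Rough size of a dense n x n matrix of 8-byte floats; the actual peak
    # memory inside sklearn may differ, but the growth is quadratic either way.
    def distance_matrix_gib(n_rows, bytes_per_value=8):
        return n_rows * n_rows * bytes_per_value / 1024 ** 3

    for n in (10_000, 40_000, 125_000, 400_000):
        print(f"{n:>7} rows -> ~{distance_matrix_gib(n):,.1f} GiB")

    # Approximate output:
    #   10000 rows -> ~0.7 GiB
    #   40000 rows -> ~11.9 GiB
    #  125000 rows -> ~116.4 GiB
    #  400000 rows -> ~1,192.1 GiB

Even the 40k-row case is already larger than the 8 GB of RAM on my machine, which matches the plateau in the graph above.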

Per the related SO answer here, your kernel process is probably getting killed by the OS because it is taking too much memory.

Ultimately, this is going to require some fixes in the underlying codebase for sklearn and/or numpy. Some options that you can try in the interim:

  • close every extraneous program running on your computer (Spotify, Slack, etc.), hope that frees up enough memory, and monitor memory closely while your script is running
  • run the calculation on a temporary remote server with more RAM than your machine has and see if that helps (although since the memory use grows roughly quadratically with the number of samples, even that may not be enough)
  • fit your clusterer on your full data set, but then calculate silhouette scores on a random subset of your data (most people seem to be able to get this working with 20-30k observations); a minimal sketch follows this list
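
As a sketch of that last option (assuming the Data_sampled DataFrame from the question), silhouette_score's sample_size parameter lets you fit on everything but score the silhouette on a random subsample:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = Data_sampled.iloc[:, 1:]   # same column selection as in the question

    # Fit the clusterer on the full data set.
    kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100,
                    random_state=10)
    cluster_labels = kmeans.fit_predict(X)

    # Score on ~20k randomly sampled rows so the pairwise distance matrix
    # stays manageable; sample_size does the subsampling for you.
    silhouette_avg = silhouette_score(X, cluster_labels,
                                      sample_size=20000, random_state=10)
    print(silhouette_avg)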

Or, if you're smarter than me and have some free time, consider contributing a fix to sklearn and/or numpy :)
