How to assign a new observation to existing KMeans clusters

Posted 2019-03-04 04:32

Question:

I used the code below to create k-means clusters with scikit-learn.

kmean = KMeans(n_clusters=nclusters,n_jobs=-1,random_state=2376,max_iter=1000,n_init=1000,algorithm='full',init='k-means++')

kmean_fit = kmean.fit(clus_data)

I have also saved the centroids from kmean_fit.cluster_centers_.

I then pickled the k-means object.

filename = pickle_path+'\\'+'_kmean_fit.sav'
pickle.dump(kmean_fit, open(filename, 'wb'))

I did this so that I can load the same pickled k-means object and apply it to new data as it arrives, using kmean_fit.predict().

Questions:

  1. Will loading the pickled k-means object and calling kmean_fit.predict() assign a new observation to the existing clusters based on their centroids, or does this approach recluster from scratch on the new data?

  2. If this method won't work, how can I assign a new observation to the existing clusters with efficient Python code, given that I have already saved the cluster centroids?

PS: I know that building a classifier with the existing clusters as the dependent variable is another option, but I don't want to do that because of a time crunch.

Answer 1:

Yes. Pickling does not change this: if you un-pickle the sklearn.cluster.KMeans object correctly, you are working with the same fitted object, and its predict method assigns each new observation to the nearest of the existing centroids (cluster_centers_). It does not recluster; only fit recomputes the centroids.

An example:

from sklearn.cluster import KMeans
# sklearn.externals.joblib was removed in newer scikit-learn releases; import joblib directly
import joblib

model = KMeans(n_clusters = 2, random_state = 100)
X = [[0,0,1,0], [1,0,0,1], [0,0,0,1],[1,1,1,0],[0,0,0,0]]
model.fit(X)

Out:

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=100, tol=0.0001,
    verbose=0)

Continue:

joblib.dump(model, 'model.pkl')  
model_loaded = joblib.load('model.pkl')

model_loaded

Out:

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=100, tol=0.0001,
    verbose=0)

See how the n_clusters and random_state parameters are the same between the model and model_loaded objects? You're good to go.
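Matching constructor parameters is only a weak check, because an unfitted model would print the same way. A stricter test is to compare the fitted centroids and the labels before and after the round trip. The following is a minimal sketch under that assumption; the temporary directory and the names centers_match and labels_match are illustrative and not part of the original answer.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans

# Same toy data as in the answer above.
X = np.array(
    [[0, 0, 1, 0], [1, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0]],
    dtype=float,
)
model = KMeans(n_clusters=2, n_init=10, random_state=100).fit(X)

# Round-trip the fitted model through joblib in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)
    model_loaded = joblib.load(path)

# The fitted state, not just the parameters, should survive the round trip.
centers_match = np.array_equal(model.cluster_centers_, model_loaded.cluster_centers_)
labels_match = np.array_equal(model.predict(X), model_loaded.predict(X))
print(centers_match, labels_match)
```

If both values print as True, the loaded object carries the same centroids as the original and assigns the training points to the same clusters.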

Predict with the "new" model:

# predict expects a 2D array: one row per observation
model_loaded.predict([[0,0,0,0]])

Out:

array([0])
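On the second question: if only the saved centroids are available, the same assignment can be reproduced with NumPy by picking the nearest centroid for each new point. The sketch below assumes Euclidean distance, which is what KMeans uses. The helper assign_to_nearest_centroid and the two-blob toy data are hypothetical and chosen so that no point sits on a tie between centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical, well-separated toy data (two blobs).
X = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
model = KMeans(n_clusters=2, n_init=10, random_state=100).fit(X)

# Stand-in for centroids that were saved earlier from cluster_centers_.
centroids = model.cluster_centers_.copy()


def assign_to_nearest_centroid(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    # Pairwise squared Euclidean distances, shape (n_points, n_clusters).
    diffs = points[:, None, :] - centroids[None, :, :]
    squared_distances = np.sum(diffs ** 2, axis=2)
    return np.argmin(squared_distances, axis=1)


new_points = np.array([[0.5, 0.5], [9.0, 9.5]])
manual_labels = assign_to_nearest_centroid(new_points, centroids)
predicted_labels = model.predict(new_points)

# predict only assigns; it should leave the fitted centroids untouched.
centers_unchanged = np.array_equal(centroids, model.cluster_centers_)
print(manual_labels, predicted_labels, centers_unchanged)
```

The manual labels should agree with predict, and the centroids should be identical before and after the call, which is consistent with predict assigning points rather than refitting. Cluster indices themselves are arbitrary, so only agreement between the two methods is meaningful, not the specific numbers.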