How can I find the mean distance from the centroid to all the data points in each cluster? I am able to find the Euclidean distance of each point in my dataset from the centroid of each cluster. Now I want to find the mean distance from each centroid to all the data points in its cluster. What is a good way of calculating this mean distance? So far I have done this:
def k_means(self):
    data = pd.read_csv('hdl_gps_APPLE_20111220_130416.csv', delimiter=',')
    combined_data = data.iloc[0:, 0:4].dropna()
    #print combined_data
    array_convt = combined_data.values
    #print array_convt
    combined_data.head()
    t_data=PCA(n_components=2).fit_transform(array_convt)
    #print t_data
    k_means=KMeans()
    k_means.fit(t_data)
    #------------k means fit predict method for testing purpose-----------------
    clusters=k_means.fit_predict(t_data)
    #print clusters.shape
    cluster_0=np.where(clusters==0)
    print cluster_0
    X_cluster_0 = t_data[cluster_0]
    #print X_cluster_0
    distance = euclidean(X_cluster_0[0], k_means.cluster_centers_[0])
    print distance
    classified_data = k_means.labels_
    #print ('all rows forst column........')
    x_min = t_data[:, 0].min() - 5
    x_max = t_data[:, 0].max() - 1
    #print ('min is ')
    #print x_min
    #print ('max is ')
    #print x_max
    df_processed = data.copy()
    df_processed['Cluster Class'] = pd.Series(classified_data, index=df_processed.index)
    #print df_processed
    y_min, y_max = t_data[:, 1].min(), t_data[:, 1].max() + 5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 1), np.arange(y_min, y_max, 1))
    #print ('the mesh grid is: ')
    #print xx
    Z = k_means.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(1)
    plt.clf()
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')
    #print Z
    plt.plot(t_data[:, 0], t_data[:, 1], 'k.', markersize=20)
    centroids = k_means.cluster_centers_
    inert = k_means.inertia_
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=8)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.xticks(())
    plt.yticks(())
    plt.show()
In short, I want to calculate the mean distance of all the data points in a particular cluster from the centroid of that cluster, as I need to clean my data on the basis of this mean distance.
You can use the following attribute of KMeans:
cluster_centers_ : array, [n_clusters, n_features]
For every point, find which cluster it belongs to using predict(X), then calculate the distance from the point to the center of the cluster whose index predict returns.
alphaleonis gave a nice answer. For the general case of n dimensions, here are the changes needed to his answer:
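A minimal sketch of what an n-dimensional variant might look like. The signature of k_mean_distance() shown here is an assumption (only the function name appears in this thread), and the small arrays are hand-made stand-ins for k_means.labels_ and k_means.cluster_centers_:

```python
import numpy as np


def k_mean_distance(data, centroid, i_centroid, cluster_labels):
    """Mean Euclidean distance from `centroid` to the points labelled `i_centroid`.

    Hypothetical signature: `data` is (n_samples, n_features), `centroid` is
    (n_features,), `cluster_labels` is (n_samples,).
    """
    members = data[cluster_labels == i_centroid]
    # The norm along axis 1 gives one distance per point, whatever the
    # number of features is.
    distances = np.linalg.norm(members - centroid, axis=1)
    return distances.mean()


# Toy 3-D data; in practice these would come from the fitted KMeans object.
data = np.array([[0.0, 0.0, 0.0],
                 [2.0, 0.0, 0.0],
                 [10.0, 10.0, 10.0],
                 [10.0, 10.0, 14.0]])
cluster_labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0, 0.0],
                      [10.0, 10.0, 12.0]])

mean_distances = [
    k_mean_distance(data, centroid, i, cluster_labels)
    for i, centroid in enumerate(centroids)
]
print(mean_distances)
```

Because the distance is taken as a vector norm over the feature axis, the same function should work unchanged for 2-D PCA output or for the original higher-dimensional data.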
Here's one way. You can substitute another distance measure in the function k_mean_distance() if you want a distance metric other than Euclidean.
Calculate the distance between the data points of each assigned cluster and its cluster center, and return the mean value.
Define a function for the distance calculation, then call it for each centroid to get the mean distance:
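As a rough sketch of the approach described above, assuming a k_mean_distance() helper with a swappable metric argument. The signature, the metric parameter, and the two small distance functions are illustrative assumptions, and the arrays stand in for real labels and centroids:

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (an alternative metric)."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))


def k_mean_distance(data, centroid, i_centroid, cluster_labels,
                    metric=euclidean_distance):
    """Mean distance from `centroid` to the points assigned to cluster `i_centroid`.

    Hypothetical signature; `metric` is any callable taking two points.
    """
    members = data[cluster_labels == i_centroid]
    distances = [metric(point, centroid) for point in members]
    return float(np.mean(distances))


# Hand-made stand-ins for the data, the labels and the centroids.
data = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0], [12.0, 10.0]])
cluster_labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.0], [11.0, 10.0]])

# For each centroid, use the function to get the mean distance.
total_distance = []
for i, centroid in enumerate(centroids):
    total_distance.append(k_mean_distance(data, centroid, i, cluster_labels))
print(total_distance)
```

Passing metric=manhattan_distance (or any other two-argument callable) switches the distance measure without touching the loop.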
So, in the context of your question:
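A sketch using the question's variable names (t_data, clusters, centroids), with small hand-made arrays standing in for the PCA output and the fitted KMeans results. The cleaning rule at the end, keeping only points no farther from their centroid than their cluster's mean distance, is an assumption about what "cleaning" might mean here, not something the question specifies:

```python
import numpy as np

# Stand-ins for the question's variables. With a fitted model these would be
#   clusters  = k_means.fit_predict(t_data)
#   centroids = k_means.cluster_centers_
t_data = np.array([[0.0, 0.0], [0.0, 2.0], [0.0, 10.0],
                   [8.0, 8.0], [8.0, 10.0]])
clusters = np.array([0, 0, 0, 1, 1])
centroids = np.array([[0.0, 4.0], [8.0, 9.0]])

# Mean distance from each centroid to the points assigned to it.
c_mean_distances = []
for i, centroid in enumerate(centroids):
    members = t_data[clusters == i]
    c_mean_distances.append(np.linalg.norm(members - centroid, axis=1).mean())
print(c_mean_distances)

# Assumed cleaning rule: drop points that lie farther from their own
# centroid than the mean distance of their cluster.
point_distances = np.linalg.norm(t_data - centroids[clusters], axis=1)
keep = point_distances <= np.asarray(c_mean_distances)[clusters]
cleaned = t_data[keep]
print(cleaned)
```

A different threshold, such as a multiple of the mean distance, could be substituted in the comparison that builds keep.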
If you plot the results
plt.plot(c_mean_distances)
you should see the mean distance for each cluster plotted against the cluster index.
Compute all the distances into a NumPy array, then use nparray.mean() to get the mean.
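One way this could be done without an explicit loop, as a sketch: compute every point's distance to its own centroid in a single array, then average per cluster with np.bincount. The function name mean_distance_per_cluster and its arguments are hypothetical, and the arrays are toy stand-ins for the data, labels_ and cluster_centers_:

```python
import numpy as np


def mean_distance_per_cluster(X, labels, centers):
    """Return an array whose i-th entry is the mean distance to centroid i.

    Hypothetical helper: `X` is (n_samples, n_features), `labels` holds
    integer cluster indices, `centers` is (n_clusters, n_features).
    """
    # centers[labels] lines up each point with the centroid of its cluster.
    distances = np.linalg.norm(X - centers[labels], axis=1)
    counts = np.bincount(labels, minlength=len(centers))
    sums = np.bincount(labels, weights=distances, minlength=len(centers))
    # Note: a cluster with no points would give a division by zero (nan).
    return sums / counts


X = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [5.0, 9.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [5.0, 7.0]])

means = mean_distance_per_cluster(X, labels, centers)
print(means)

# For a single cluster, the plain array mean gives the same value.
cluster_0 = np.linalg.norm(X[labels == 0] - centers[0], axis=1)
print(cluster_0.mean())
```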