Plot KMeans clusters and classification for 1-dimensional data

Posted 2019-06-03 22:07

I am using KMeans to cluster three time-series datasets with different characteristics. For reproducibility reasons, I am sharing the data here.

Here is my code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

protocols = {}

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }



k_means = KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
k_means.fit(quotient.reshape(-1,1))

This way, given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to, by building each dataset from these two transformed features (quotient and quotient_times) and clustering them with KMeans.
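For example, something like this is roughly what I have in mind (a rough sketch, not my final preprocessing; the new point's values are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Stack the two transformed features column-wise so each row is one
# (quotient_time, quotient) sample.
features = np.column_stack([quotient_times, quotient])

k_means_2d = KMeans(n_clusters=3, random_state=0)
k_means_2d.fit(features)

# A new data point would then be classified like this
# (the numbers here are made up for illustration).
new_point = np.array([[12.5, 0.8]])
print(k_means_2d.predict(new_point))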

k_means.labels_ gives this output:

array([1, 1, 0, 1, 2, 1, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int32)

Finally, I want to visualize the clusters using plt.plot(k_means, ".", color="blue"), but I am getting this error: TypeError: float() argument must be a string or a number, not 'KMeans'. How do we plot KMeans clusters?

2 Answers
干净又极端
Answered 2019-06-03 23:00

If I understand correctly, what you want to plot is the decision boundary of your KMeans result. You can find an example of how to do it on the scikit-learn website here.

That example also applies PCA so that data with more than two dimensions can be visualized in 2D; since your data is one-dimensional, that step is irrelevant for you.

You can easily plot your scatter points coloured by the KMeans assignment so you can better understand where your clustering went wrong.
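For instance, a minimal sketch of that (reusing quotient, quotient_times and the fitted k_means from the question):

import matplotlib.pyplot as plt

# Colour each (quotient_times, quotient) point by its assigned cluster label.
plt.scatter(quotient_times, quotient, c=k_means.labels_, cmap='viridis')
plt.xlabel('quotient_times')
plt.ylabel('quotient')
plt.title('Points coloured by KMeans cluster')
plt.show()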

别忘想泡老子
Answered 2019-06-03 23:08

What you're effectively looking for is a range of values between which points are considered to be in a given class (see the sketch at the end of this answer). It's quite unusual to use KMeans to classify 1d data in this way, although it certainly works. As you've noticed, you need to convert your input data to a 2d array in order to use the method.

k_means = KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)

quotient_2d = quotient.reshape(-1,1)
k_means.fit(quotient_2d)

You will need the quotient_2d again for the classification (prediction) step later.

First we can plot the centroids; since the data is 1d, the x-axis position is arbitrary.

colors = ['r','g','b']
centroids = k_means.cluster_centers_
for n, y in enumerate(centroids):
    plt.plot(1, y, marker='x', color=colors[n], ms=10)
plt.title('Kmeans cluster centroids')

This produces the following plot.

[Figure: Kmeans cluster centroids]

To get cluster membership for the points, pass quotient_2d to .predict. This returns an array of numbers for class membership, e.g.

>>> Z = k_means.predict(quotient_2d)
>>> Z
array([1, 1, 0, 1, 2, 1, 0, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0], dtype=int32)

We can use this to filter our original data, plotting each class in a separate color.

# Plot each class as a separate colour
n_clusters = 3 
for n in range(n_clusters):
    # Filter data points to plot each in turn.
    ys = quotient[ Z==n ]
    xs = quotient_times[ Z==n ]

    plt.scatter(xs, ys, color=colors[n])

plt.title("Points by cluster")

This generates the following plot of the original data, with each point coloured by its cluster membership.

[Figure: points coloured by cluster]
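As a footnote on the "range of values" idea above: for 1d data the cluster boundaries sit halfway between the sorted centroids, so you can read those ranges straight off the fitted model. A small sketch (the test value 0.85 is made up for illustration):

import numpy as np

# Sort the 1d centroids; the midpoints between neighbouring centroids are the
# boundaries between clusters, since k-means assigns each point to its nearest centroid.
centres = np.sort(k_means.cluster_centers_.ravel())
boundaries = (centres[:-1] + centres[1:]) / 2
print(boundaries)

# Classifying a new value by hand with np.digitize gives the same grouping as
# k_means.predict, though the cluster numbering may differ (digitize orders
# clusters by ascending centroid, while k-means labels are arbitrary).
new_value = 0.85
print(np.digitize(new_value, boundaries))
print(k_means.predict(np.array([[new_value]])))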
