K means clustering for multidimensional data

2019-03-17 06:17发布

问题:

if the data set has 440 objects and 8 attributes (dataset been taken from UCI machine learning repository). Then how do we calculate centroids for such datasets. (wholesale customers data) https://archive.ics.uci.edu/ml/datasets/Wholesale+customers

if i calculate the mean of values of each row, will that be the centroid? and how do I plot resulting clusters in matlab.

回答1:

OK, first of all, in the dataset, 1 row corresponds to a single example in the data, you have 440 rows, which means the dataset consists of 440 examples. Each column contains the values for that specific feature (or attribute as you call it), e.g. column 1 in your dataset contains the values for the feature Channel, column 2 the values for the feature Region and so on.

K-Means

Now for K-Means Clustering, you need to specify the number of clusters (the K in K-Means). Say you want K=3 clusters, then the simplest way to initialise K-Means is to randomly choose 3 examples from your dataset (that is 3 rows, randomly drawn from the 440 rows you have) as your centroids. Now these 3 examples are your centroids.

You can think of your centroids as 3 bins and you want to put every example from the dataset into the closest(usually measured by the Euclidean distance; check the function norm in Matlab) bin.

After the first round of putting all examples into the closest bin, you recalculate the centroids by calculating the mean of all examples in their respective bins. You repeat the process of putting all the examples into the closest bin until no example in your dataset moves to another bin.

Some Matlab starting points

You load the data by X = load('path/to/the/dataset', '-ascii');

In your case X will be a 440x8 matrix.

You can calculate the Euclidean distance from an example to a centroid by distance = norm(example - centroid1);, where both, example and centroid1 have dimensionality 1x8.

Recalculating the centroids would work as follows, suppose you have done 1 iteration of K-Means and have put all examples into their respective closest bin. Say Bin1 now contains all examples that are closest to centroid1 and therefore Bin1 has dimensionality 127x8, which means that 127 examples out of 440 are in this bin. To calculate the centroid position for the next iteration you can then do centroid1 = mean(Bin1);. You would do similar things to your other bins.

As for plotting, you have to note that your dataset contains 8 features, which means 8 dimensions and which is not visualisable. I'd suggest you create or look for a (dummy) dataset which only consists of 2 features and would therefore be visualisable by using Matlab's plot() function.