Suppose a data set has 440 objects and 8 attributes (the Wholesale customers data set from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Wholesale+customers). How do we calculate centroids for such a data set?
If I calculate the mean of the values in each row, will that be the centroid? And how do I plot the resulting clusters in Matlab?
OK, first of all, in the dataset 1 row corresponds to a single example. You have 440 rows, which means the dataset consists of 440 examples. Each column contains the values for one specific feature (or attribute, as you call it), e.g. column 1 in your dataset contains the values for the feature Channel, column 2 the values for the feature Region, and so on.
K-Means
Now for K-Means Clustering, you need to specify the number of clusters (the K in K-Means). Say you want K=3 clusters, then the simplest way to initialise K-Means is to randomly choose 3 examples from your dataset (that is 3 rows, randomly drawn from the 440 rows you have) as your centroids. Now these 3 examples are your centroids.
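The random initialisation described above can be sketched as follows. This is only a minimal sketch: X here is a small made-up matrix standing in for your 440x8 data, and the variable names K, idx and centroids are my own choices.

```matlab
% Dummy data standing in for the real 440x8 matrix (values are made up).
X = magic(8);            % 8 examples (rows), 8 features (columns)
K = 3;                   % number of clusters

% Draw K distinct row indices at random and use those rows as centroids.
idx = randperm(size(X, 1), K);
centroids = X(idx, :);   % K-by-8: one centroid per row
```

Because randperm draws without repetition, the same example is never picked twice as a starting centroid.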
You can think of your centroids as 3 bins, and you want to put every example from the dataset into the closest bin (closeness is usually measured by the Euclidean distance; check the function norm in Matlab).
After the first round of putting all examples into the closest bin, you recalculate the centroids by calculating the mean of all examples in their respective bins. So a centroid is not the mean of the values within one row; it is the feature-by-feature (column-wise) mean over all rows currently in that bin. You then repeat both steps, reassigning every example to its closest bin and recalculating the centroids, until no example in your dataset moves to another bin.
Some Matlab starting points
You load the data (assuming a purely numeric file without a header row) by
X = load('path/to/the/dataset', '-ascii');
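If your copy of the file is comma-separated and starts with a row of column names (I believe the UCI CSV does, but check your download), load with '-ascii' will choke on that text. One way around it, sketched here on a tiny made-up file so that it runs on its own, is dlmread with a row offset. The file contents and the third column name are assumptions for illustration only.

```matlab
% Write a tiny dummy CSV (made-up values) so the example is self-contained.
fname = [tempname() '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'Channel,Region,Other\n');      % header row with column names
fprintf(fid, '2,3,100\n1,3,250\n2,1,75\n');  % three numeric rows
fclose(fid);

% Skip the first row (row offset 1, zero-based) and start at column offset 0.
X = dlmread(fname, ',', 1, 0);
delete(fname);                               % clean up the temporary file

disp(size(X));                               % rows = examples, columns = features
```

With the real file you would pass the path to your dataset instead of fname; the offsets stay the same.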
In your case X will be a 440x8 matrix.
You can calculate the Euclidean distance from an example to a centroid by
distance = norm(example - centroid1);
where both example and centroid1 have dimensionality 1x8.
Recalculating the centroids works as follows. Suppose you have done 1 iteration of K-Means and have put all examples into their respective closest bins. Say Bin1 now contains all examples that are closest to centroid1 and therefore has dimensionality 127x8, which means that 127 examples out of 440 are in this bin. To calculate the centroid position for the next iteration you can then do centroid1 = mean(Bin1, 1); (the explicit dimension argument 1 keeps the mean column-wise even if a bin happens to contain a single example). You would do the same for your other bins.
As for plotting, note that your dataset contains 8 features, which means 8 dimensions, and that cannot be plotted directly. I'd suggest you create or look for a (dummy) dataset which consists of only 2 features and can therefore be visualised with Matlab's plot() function.
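To tie the steps together, here is a minimal sketch of the whole loop on a made-up 2-feature dataset, followed by a plot. The data, the fixed starting centroids, the 100-round cap and all variable names are assumptions for illustration, not the only way to do it.

```matlab
% Dummy 2-feature dataset: three well-separated groups (values are made up).
X = [1 1; 1.5 2; 2 1; 8 8; 8.5 9; 9 8; 1 8; 1.5 9; 2 8];
K = 3;
centroids = X([1 4 7], :);          % fixed start so the run is repeatable;
                                    % use randperm(size(X,1), K) for a random start
assignment = zeros(size(X, 1), 1);  % 0 means "not assigned yet"

for iter = 1:100                    % safety cap on the number of rounds
    % Assignment step: put every example into its closest bin.
    newAssignment = zeros(size(X, 1), 1);
    for i = 1:size(X, 1)
        d = zeros(K, 1);
        for k = 1:K
            d(k) = norm(X(i, :) - centroids(k, :));
        end
        [~, newAssignment(i)] = min(d);
    end

    % Stop once no example moved to another bin.
    if isequal(newAssignment, assignment)
        break;
    end
    assignment = newAssignment;

    % Update step: each centroid becomes the column-wise mean of its bin.
    for k = 1:K
        bin = X(assignment == k, :);
        if ~isempty(bin)            % keep the old centroid if a bin is empty
            centroids(k, :) = mean(bin, 1);
        end
    end
end

% Plot each cluster in its own colour, centroids as black crosses.
colours = 'rgb';
figure; hold on;
for k = 1:K
    plot(X(assignment == k, 1), X(assignment == k, 2), [colours(k) 'o']);
end
plot(centroids(:, 1), centroids(:, 2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
hold off;
```

On this toy data each group of three points ends up in its own bin. If you have the Statistics and Machine Learning Toolbox, its kmeans function does the clustering for you, but writing the loop once by hand makes it clearer what the centroids actually are.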