MATLAB: K-means clustering

Posted 2020-01-31 04:04

Question:

I have a matrix A (369x10) which I want to cluster into 19 clusters. I use this method:

[idx ctrs]=kmeans(A,19)

which yields idx(369x1) and ctrs(19x10)

I understand everything up to this point: all the rows of A are clustered into 19 clusters.

Now I have another array B (49x10). I want to know which of the given 19 clusters each row of B belongs to.

How can I do this in MATLAB?

Thank you in advance

Answer 1:

I can't think of a better way to do it than what you described. A built-in function would save one line, but I couldn't find one. Here's the code I would use:

[ids ctrs]=kmeans(A,19);
D = dist([testpoint;ctrs]); %testpoint is 1x10 and D will be 20x20
[distance testpointID] = min(D(1,2:end));
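
If you have the Statistics Toolbox, pdist2 can handle the whole B matrix against the centroids in one call; this is only a sketch, assuming the B (49x10) and ctrs (19x10) matrices from the question:

D = pdist2(B, ctrs);                       % 49x19 matrix of Euclidean distances
[distances, clusterIDs] = min(D, [], 2);   % index of the nearest centroid for each row of B

The argmin is the same whether you use Euclidean or squared Euclidean distance, so this gives the same assignments kmeans would make with its default metric.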


Answer 2:

The following is a complete example of clustering:

%% generate sample data
K = 3;
numObservarations = 100;
dimensions = 3;
data = rand([numObservarations dimensions]);

%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
    'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')

%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);


%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K);     % init distances
for k=1:K
    %squared Euclidean distance: d = sum((x-y).^2)
    D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end

% for each instance, find the closest cluster
[minDists, clusterIndices] = min(D, [], 2);

% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
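
The same idea handles new observations that were not part of the clustering, e.g. the B matrix from the question; a sketch, assuming B has the same number of columns as data:

% squared distances from each row of B to each learned centroid
DB = zeros(size(B,1), K);
for k = 1:K
    DB(:,k) = sum( (B - repmat(clusters(k,:), size(B,1), 1)).^2, 2);
end
[~, newClustIDX] = min(DB, [], 2);   % cluster index for each row of B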


Answer 3:

I'm not sure I understand your question correctly, but if you want to know which cluster your points belong to, you can easily use the knnsearch function. It takes two arguments: for each row of the second argument, it finds the row of the first argument that is closest to it.
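
A minimal sketch, assuming the ctrs and B matrices from the question and that the Statistics Toolbox is available:

% for each row of B, the index (1..19) of the nearest centroid
clusterIDs = knnsearch(ctrs, B);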



Answer 4:

Assuming you're using the squared Euclidean distance metric, try this:

d = zeros(size(B,1), size(ctrs,1));   % one column of squared distances per centroid
for i = 1:size(ctrs,1)                % loop over the centroids (rows of ctrs)
    d(:,i) = sum((B-ctrs(repmat(i,size(B,1),1),:)).^2,2);
end
[distances,predicted] = min(d,[],2)

predicted should then contain the index of the closest centroid for each row of B, and distances the corresponding (squared) distances.

Take a look inside the kmeans function, at the subfunction 'distfun'. This shows you how to do the above, and also contains the equivalents for other distance metrics.



Answer 5:

For a small amount of data, you could do

[testpointID,dum] = find(permute(all(bsxfun(@eq,B,permute(ctrs,[3,2,1])),2),[3,1,2]))

but this is somewhat obscure: the bsxfun with the permuted ctrs creates a 49 x 10 x 19 array of booleans, which is then 'all-ed' across the second dimension, permuted back, and then the row IDs are found. Again, this is probably not practical for large amounts of data.
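
An arguably more readable way to do the same exact-row lookup (it still only works when rows of B are exact copies of centroid rows) is ismember with the 'rows' flag; a sketch, assuming the B and ctrs matrices from the question:

% loc(i) is the row of ctrs that exactly equals B(i,:), or 0 if there is no match
[tf, loc] = ismember(B, ctrs, 'rows');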