Cut-off point in k-means clustering in SAS

Posted 2019-08-15 00:42

Question:

I want to classify my data into clusters and get the cut-off points between them in SAS. The method I use is k-means clustering. (I don't mind which method is used, as long as it gives me 3 groups.)

My code for clustering:

proc fastclus data=maindat outseed=seeds1 maxclusters=3 maxiter=0;
var value resid;
run;

I have a problem with the output. I want the cut-off points for Value to be included in the output file. (I don't want cut-off points for Resid.) Is there any way to do this in SAS?

Edit: As Joe points out, I can't achieve what I'm looking for with k-means clustering. So is there another way? Basically, I want cut-off points that I can apply to another data set.

What I have:

Cluster  Value      Resid
 1        34        11.7668
 2        38.9      0.5328
 3        42.625    -13.2364

What I want:

Cluster  Value      Resid       Cut-off Value (Integer)
 1        34        11.7668     1-36
 2        38.9      0.5328      36-40
 3        42.625    -13.2364    40-44

My data:

data maindat;
input  value Resid ;
datalines;
44  -4.300511714
44  -9.646920963
44  -15.86956805
43  -16.14857235
43  -13.05797186
43  -13.80941206
42  -3.521394503
42  -1.102526302
42  -0.137573583
42  2.669238665
42  -9.540489193
42  -19.27474303
42  -3.527077011
41  1.676464068
41  -2.238822314
41  4.663079037
41  -5.346920963
40  -8.543723186
40  0.507460641
40  0.995302284
40  0.464194011
39  4.728791571
39  5.578685423
38  2.771297564
38  7.109159247
37  15.96059456
37  2.985292226
36  -4.301136971
35  5.854674875
35  5.797294021
34  4.393329025
33  -6.622580905
32  0.268500302
27  12.23062252
;
run;

Answer 1:

I don't think you can do exactly what you're asking.

k-means clustering uses the Euclidean distance computed across all of the variables you provide. This means that it's not using value alone to cluster observations: it's using Resid as well.

As such, a row whose value looks like it belongs in cluster 2 may actually be assigned to cluster 3, if its Resid is much closer to that cluster's.
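To make that concrete, here is a small sketch that takes one row from the question's data (value 40, Resid -8.543723186) and measures its Euclidean distance to the cluster 2 and cluster 3 centroids shown in the question's output. The dataset name dist_demo and the variables d2, d3, and nearer are made up for illustration, and this only shows the distance arithmetic; it is not a claim about which cluster FASTCLUS actually assigned to that row.

```sas
data dist_demo;
  /* one row taken from the question's data */
  value = 40;
  resid = -8.543723186;

  /* Euclidean distance to the cluster 2 centroid (38.9, 0.5328) */
  d2 = sqrt((value - 38.9)**2 + (resid - 0.5328)**2);

  /* Euclidean distance to the cluster 3 centroid (42.625, -13.2364) */
  d3 = sqrt((value - 42.625)**2 + (resid - (-13.2364))**2);

  /* hypothetical helper variable: the nearer of the two centroids */
  nearer = ifn(d3 < d2, 3, 2);

  put d2= d3= nearer=;
run;
```

On value alone, 40 is closer to 38.9 than to 42.625, but once Resid is included the distance to the cluster 3 centroid (about 5.4) is smaller than the distance to the cluster 2 centroid (about 9.1).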

In your example, if you request an out dataset, you will see this is true. A proc freq of that out dataset reveals that cluster 1 has three rows, with values 27, 37, and 38. Cluster 2 has almost all of the rows - all but 7 in total - ranging from 32 to 44. Cluster 3 ranges from 40 to 44.
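A minimal sketch of that check, assuming the maindat dataset from the question. The output dataset name clusout is made up here; out= adds a CLUSTER variable holding each row's assigned cluster.

```sas
/* Same clustering as in the question, plus out= to keep each row's cluster */
proc fastclus data=maindat out=clusout outseed=seeds1 maxclusters=3 maxiter=0;
  var value resid;
run;

/* Cross-tabulate cluster by value to see which values land in each cluster */
proc freq data=clusout;
  tables cluster*value / norow nocol nopercent;
run;

/* Row count and value range per cluster */
proc means data=clusout n min max;
  class cluster;
  var value;
run;
```

If the minimum and maximum of value overlap between clusters, a single cut-off on value cannot separate them.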

Because the value ranges of the clusters overlap, there's no reasonable way to define value cut-offs for them with this method of clustering. Clusters are typically defined by their centroids, and that's what you get with the outstat dataset; you can determine which cluster a particular observation should be assigned to by finding its nearest centroid.
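If the goal is to apply the clustering to another data set, one possible approach is to feed the saved centroids back into PROC FASTCLUS as seeds and ask for no iterations, so each new row is simply assigned to its nearest centroid. This is a sketch under assumptions: newdat is a hypothetical dataset with the same value and resid variables, newclus is a made-up output name, and seeds1 is the outseed= dataset from the question's code.

```sas
/* Assign rows of a new dataset to the nearest saved centroid.            */
/* newdat and newclus are hypothetical names; seeds1 comes from the       */
/* outseed= option in the question's PROC FASTCLUS step.                  */
proc fastclus data=newdat seed=seeds1 out=newclus
              maxclusters=3 maxiter=0 noprint;
  var value resid;
run;

/* newclus holds the original columns plus CLUSTER and DISTANCE */
proc print data=newclus;
run;
```

Note that the assignment still depends on both value and Resid, so it does not translate into a cut-off on value alone.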