How to cluster data with discrete binary attributes

Posted 2019-06-11 05:29

Question:

My data has ten million binary attributes, but only some of them are informative; most of the values are zeros.

The format is as follows:

data  attribute1 attribute2 attribute3 attribute4   .........
A          0          1           0         1       .........
B          1          0           1         0       .........
C          1          1           0         1       .........
D          1          1           0         0       .........

What is a smart way to cluster this? I know k-means clustering, but I don't think it's suitable in this case, because binary values make distances less meaningful, and it will suffer from the curse of dimensionality. Even if I cluster based only on the few informative attributes, that is still too many attributes.

I think a decision tree would be a nice way to cluster this data, but it's a classification algorithm!

What can I do?

Answer 1:

Have you considered frequent itemset mining instead?

K-means is definitely a bad idea, but hierarchical clustering may work when used with an appropriate distance function such as Jaccard, Hamming, Dice, ...
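To make the hierarchical-clustering suggestion concrete, here is a minimal sketch using only the standard library. It assumes each record is stored sparsely as the set of attribute indices that are 1 (a natural fit when most values are zero), uses Jaccard distance, and picks average linkage; the linkage choice, the function names, and the target of two clusters are my assumptions, not something stated above. The rows are the four sample records from the question, restricted to the four attributes shown. This naive loop is only meant for illustration and would not scale to a large number of records.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two sets of active (value 1) attribute indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two all-zero records are treated as identical
    return 1.0 - len(a & b) / union


def average_linkage(rows, n_clusters):
    """Naive agglomerative clustering with average linkage.

    rows maps a record name to its set of active attribute indices.
    Returns a list of clusters, each a list of record names.
    """
    clusters = [[name] for name in rows]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            dists = [jaccard_distance(rows[p], rows[q])
                     for p in clusters[i] for q in clusters[j]]
            d = sum(dists) / len(dists)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Merge the closest pair; i < j, so deleting j keeps i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Sample records from the question (only the four attributes shown).
rows = {
    "A": {2, 4},
    "B": {1, 3},
    "C": {1, 2, 4},
    "D": {1, 2},
}

clusters = average_linkage(rows, n_clusters=2)
print(clusters)
```

On this toy input, A, C and D end up together and B is left on its own, since B shares at most one active attribute with any other record.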

Anyway, what is a cluster? The choice of algorithm needs to fit the kind of cluster you want to find. On binary data, centroid-based methods such as k-means don't make much sense, because the centroids (attribute-wise means) are generally not binary vectors themselves and so are not very meaningful.
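A quick way to see the problem with centroids is to average the four sample records from the question. This is only an illustration on the four attributes shown; the variable names are mine.

```python
# Sample records from the question (only the four attributes shown).
records = {
    "A": [0, 1, 0, 1],
    "B": [1, 0, 1, 0],
    "C": [1, 1, 0, 1],
    "D": [1, 1, 0, 0],
}

# The centroid is the attribute-wise mean, as k-means would compute it.
n = len(records)
centroid = [sum(column) / n for column in zip(*records.values())]
print(centroid)  # [0.75, 0.75, 0.25, 0.5]

# The centroid is not itself a valid binary record.
is_binary = all(v in (0, 1) for v in centroid)
print(is_binary)  # False
```

The mean is a vector of fractions, which does not correspond to any record that could actually occur in the data.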

If the data are "shopping cart"-style information, consider using frequent itemset mining, as it allows you to discover overlapping subsets.
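As a rough sketch of what frequent itemset mining looks like, the code below runs a level-wise, Apriori-style search over the four sample records from the question, treating each record as a "cart" holding the attributes that are 1. The answer does not name a particular algorithm, so Apriori is my choice here, and the function name and the minimum support of 2 are assumptions for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search.

    transactions is a list of sets of items.
    Returns a dict mapping each frequent itemset (frozenset) to its support count.
    """
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while current:
        # Keep the candidates of size k that occur often enough.
        frequent = {}
        for candidate in current:
            count = support(candidate)
            if count >= min_support:
                frequent[candidate] = count
        result.update(frequent)

        # Build size k+1 candidates by joining frequent size-k itemsets.
        k += 1
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        current = [c for c in candidates
                   if all(frozenset(s) in frequent
                          for s in combinations(c, k - 1))]
    return result


# Sample records from the question as sets of active attribute indices.
transactions = [
    {2, 4},     # A
    {1, 3},     # B
    {1, 2, 4},  # C
    {1, 2},     # D
]

result = frequent_itemsets(transactions, min_support=2)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Note that the resulting itemsets overlap: attribute 2 appears both in {1, 2} and in {2, 4}, so a record like C can belong to several patterns at once, which a hard partition from clustering would not express.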