I am trying to create program that cluster documents using hierarchical agglomerative clustering, and the output of the program depends on cutting the dendrogram at such a level that I get maximum purity.
So following is the algorithm I am working on right now.
Create dedrogram for the documents in the dataset
purity = 0
final_clusters
for all the levels, lvl, in the dendrogram
clusters = cut dendrogram at lvl
new_purity = calculate_purity_of(clusters)
if new_purity > purity
purity = new_purity
final_clusters = clusters
according to this algorithm I get the clusters at which the purity calculated is highest at all the levels.
The problem is, when I cut the dendrogram at lowest level, every cluster contains only one document, which means it is 100% pure, therefore average purity of clusters is 1.0. But this is not the desired output. What I want is proper grouping of documents. Am I doing something wrong?
You are using a too simple measure.
Yes, the "optimal" solution with respect to purity is to only merge duplicate objects, so that each cluster remains pure by definition.
This is why optimizing a mathematical criterion often isn't the right approach to tackle a real data problem. Instead, you need to ask yourself the question: "what would be an interesting result", where interesting is not the same as optimal in a mathematical sense.
Sorry that I cannot give you a better answer - but I don't have your data.
IMHO, any abstract mathematical approach will suffer from the same fate. You need to have your data and user needs specify what to cluster, not some statistical number; so don't look in mathematics for the answer, but look at your data and your user needs.