I am trying to create program that cluster documents using hierarchical agglomerative clustering, and the output of the program depends on cutting the dendrogram at such a level that I get maximum purity.
So following is the algorithm I am working on right now.
Create dedrogram for the documents in the dataset
purity = 0
final_clusters
for all the levels, lvl, in the dendrogram
clusters = cut dendrogram at lvl
new_purity = calculate_purity_of(clusters)
if new_purity > purity
purity = new_purity
final_clusters = clusters
according to this algorithm I get the clusters at which the purity calculated is highest at all the levels.
The problem is, when I cut the dendrogram at lowest level, every cluster contains only one document, which means it is 100% pure, therefore average purity of clusters is 1.0. But this is not the desired output. What I want is proper grouping of documents. Am I doing something wrong?