I am using scikit-learn to implement the Dirichlet Process Gaussian Mixture Model:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/mixture/dpgmm.py
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html
That is, I am using sklearn.mixture.BayesianGaussianMixture() with weight_concentration_prior_type = 'dirichlet_process', which is the default. As opposed to k-means, where users set the number of clusters "k" a priori, DPGMM is an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters.
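For reference, the constructor defaults can be checked directly. This is only a small sanity check, not part of my actual pipeline; the variable name default_model is just for illustration. As far as I understand the documentation, n_components acts as an upper bound (a truncation level) on the number of components rather than as the number of clusters to find.

```python
from sklearn.mixture import BayesianGaussianMixture

# Inspect the constructor defaults without fitting anything.
default_model = BayesianGaussianMixture()

# The Dirichlet Process prior is what you get when no prior type is passed.
print(default_model.weight_concentration_prior_type)

# n_components is the truncation level (upper bound) of the mixture.
print(default_model.n_components)
```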
My DPGMM model consistently outputs exactly n_components clusters. As discussed in the question linked below, the correct way to deal with this is to "reduce redundant components" with predict(X):
Scikit-Learn's DPGMM fitting: number of components?
However, the example linked to does not actually remove redundant components and show the "correct" number of clusters in the data. Rather, it simply plots the correct number of clusters.
http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html
How do users actually remove the redundant components, and output an array which shows only the remaining components? Is this the "official"/only way to remove redundant clusters?
Here is my code:
>>> import pandas as pd
>>> import numpy as np
>>> import random
>>> from sklearn import mixture
>>> X = pd.read_csv(....) # my matrix
>>> X.shape
(20000, 48)
>>> dpgmm3 = mixture.BayesianGaussianMixture(n_components = 20, weight_concentration_prior_type='dirichlet_process', max_iter = 1000, verbose = 2)
>>> dpgmm3.fit(X) # Fitting the DPGMM model
>>> labels = dpgmm3.predict(X) # Generating labels after model is fitted
>>> max(labels)
>>> np.unique(labels) # Number of labels == n_components specified above
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19])
#Trying with a different n_components
>>> dpgmm3_1 = mixture.BayesianGaussianMixture( weight_concentration_prior_type='dirichlet_process', max_iter = 1000) #not specifying n_components
>>> dpgmm3_1.fit(X)
>>> labels_1 = dpgmm3_1.predict(X)
>>> labels_1
array([0, 0, 0, ..., 0, 0, 0]) #All were classified under the same label
#Trying with n_components = 7
>>> dpgmm3_2 = mixture.BayesianGaussianMixture(n_components = 7, weight_concentration_prior_type='dirichlet_process', max_iter = 1000)
>>> dpgmm3_2.fit(X)
>>> labels_2 = dpgmm3_2.predict(X)
>>> np.unique(labels_2)
array([0, 1, 2, 3, 4, 5, 6]) #number of labels == n_components
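
To make the question concrete, here is a minimal sketch of what I understand "reducing redundant components with predict(X)" to mean. It runs on synthetic data (X_demo, three well-separated blobs), not on my real matrix. The cutoff weight_threshold is an arbitrary value I picked for illustration, not a scikit-learn default, and I am not claiming this is the official method.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in for the real data: three well-separated 2-D blobs.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.randn(200, 2) for c in centers])

# Deliberately over-specify the number of components.
model = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
).fit(X_demo)

labels = model.predict(X_demo)

# Option 1: keep only components that predict() actually assigns points to.
used = np.unique(labels)
kept_means = model.means_[used]

# Option 2: keep components whose mixing weight exceeds a cutoff.
weight_threshold = 0.01  # assumption: arbitrary cutoff, not a library default
heavy = np.flatnonzero(model.weights_ > weight_threshold)

# Relabel the points so the kept components are numbered 0..len(used)-1.
compact_labels = np.searchsorted(used, labels)

print("components used by predict:", used)
print("components above weight threshold:", heavy)
print("number of distinct compact labels:", len(np.unique(compact_labels)))
```

The two options need not agree: a component can carry a small but non-zero weight and still win a few points in predict(), so which one counts as the number of clusters is part of what I am asking.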