The results I get from DPGMM are not what I expect. E.g.:
>>> import sklearn.mixture
>>> sklearn.__version__
'0.12-git'
>>> data = [[1.1],[0.9],[1.0],[1.2],[1.0], [6.0],[6.1],[6.1]]
>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=1)
>>> m.fit(data)
DPGMM(alpha=1, covariance_type='diag', init_params='wmc', min_covar=None,
n_components=5, n_iter=1000, params='wmc',
random_state=<mtrand.RandomState object at 0x108a3f168>, thresh=0.01,
verbose=False)
>>> m.converged_
True
>>> m.weights_
array([ 0.2, 0.2, 0.2, 0.2, 0.2])
>>> m.means_
array([[ 0.62019109],
[ 1.16867356],
[ 0.55713292],
[ 0.36860511],
[ 0.17886128]])
I expected the result to be more similar to the vanilla GMM; that is, two Gaussians (around values 1 and 6) with non-uniform weights (like [ 0.625, 0.375]). I expected the "unused" Gaussians to have weights near zero.
Am I using the model incorrectly?
I've also tried changing alpha without any luck.
The behaviour is not much different with version 0.14.1 of sklearn. I will use the following code to print a DPGMM model:
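import numpy as np

def pprint(model, data):
    # Indices of the components that get at least one point assigned
    idx = np.unique(model.predict(data))
    m_w_cov = [model.means_, model.weights_, model._get_covars()]
    flattened = [np.array(x).flatten() for x in m_w_cov]
    # Keep only the entries of the used components
    filtered = [x[idx] for x in flattened]
    # Rows: means, weights, covariances
    print(np.array(filtered))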
This function filters out the redundant (empty) components, i.e. those that predict never assigns a point to, and prints the means, weights and covariances of the remaining ones.
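To make the filtering step concrete, here is a minimal sketch that needs no fitted model. The labels and weights are copied from the outputs shown in this thread; the variable names are only illustrative.

```python
import numpy as np

# Values copied from the outputs in this thread, standing in for a fitted model.
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1])    # like model.predict(data)
weights = np.array([0.2, 0.2, 0.2, 0.2, 0.2])  # like model.weights_

# Components that received at least one point.
used = np.unique(labels)

# Indexing with that array keeps only the entries of the used components.
used_weights = weights[used]

print(used)
print(used_weights)
```

Only components 0 and 1 survive here, so the three components that never win a point are dropped from the printout.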
If you fit the data from the OP's question several times, you can get two different results:
>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=1).fit(data)
>>> m.predict(data)
array([0, 0, 0, 0, 0, 1, 1, 1])
>>> pprint(m, data)
[[ 0.62019109 1.16867356]
[ 0.10658447 0.19810279]
[ 1.08287064 12.43049771]]
and
>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=1).fit(data)
>>> m.predict(data)
array([1, 1, 1, 0, 1, 0, 0, 0])
>>> pprint(m, data)
[[ 1.24122696 0.64252404]
[ 0.17157736 0.17416976]
[ 11.51813929 1.07829109]]
From this one can guess that the unexpected result is caused by some of the intermediate points (1.2 in our case) migrating between classes, so the method is unable to infer the correct model parameters. One reason is that the clustering parameter alpha is too big for our clusters, which contain only 3 to 5 elements each. We can do better by reducing this parameter; 0.1 gives more stable results:
>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=.1).fit(data)
>>> m.predict(data)
array([1, 1, 1, 1, 1, 0, 0, 0])
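As a rough intuition for why a smaller alpha helps, here is a sketch based on the textbook Dirichlet process (Chinese restaurant process) prior, where the expected number of occupied clusters for n points is the sum of alpha / (alpha + i) for i from 0 to n - 1. That sklearn's DPGMM uses alpha in exactly this way is an assumption on my part, and expected_clusters is a hypothetical helper, not part of sklearn.

```python
def expected_clusters(n_points, alpha):
    """Expected number of occupied clusters under a Chinese restaurant
    process prior with concentration alpha (textbook formula)."""
    return sum(alpha / (alpha + i) for i in range(n_points))

# Compare the two alpha values tried above on a dataset of 8 points.
for alpha in (1.0, 0.1):
    print(alpha, round(expected_clusters(8, alpha), 3))
```

With 8 points the prior expects noticeably fewer clusters for alpha=0.1 than for alpha=1, which is at least consistent with the more stable split seen above.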
But the root cause lies in the stochastic nature of the DPGMM method: it is unable to infer the model structure when the clusters are this small. Things get better, and the method behaves more as expected, if we repeat the observations 4 times:
>>> m = sklearn.mixture.DPGMM(n_components=5, n_iter=1000, alpha=1).fit(data*4)
>>> pprint(m, data)
[[ 0.90400296 5.46990901]
[ 0.11166431 0.24956023]
[ 1.02250372 1.31278926]]
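Note that data is a plain Python list, so data*4 repeats the observations rather than multiplying the values by 4. A quick check, using the data from the question:

```python
data = [[1.1], [0.9], [1.0], [1.2], [1.0], [6.0], [6.1], [6.1]]

# List repetition: the same 8 points, four times over.
repeated = data * 4
print(len(data), len(repeated))

# The values themselves are unchanged; only the number of observations grows.
distinct_values = sorted(set(x[0] for x in repeated))
print(distinct_values)
```

The same would not hold for a NumPy array, where * 4 scales every value instead of repeating the rows.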
In conclusion, be careful with the fitting parameters, and be aware that some ML methods do not work well on small or skewed datasets.