Scikit-learn: EM algorithm in GMM fails for one-dimensional datasets using 0.16.1 (but fine with 0.15.2)

Created on 13 May 2015  ·  4 Comments  ·  Source: scikit-learn/scikit-learn

Fitting a one-dimensional Gaussian distribution using GMM.fit() raises a RuntimeError with scikit-learn version 0.16.1, but produces appropriate parameters with 0.15.2.

A short example to demonstrate the problem:

import sklearn
from sklearn import mixture
import numpy as np
from scipy import stats
import sys

# the version info 
print("Python version: %s.%s" %(sys.version_info.major, sys.version_info.minor))
print("scikit-learn version: %s" %(sklearn.__version__))

# some pretend data
np.random.seed(seed=0)
data = stats.norm.rvs(loc=100, scale=1, size=1000)
print("Data mean = %s, Data std dev = %s" %(np.mean(data), np.std(data)))

# Fitting using a GMM with a single component
clf = mixture.GMM(n_components=1)
clf.fit(data)
print(clf.means_, clf.weights_, clf.covars_)

Running this example code with scikit-learn 0.15.2 produces correct output:

Python version: 3.4
scikit-learn version: 0.15.2
Data mean = 99.9547432925, Data std dev = 0.987033158669
[[ 99.95474329]] [ 1.] [[ 0.97523446]]

However, exactly the same code using scikit-learn 0.16.1 gives this traceback:

Python version: 3.4
scikit-learn version: 0.16.1
Data mean = 99.9547432925, Data std dev = 0.987033158669
/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/numpy/lib/function_base.py:1890: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/numpy/lib/function_base.py:1901: RuntimeWarning: invalid value encountered in true_divide
  return (dot(X, X.T.conj()) / fact).squeeze()
Traceback (most recent call last):
  File "test_sklearn.py", line 18, in <module>
    clf.fit(data)
  File "/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/sklearn/mixture/gmm.py", line 498, in fit
    "(or increasing n_init) or check for degenerate data.")
RuntimeError: EM algorithm was never able to compute a valid likelihood given initial parameters. Try different init parameters (or increasing n_init) or check for degenerate data.

I've tried various values of the n_init, n_iter and covariance_type parameters, and a range of different datasets. All of these result in this or a similar error using 0.16.1, but there are no issues at all using 0.15.2. The problem seems to be related to the initial parameters used in the expectation maximisation, so it's possible that this is related to this issue: #4429

In case this is useful info, I was using an anaconda virtual environment with a clean install of scikit-learn, set up as follows (for version 0.16.1):

conda create -n new_sklearn python=3.4
source activate new_sklearn
conda install scikit-learn
Bug


All 4 comments

This is likely an issue with the data shape.
Is your X 1-D or 2-D?
There might be an unintentional behavior change between 0.15 and 0.16, but I think we decided that in the future we will not support 1-D input, so your input shape should be X.shape = (n_samples, 1).
You can do

X = X.reshape(-1, 1)

Otherwise it is ambiguous whether you mean one sample or one feature.
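For anyone hitting this on a current install: here is a minimal sketch of the reshape fix applied to the original repro, using the GaussianMixture API that replaced GMM from scikit-learn 0.18 onward (note `covars_` became `covariances_` there):

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture  # successor to the old GMM class

# same pretend data as the original example
np.random.seed(0)
data = stats.norm.rvs(loc=100, scale=1, size=1000)

# reshape the flat (1000,) array into an explicit (n_samples, 1) column,
# which is what the estimator requires
X = data.reshape(-1, 1)

clf = GaussianMixture(n_components=1)
clf.fit(X)
print(clf.means_, clf.weights_, clf.covariances_)
```

The fitted mean and covariance should come out close to the data's sample mean (~100) and variance (~1), matching the 0.15.2 output above.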

Yes, once I reshape the input data, it works fine. Thank you!

Hi, the above code fixes the error, but the behavior is different, I think. I am running this tutorial code, but after I reshape the input data, I cannot reproduce the plot.

Is there something I can change (parameters, maybe) so that I can have the same result? Thanks!

@imadie

In the tutorial referenced, change the following lines to:

clf = GMM(4, n_iter=500, random_state=3)
x.shape = (x.shape[0], 1)  # reshape to (n_samples, 1)
clf = clf.fit(x)

xpdf = np.linspace(-10, 20, 1000)
xpdf.shape = (xpdf.shape[0], 1)  # same reshape for the evaluation grid
density = np.exp(clf.score(xpdf))
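For what it's worth, on current scikit-learn the equivalent goes through GaussianMixture, where the per-sample log-likelihoods moved from `score()` to `score_samples()` (`score()` now returns the average). A sketch under that assumption, with synthetic two-cluster data standing in for the tutorial's (which isn't shown here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# stand-in data: two well-separated 1-D clusters
rng = np.random.RandomState(3)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1.5, 300)])
X = x.reshape(-1, 1)

clf = GaussianMixture(n_components=2, random_state=3).fit(X)

xpdf = np.linspace(-10, 20, 1000).reshape(-1, 1)
# score_samples() gives the per-sample log-likelihood, like the old GMM.score()
density = np.exp(clf.score_samples(xpdf))
```

Plotting `density` against `xpdf` should then reproduce the tutorial's density curve.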
