Scikit-learn: EM algorithm in GMM fails for one-dimensional datasets using 0.16.1 (but fine with 0.15.2)

Created on 13 May 2015  ·  4 Comments  ·  Source: scikit-learn/scikit-learn

Fitting a one-dimensional Gaussian distribution using GMM.fit() raises a RuntimeError with scikit-learn version 0.16.1, but produces appropriate parameters with 0.15.2.

A short example to demonstrate the problem:

import sklearn
from sklearn import mixture
import numpy as np
from scipy import stats
import sys

# the version info 
print("Python version: %s.%s" %(sys.version_info.major, sys.version_info.minor))
print("scikit-learn version: %s" %(sklearn.__version__))

# some pretend data
np.random.seed(seed=0)
data = stats.norm.rvs(loc=100, scale=1, size=1000)
print("Data mean = %s, Data std dev = %s" %(np.mean(data), np.std(data)))

# Fitting using a GMM with a single component
clf = mixture.GMM(n_components=1)
clf.fit(data)
print(clf.means_, clf.weights_, clf.covars_)

Running this example code with scikit-learn 0.15.2 produces correct output:

Python version: 3.4
scikit-learn version: 0.15.2
Data mean = 99.9547432925, Data std dev = 0.987033158669
[[ 99.95474329]] [ 1.] [[ 0.97523446]]

However, exactly the same code using scikit-learn 0.16.1 gives this traceback:

Python version: 3.4
scikit-learn version: 0.16.1
Data mean = 99.9547432925, Data std dev = 0.987033158669
/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/numpy/lib/function_base.py:1890: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/numpy/lib/function_base.py:1901: RuntimeWarning: invalid value encountered in true_divide
  return (dot(X, X.T.conj()) / fact).squeeze()
Traceback (most recent call last):
  File "test_sklearn.py", line 18, in <module>
    clf.fit(data)
  File "/home/rebecca/anaconda/envs/new_sklearn/lib/python3.4/site-packages/sklearn/mixture/gmm.py", line 498, in fit
    "(or increasing n_init) or check for degenerate data.")
RuntimeError: EM algorithm was never able to compute a valid likelihood given initial parameters. Try different init parameters (or increasing n_init) or check for degenerate data.

I've tried various values of the n_init, n_iter and covariance_type parameters, and a range of different datasets. All of these result in this or a similar error using 0.16.1, but there are no issues at all using 0.15.2. The problem seems to be related to the initial parameters used in the expectation maximisation, so it's possible that this is related to this issue: #4429

In case this is useful info, I was using an anaconda virtual environment with a clean install of scikit-learn, set up as follows (for version 0.16.1):

conda create -n new_sklearn python=3.4
source activate new_sklearn
conda install scikit-learn
Bug


All 4 comments

This is likely an issue with the data shape.
Is your X 1-D or 2-D?
There might be an unintentional behavior change between 0.15 and 0.16, but I think we decided that in the future we will not support 1-D input, so your input shape should be X.shape = (n_samples, 1).
You can do

X = X.reshape(-1, 1)

Otherwise it is ambiguous whether you mean one sample or one feature.
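For anyone hitting this on a current install: here is a minimal sketch of the reshape fix applied to the original repro, using the GaussianMixture API that replaced GMM from scikit-learn 0.18 onward (note `covars_` became `covariances_` there):

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture  # successor to the old GMM class

# same pretend data as the original example
np.random.seed(0)
data = stats.norm.rvs(loc=100, scale=1, size=1000)

# reshape the flat (1000,) array into an explicit (n_samples, 1) column,
# which is what the estimator requires
X = data.reshape(-1, 1)

clf = GaussianMixture(n_components=1)
clf.fit(X)
print(clf.means_, clf.weights_, clf.covariances_)
```

The fitted mean and covariance should come out close to the data's sample mean (~100) and variance (~1), matching the 0.15.2 output above.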

Yes, once I reshape the input data, it works fine. Thank you!

Hi, the above code fixes the error, but the behavior is different, I think. I am running this tutorial code, but after I reshape the input data, I cannot reproduce the plot.

Is there something I can change (parameters, maybe) so that I can have the same result? Thanks!

@imadie

In the tutorial referenced, change the following lines to:

clf = GMM(4, n_iter=500, random_state=3)
x.shape = (x.shape[0], 1)  # reshape to (n_samples, 1)
clf = clf.fit(x)

xpdf = np.linspace(-10, 20, 1000)
xpdf.shape = (xpdf.shape[0], 1)  # same reshape for the evaluation grid
density = np.exp(clf.score(xpdf))
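For what it's worth, on current scikit-learn the equivalent goes through GaussianMixture, where the per-sample log-likelihoods moved from `score()` to `score_samples()` (`score()` now returns the average). A sketch under that assumption, with synthetic two-cluster data standing in for the tutorial's (which isn't shown here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# stand-in data: two well-separated 1-D clusters
rng = np.random.RandomState(3)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1.5, 300)])
X = x.reshape(-1, 1)

clf = GaussianMixture(n_components=2, random_state=3).fit(X)

xpdf = np.linspace(-10, 20, 1000).reshape(-1, 1)
# score_samples() gives the per-sample log-likelihood, like the old GMM.score()
density = np.exp(clf.score_samples(xpdf))
```

Plotting `density` against `xpdf` should then reproduce the tutorial's density curve.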
