Evalml: Imputer cannot fit when there is None in a categorical or boolean column

Created on 19 Aug 2020 · 3 Comments · Source: alteryx/evalml

Reproducer

import pandas as pd
from evalml.pipelines.components import Imputer
df = pd.DataFrame({"a": [1, 2, 3], "b": ["1", "2", None]})
imputer = Imputer()
imputer.fit(df)

import pandas as pd
from evalml.pipelines.components import Imputer
df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
imputer = Imputer()
imputer.fit(df_with_bool)

Both reproducers fail with the same stack trace:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-9af4cfc17aec> in <module>
      1 df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
      2 imputer = Imputer()
----> 3 imputer.fit(df_with_bool)

~/sources/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/sources/evalml/evalml/pipelines/components/transformers/imputers/imputer.py in fit(self, X, y)
     76         X_categorical = X_null_dropped.select_dtypes(include=categorical_dtypes + boolean)
     77         if len(X_categorical.columns) > 0:
---> 78             self._categorical_imputer.fit(X_categorical, y)
     79             self._categorical_cols = X_categorical.columns
     80         return self

~/sources/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/sources/evalml/evalml/pipelines/components/transformers/imputers/simple_imputer.py in fit(self, X, y)
     42         if not isinstance(X, pd.DataFrame):
     43             X = pd.DataFrame(X)
---> 44         self._component_obj.fit(X, y)
     45         self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns)
     46         return self

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in fit(self, X, y)
    300                                                     fill_value)
    301         else:
--> 302             self.statistics_ = self._dense_fit(X,
    303                                                self.strategy,
    304                                                self.missing_values,

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in _dense_fit(self, X, strategy, missing_values, fill_value)
    384                 row_mask = np.logical_not(row_mask).astype(np.bool)
    385                 row = row[row_mask]
--> 386                 most_frequent[i] = _most_frequent(row, np.nan, 0)
    387 
    388             return most_frequent

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in _most_frequent(array, extra_value, n_repeat)
     40             # has already been NaN-masked.
     41             warnings.simplefilter("ignore", RuntimeWarning)
---> 42             mode = stats.mode(array)
     43 
     44         most_frequent_value = mode[0][0]

~/miniconda3/envs/evalml/lib/python3.8/site-packages/scipy/stats/stats.py in mode(a, axis, nan_policy)
    498     counts = np.zeros(a_view.shape[:-1], dtype=np.int)
    499     for ind in inds:
--> 500         modes[ind], counts[ind] = _mode1D(a_view[ind])
    501     newshape = list(a.shape)
    502     newshape[axis] = 1

~/miniconda3/envs/evalml/lib/python3.8/site-packages/scipy/stats/stats.py in _mode1D(a)
    485 
    486     def _mode1D(a):
--> 487         vals, cnts = np.unique(a, return_counts=True)
    488         return vals[cnts.argmax()], cnts.max()
    489 

<__array_function__ internals> in unique(*args, **kwargs)

~/miniconda3/envs/evalml/lib/python3.8/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    259     ar = np.asanyarray(ar)
    260     if axis is None:
--> 261         ret = _unique1d(ar, return_index, return_inverse, return_counts)
    262         return _unpack_tuple(ret)
    263 

~/miniconda3/envs/evalml/lib/python3.8/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
    320         aux = ar[perm]
    321     else:
--> 322         ar.sort()
    323         aux = ar
    324     mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'NoneType' and 'bool'

Fitting works when the missing value is np.nan instead of None.
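
The TypeError at the end of the trace is raised while sorting values that mix None with bool. A minimal sketch of that comparison rule in plain Python follows. It does not use evalml, sklearn, or NumPy, so it illustrates the ordering behaviour rather than the exact library code path, and `can_sort` is a hypothetical helper written only for this illustration:

```python
def can_sort(values):
    """Return True if the values can be ordered with '<', False on TypeError."""
    try:
        sorted(values)
        return True
    except TypeError:
        return False


# None has no ordering relative to bool, so sorting fails.
with_none = [True, False, None]
# float("nan") can be compared with bool (the result is just False), so sorting succeeds.
with_nan = [True, False, float("nan")]

print(can_sort(with_none))
print(can_sort(with_nan))
```

This matches the observation above: a float NaN can be compared with the other values, while None cannot.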

bug

freddyaboulton

All 3 comments

@freddyaboulton thanks for the clear reproducer! It appears this explains another bug #1092 as well.

Problem
If any feature in the pandas dataframe has object type and contains a None value, our Imputer fails.

X = pd.DataFrame({'feature1': [False, True, None, np.nan]}) creates a feature with object type. Imputer.fit fails.
X = pd.DataFrame({'feature1': [False, True, np.nan]}) creates a feature with object type. Imputer.fit works.
X = pd.DataFrame({'feature1': [False, True]}) creates a feature with bool type. Imputer.fit works.

The same is true for category type. A similar situation happens for string types, although the last case doesn't apply.
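
The dtype claims for the three boolean cases can be checked directly. This is a small sketch assuming pandas and NumPy are installed. The dictionary keys are labels made up for this example, and only the column dtype is inspected (Imputer itself is not called):

```python
import numpy as np
import pandas as pd

# The three frames described in the Problem section.
frames = {
    "none_and_nan": pd.DataFrame({"feature1": [False, True, None, np.nan]}),
    "nan_only": pd.DataFrame({"feature1": [False, True, np.nan]}),
    "no_missing": pd.DataFrame({"feature1": [False, True]}),
}

# Collect the dtype pandas infers for the column in each frame.
dtypes = {name: str(df["feature1"].dtype) for name, df in frames.items()}
for name, dtype in dtypes.items():
    print(f"{name}: {dtype}")
```

Only the frame without missing values keeps the bool dtype. Both frames with a missing value fall back to object, so the dtype alone does not distinguish the failing case from the working one.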

Notes
The confusing thing here is that None can mean different things. It could be the same as nan, or it could be intended as its own category.

I think it's fine to treat it as nan as long as we document and explain that convention.

Workaround
Clean None out of bool/category/string features: df = df.fillna(value=np.nan)

Fix
Short-term:

Update Imputer to replace None with np.nan
Update Imputer API doc and automl user guide to mention this.
Add test coverage of Imputer with the inclusion of None in the data, for all intended datatypes.

We could instead add a DataCheck which errors if there are Nones in the data. But this feels unnecessary since Nones can be easily converted.
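
As a rough illustration of the short-term fix, the None-to-np.nan conversion could look like the sketch below. `none_to_nan` is a hypothetical helper, not evalml's actual implementation, and the example assumes only pandas and NumPy:

```python
import numpy as np
import pandas as pd


def none_to_nan(df):
    """Return a new DataFrame with every None replaced by np.nan (hypothetical helper)."""
    # Map each column element-wise; values other than None pass through unchanged.
    return df.apply(lambda col: col.map(lambda v: np.nan if v is None else v))


df = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
cleaned = none_to_nan(df)

# After cleaning, the missing entry is still missing but is no longer None.
has_none = any(v is None for v in cleaned["b"])
n_missing = int(cleaned["b"].isna().sum())
print(has_none, n_missing)
```

Applying the conversion inside Imputer, rather than asking users to do it, fits the convention proposed in the Notes: None is treated as a missing value, not as its own category.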

Long-term:
Once we update evalml to use the new DataTable data structure, users will be able to configure the type of each feature ahead of time. I hope that standardization will make these sorts of errors irrelevant.

dsherry on 27 Aug 2020

Is this related to #540?

angela97lin on 27 Aug 2020

@angela97lin 🤦 100% related... in fact it's a dup. Haha. We even decided there to have the imputer convert Nones to np.nans.

Closing #540 in favor of this because the writeups here are more up-to-date.

Thank you!

dsherry on 27 Aug 2020
