Evalml: Imputer cannot fit when a categorical or boolean column contains None

Created on Aug 19, 2020 · 3 comments · Source: alteryx/evalml

Repro

import pandas as pd
from evalml.pipelines.components import Imputer

df = pd.DataFrame({"a": [1, 2, 3], "b": ["1", "2", None]})
imputer = Imputer()
imputer.fit(df)  # raises TypeError

import pandas as pd
from evalml.pipelines.components import Imputer

df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
imputer = Imputer()
imputer.fit(df_with_bool)  # raises TypeError

Both produce the same stack trace.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-69-9af4cfc17aec> in <module>
      1 df_with_bool = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})
      2 imputer = Imputer()
----> 3 imputer.fit(df_with_bool)

~/sources/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/sources/evalml/evalml/pipelines/components/transformers/imputers/imputer.py in fit(self, X, y)
     76         X_categorical = X_null_dropped.select_dtypes(include=categorical_dtypes + boolean)
     77         if len(X_categorical.columns) > 0:
---> 78             self._categorical_imputer.fit(X_categorical, y)
     79             self._categorical_cols = X_categorical.columns
     80         return self

~/sources/evalml/evalml/utils/base_meta.py in _set_fit(self, X, y)
     12         @wraps(method)
     13         def _set_fit(self, X, y=None):
---> 14             return_value = method(self, X, y)
     15             self._is_fitted = True
     16             return return_value

~/sources/evalml/evalml/pipelines/components/transformers/imputers/simple_imputer.py in fit(self, X, y)
     42         if not isinstance(X, pd.DataFrame):
     43             X = pd.DataFrame(X)
---> 44         self._component_obj.fit(X, y)
     45         self._all_null_cols = set(X.columns) - set(X.dropna(axis=1, how='all').columns)
     46         return self

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in fit(self, X, y)
    300                                                     fill_value)
    301         else:
--> 302             self.statistics_ = self._dense_fit(X,
    303                                                self.strategy,
    304                                                self.missing_values,

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in _dense_fit(self, X, strategy, missing_values, fill_value)
    384                 row_mask = np.logical_not(row_mask).astype(np.bool)
    385                 row = row[row_mask]
--> 386                 most_frequent[i] = _most_frequent(row, np.nan, 0)
    387 
    388             return most_frequent

~/miniconda3/envs/evalml/lib/python3.8/site-packages/sklearn/impute/_base.py in _most_frequent(array, extra_value, n_repeat)
     40             # has already been NaN-masked.
     41             warnings.simplefilter("ignore", RuntimeWarning)
---> 42             mode = stats.mode(array)
     43 
     44         most_frequent_value = mode[0][0]

~/miniconda3/envs/evalml/lib/python3.8/site-packages/scipy/stats/stats.py in mode(a, axis, nan_policy)
    498     counts = np.zeros(a_view.shape[:-1], dtype=np.int)
    499     for ind in inds:
--> 500         modes[ind], counts[ind] = _mode1D(a_view[ind])
    501     newshape = list(a.shape)
    502     newshape[axis] = 1

~/miniconda3/envs/evalml/lib/python3.8/site-packages/scipy/stats/stats.py in _mode1D(a)
    485 
    486     def _mode1D(a):
--> 487         vals, cnts = np.unique(a, return_counts=True)
    488         return vals[cnts.argmax()], cnts.max()
    489 

<__array_function__ internals> in unique(*args, **kwargs)

~/miniconda3/envs/evalml/lib/python3.8/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    259     ar = np.asanyarray(ar)
    260     if axis is None:
--> 261         ret = _unique1d(ar, return_index, return_inverse, return_counts)
    262         return _unpack_tuple(ret)
    263 

~/miniconda3/envs/evalml/lib/python3.8/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
    320         aux = ar[perm]
    321     else:
--> 322         ar.sort()
    323         aux = ar
    324     mask = np.empty(aux.shape, dtype=np.bool_)

TypeError: '<' not supported between instances of 'NoneType' and 'bool'

It works if the missing value is np.nan instead of None.
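
The traceback ends in `ar.sort()` inside `np.unique`, so the failure can be reproduced without evalml or scikit-learn. The following is a minimal sketch assuming only NumPy; `can_sort` is a hypothetical helper name used for illustration, not part of any library.

```python
import numpy as np


def can_sort(values):
    """Return True if NumPy can sort the object array, False on TypeError."""
    try:
        np.sort(np.asarray(values, dtype=object))
    except TypeError:
        # Raised when two elements cannot be ordered with "<", e.g. None and bool.
        return False
    return True


print(can_sort([True, False, None]))    # None cannot be compared with bool
print(can_sort([True, False, np.nan]))  # NaN is a float, so "<" is defined
```

This suggests why swapping None for np.nan avoids the error: the comparison itself is what fails, not the presence of a missing value.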

bug


freddyaboulton

All 3 comments

@freddyaboulton Thanks for the clear reproducer! This also appears to explain another bug, #1092.

Problem
If a feature in a pandas DataFrame has object dtype and contains None values, Imputer fails.

X = pd.DataFrame({'feature1': [False, True, None, np.nan]}) creates a feature of object dtype. Imputer.fit fails.
X = pd.DataFrame({'feature1': [False, True, np.nan]}) creates a feature of object dtype. Imputer.fit works.
X = pd.DataFrame({'feature1': [False, True]}) creates a feature of bool dtype. Imputer.fit works.

The same is true for the category dtype. The last case doesn't apply there, but a similar situation occurs for string types.
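
The dtypes claimed for the three boolean cases above can be checked directly. This is a small sketch assuming only pandas and NumPy; the `cases` and `dtypes` names are illustrative.

```python
import numpy as np
import pandas as pd

cases = {
    "None and NaN": [False, True, None, np.nan],
    "NaN only": [False, True, np.nan],
    "no missing values": [False, True],
}

# Map each case label to the dtype pandas infers for the column.
dtypes = {
    label: pd.DataFrame({"feature1": values})["feature1"].dtype
    for label, values in cases.items()
}

for label, dtype in dtypes.items():
    print(f"{label}: {dtype}")
```

Note that the first two cases share the same dtype, so the dtype alone does not distinguish the failing input from the working one.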

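Notes
The confusing part here is that None can mean different things. It could be equivalent to nan, or it could be intended as a category of its own.

I think treating it as nan is fine, as long as we document and explain that convention.

Workaround
Clean None out of bool / category / string features: df = df.fillna(value=np.nan)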
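
As a rough illustration of the workaround, the sketch below applies `fillna` to the boolean repro frame and inspects the result. It assumes only pandas and NumPy and does not call evalml; the `cleaned` and `missing` names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None]})

# The workaround from the thread: fill missing entries (including None) with NaN.
cleaned = df.fillna(value=np.nan)

missing = cleaned["b"].iloc[2]
print(type(missing).__name__)             # type of the value that replaced None
print(np.sort(cleaned["b"].to_numpy()))   # sorting the column no longer raises
```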

Fix
Short-term:

Update Imputer to replace None with np.nan
Update the Imputer API docs and the automl user guide to mention this
Add test coverage for Imputer with None included in the data, for all intended data types

Alternatively, we could add a DataCheck that raises an error if the data contains None. But that feels unnecessary, since None can easily be converted.
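
If a check were preferred over silent conversion, it might look for something like the following. This is not evalml's DataCheck API; `columns_with_none` is a hypothetical standalone helper, shown only to illustrate what such a check would detect.

```python
import numpy as np
import pandas as pd


def columns_with_none(df):
    """Return the names of object-dtype columns that contain a literal None."""
    flagged = []
    for name in df.columns:
        column = df[name]
        # This sketch only scans object-dtype columns, where None is stored as-is.
        if column.dtype == object and any(value is None for value in column):
            flagged.append(name)
    return flagged


df = pd.DataFrame({"a": [1, 2, 3], "b": [True, False, None], "c": [1.0, np.nan, 3.0]})
print(columns_with_none(df))
```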

Long-term:
Once we update evalml to use the new DataTable data structure, users will be able to configure the type of each feature up front. Hopefully that standardization will make this kind of error moot.

dsherry on Aug 27, 2020

Is this related to #540?

angela97lin on Aug 27, 2020

👀1 🚀1

@angela97lin 🤦 100% related... lol. We decided that the imputer should convert None values to np.nan.

Closing #540 in favor of this issue, since the discussion here is more recent.

Thanks!

dsherry on Aug 27, 2020
