Evalml: Running AutoML on Iris Dataset Fails

Created on 23 Jul 2020  ·  3 comments  ·  Source: alteryx/evalml

Running evalml 0.11.2. It looks like the option to set data checks to False has been removed from AutoMLSearch, which previously served as a workaround for this issue.
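For reference, the setup looks roughly like the following sketch (the import and CSV-loading step are illustrative; the AutoMLSearch call is the one shown in the traceback below):

import pandas as pd
from evalml.automl import AutoMLSearch

# Iris with the string-valued "class" column as the target
# (the dtypes for this frame are listed further down in this report)
df = pd.read_csv("iris.csv")  # illustrative file name
X = df.drop(columns=["class"])
y = df["class"]  # object dtype

automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")
automl.search(X, y)  # raises the TypeError below on 0.11.2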


TypeError Traceback (most recent call last)
<ipython-input-...> in <module>
1 automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")
2
----> 3 automl.search(X, y)

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\automl\automl_search.py in search(self, X, y, data_checks, feature_types, raise_errors, show_iteration_plot)
316
317 data_checks = self._validate_data_checks(data_checks)
--> 318 data_check_results = data_checks.validate(X, y)
319
320 if len(data_check_results) > 0:

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\data_checks.py in validate(self, X, y)
33 messages = []
34 for data_check in self.data_checks:
---> 35 messages_new = data_check.validate(X, y)
36 messages.extend(messages_new)
37 return messages

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\label_leakage_data_check.py in validate(self, X, y)
53 if len(X.columns) == 0:
54 return []
---> 55 corrs = {label: abs(y.corr(col)) for label, col in X.iteritems() if abs(y.corr(col)) >= self.pct_corr_threshold}
56
57 highly_corr_cols = {key: value for key, value in corrs.items() if value >= self.pct_corr_threshold}

~\.conda\envs\evalml_test_1.0\lib\site-packages\evalml\data_checks\label_leakage_data_check.py in <dictcomp>(.0)
53 if len(X.columns) == 0:
54 return []
---> 55 corrs = {label: abs(y.corr(col)) for label, col in X.iteritems() if abs(y.corr(col)) >= self.pct_corr_threshold}
56
57 highly_corr_cols = {key: value for key, value in corrs.items() if value >= self.pct_corr_threshold}

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\series.py in corr(self, other, method, min_periods)
2252 if method in ["pearson", "spearman", "kendall"] or callable(method):
2253 return nanops.nancorr(
-> 2254 this.values, other.values, method=method, min_periods=min_periods
2255 )
2256

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in _f(*args, **kwargs)
67 try:
68 with np.errstate(invalid="ignore"):
---> 69 return f(*args, **kwargs)
70 except ValueError as e:
71 # we want to transform an object array

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in nancorr(a, b, method, min_periods)
1238
1239 f = get_corr_func(method)
-> 1240 return f(a, b)
1241
1242

~\.conda\envs\evalml_test_1.0\lib\site-packages\pandas\core\nanops.py in _pearson(a, b)
1254
1255 def _pearson(a, b):
-> 1256 return np.corrcoef(a, b)[0, 1]
1257
1258 def _kendall(a, b):

<__array_function__ internals> in corrcoef(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in corrcoef(x, y, rowvar, bias, ddof)
2524 warnings.warn('bias and ddof have no effect and are deprecated',
2525 DeprecationWarning, stacklevel=3)
-> 2526 c = cov(x, y, rowvar)
2527 try:
2528 d = diag(c)

<__array_function__ internals> in cov(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights)
2429 w *= aweights
2430
-> 2431 avg, w_sum = average(X, axis=1, weights=w, returned=True)
2432 w_sum = w_sum[0]
2433

<__array_function__ internals> in average(*args, **kwargs)

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\lib\function_base.py in average(a, axis, weights, returned)
391
392 if weights is None:
--> 393 avg = a.mean(axis)
394 scl = avg.dtype.type(a.size/avg.size)
395 else:

~\.conda\envs\evalml_test_1.0\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
152 if isinstance(ret, mu.ndarray):
153 ret = um.true_divide(
--> 154 ret, rcount, out=ret, casting='unsafe', subok=False)
155 if is_float16_result and out is None:
156 ret = arr.dtype.type(ret)

TypeError: unsupported operand type(s) for /: 'str' and 'int'
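The failure boils down to the label leakage data check calling y.corr(col) with the string-valued target. A minimal sketch of just that pandas call (illustrative values, with the pandas version from the traceback above) hits the same error:

import pandas as pd

# object-dtype target, like the Iris "class" column
y = pd.Series(["setosa", "versicolor", "virginica"])
col = pd.Series([5.1, 7.0, 6.3])

# Pearson correlation ends up averaging an object array inside numpy;
# dividing a concatenated str by an int raises the TypeError above
y.corr(col)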

When run from the tool, the behavior is slightly different: the search executes instead of failing with a stack trace, but all pipeline scores come back as nan.

Optimizing for Log Loss Multiclass.
Lower score is better.

Searching up to 4 pipelines.
Allowed model families: random_forest, xgboost, linear_model, catboost

(1/4) Mode Baseline Multiclass Classificati... Elapsed:00:00
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(2/4) CatBoost Classifier w/ Simple Imputer Elapsed:00:00
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(3/4) XGBoost Classifier w/ Simple Imputer Elapsed:00:02
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan
(4/4) Random Forest Classifier w/ Simple Im... Elapsed:00:02
Starting cross validation
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Error in PipelineBase.score while scoring objective Log Loss Multiclass: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Finished cross validation - mean Log Loss Multiclass: nan

Search finished after 00:02
Best pipeline: Mode Baseline Multiclass Classification Pipeline
Best pipeline Log Loss Multiclass: nan
ToolId 3: AutoML tool done
Finished in 14.397 seconds

The pandas data types are the same in both environments.

sepal.length float64
sepal.width float64
petal.length float64
petal.width float64
class object
dtype: object

The Jupyter notebook is using Python 3.7.3 and the tool is using Python 3.6.8.

bug

All 3 comments

@SydneyAyx: yep, we changed the mechanism for disabling data checks in 0.11.2:

automl.search(..., data_checks=None, ...)
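Against the code in the traceback above, that would look something like this (a sketch, not run here):

automl = AutoMLSearch(objective="log_loss_multi", max_pipelines=5, problem_type="multiclass")
# data_checks=None skips the data checks, including the label leakage
# check whose y.corr(col) call raises the TypeError above
automl.search(X, y, data_checks=None)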

Making a note that we should add that to the user guide section.

Please give that a shot, and if it still doesn't fix your issue, let's talk again.

If that does fix the issue: I remember #828 was previously filed to track this, and we closed it in favor of #645, which is currently in progress. However, I'm not sure #645 will actually fix the underlying problem, so let's keep this issue open.

Ah, I got confused about the timeline: #932 was merged last week and fixes this issue! I just ran the reproducer I wrote in #828 to confirm. The next release (0.12.0, out next Tuesday) will include the fix.

I'll keep this open and close it when we put that release out.

Fixed in v0.12.0, which just went out!
