Evalml: Support parameterization of data checks; have InvalidTargetDataCheck validate target using problem_type

Created on 15 Jul 2020  ·  6 comments  ·  Source: alteryx/evalml

Fixes #970.

Per the discussion with @freddyaboulton in #929, it would be nice if we could pass extra information along to DataChecks. This would require updating the DataCheck API and considering how it interacts with AutoML, since we do not instantiate DataChecks ourselves and only pass a DataChecks class as a parameter to search().

enhancement


All 6 comments

@angela97lin could you please describe the use-case for this?

@dsherry Sure! In #929, @freddyaboulton and I were discussing how the InvalidTargetDataCheck could be even more useful if it were aware of the type of problem it was handling. For example, if we knew that our problem was binary classification but the input to the data check had more than two classes, we could throw a warning/error. Hence, it'd be nice to be able to pass in parameters, or just more information, to the data check. Unfortunately, this doesn't work well with our current design, where we pass around classes rather than instances.

Alternatively, we could create data check classes for each problem type, such as BinaryClassificationInvalidTargetDataCheck, but this could get pretty hairy too when determining what DefaultDataChecks should include (or should that too be broken down into DefaultBinaryClassificationDataChecks?).

Just discussed with @angela97lin @freddyaboulton

We like the idea of mirroring the pattern we use for component_graph in pipelines:

  • The list of data checks can initially be specified to automl search as a list of DataCheck subclasses (or the same list wrapped in a DataChecks), not instances
  • Once automl search wants to run the data checks, it can create an instance of the DataChecks class
  • At that point we'd pass it a data_check_parameters dict, similar to our pipeline parameters, which contains optional configuration for one or more data checks.
  • If users want to use DataChecks directly they can follow a similar pattern
  • data_check_parameters should default to None so people don't need to create it if it's not required. But if a required arg is missing for a data check (like problem_type for some), that should result in an initialization error

Here's a sketch of how this could look in automl search:

# today this helper standardizes the input to a list of `DataCheck` instances, and wraps that in a `DataChecks` instance
# after this work, this would standardize the input to a `DataChecks` class.
# if `data_checks` is already a `DataChecks` class, do nothing; else if `data_checks` is a list of `DataCheck` classes, define an `AutoMLDataChecks` class to wrap them and return that
data_checks_class = self._validate_data_checks(data_checks)
# next we create the `DataChecks` instance by passing in data checks parameters
data_check_parameters = {'Target Datatype Data Check': {'problem_type': self.problem_type}}
data_checks = data_checks_class(data_check_parameters)
data_check_results = data_checks.validate(X, y)

Direct usage would look similar.
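
For example, a rough sketch of direct usage under the proposed API (the constructor signature and the parameter-dict key below are illustrative only and would be finalized as part of this work):

# instantiate a `DataChecks` class directly, passing data check parameters keyed by data check name
data_check_parameters = {'Invalid Target Data Check': {'problem_type': 'binary'}}
data_checks = DefaultDataChecks(data_check_parameters)
data_check_results = data_checks.validate(X, y)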

Next steps

  • @angela97lin @freddyaboulton others review the above sketch and sanity-check it
  • @angela97lin will file an issue to track adding a TargetDatatype data check (name TBD), based on our discussion in #960. That data check would require a problem_type parameter to be passed in
  • Whoever picks up this issue should also pick up that TargetDatatype issue at the same time and build this! 🛠️ 😁

@dsherry The plan looks good to me! The only thing I would add is that I'd prefer augmenting the already-existing InvalidTargetDataCheck over creating a new data check, but either approach would work for me. Whoever picks this up, please make sure to check that the target only has two unique values when the problem_type is binary. This was mentioned in the review for #929.

if problem_type == "binary" and len(set(y)) != 2:
    # Warn that we do not have two unique values in y
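
A slightly fuller sketch of how that check might look inside InvalidTargetDataCheck.validate, assuming the existing DataCheckWarning class is reused (the message text and exact names here are illustrative, not final):

# inside `InvalidTargetDataCheck.validate(X, y)`; `self.problem_type` would come from the new parameterization
messages = []
if self.problem_type == "binary":
    num_unique = len(set(y))
    if num_unique != 2:
        # warn rather than error, so callers can decide how to react
        messages.append(DataCheckWarning(
            f"Target has {num_unique} unique values, expected 2 for binary classification",
            self.name))
return messages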

You know what, @angela97lin @freddyaboulton let's use this issue to track both a) updating automl and the data checks API to support parameterization and b) updating InvalidTargetDataCheck to validate the target and raise intelligent errors, for all the target types we support.

Mentioning because I just filed a bug #970 and on closer look the issue would be fixed by the above. So this will close #970.

@dsherry How timely! That sounds good to me 😊
