Evalml: Support parameterization of data checks; have InvalidTargetDataCheck validate target using problem_type

Created on 15 Jul 2020  ·  6 comments  ·  Source: alteryx/evalml

Fixes #970.

Per the discussion with @freddyaboulton in #929, it would be nice if we could pass extra information along to DataChecks. This would require updating the DataCheck API and considering how it interacts with AutoML, since we do not instantiate DataChecks ourselves and only pass a DataChecks class as a parameter to search().

enhancement


All 6 comments

@angela97lin could you please describe the use-case for this?

@dsherry Sure! In #929, @freddyaboulton and I were discussing how the InvalidTargetDataCheck could be even more useful if it were aware of the type of problem it was handling. For example, if we knew that our problem was binary classification but the input to the data check had more than two classes, we could throw a warning/error. Hence, it'd be nice to be able to pass in parameters, or just more information, to the data check. Unfortunately, this doesn't work well with our current design, where we pass around classes rather than instances.

Alternatively, we could create data check classes for each problem type, such as BinaryClassificationInvalidTargetDataCheck, but this could get pretty hairy too when determining what DefaultDataChecks should include (or should that too be broken down into DefaultBinaryClassificationDataChecks?).

Just discussed with @angela97lin @freddyaboulton

We like the idea of mirroring the pattern we use for component_graph in pipelines:

  • The list of data checks can initially be specified to automl search as a list of DataCheck subclasses (or the same list wrapped in a DataChecks), not instances
  • Once automl search wants to run the data checks, it can create an instance of the DataChecks class
  • At that point we'd pass it a data_check_parameters dict, similar to our pipeline parameters, which contains optional configuration for one or more data checks.
  • If users want to use DataChecks directly they can follow a similar pattern
  • data_check_parameters should default to None so people don't need to create it if it's not required. But if a required arg is missing for a data check (like problem_type for some), that should result in an initialization error

Here's a sketch of how this could look in automl search:

# today this helper standardizes the input to a list of `DataCheck` instances, and wraps that in a `DataChecks` instance
# after this work, this would standardize the input to a `DataChecks` class.
# if `data_checks` is already a `DataChecks` class, do nothing; else if `data_checks` is a list of `DataCheck` classes, define an `AutoMLDataChecks` class to wrap them and return that
data_checks_class = self._validate_data_checks(data_checks)
# next we create the `DataChecks` instance by passing in data checks parameters
data_check_parameters = {'Target Datatype Data Check': {'problem_type': self.problem_type}}
data_checks = data_checks_class(data_check_parameters)
data_check_results = data_checks.validate(X, y)

Direct usage would look similar.
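
For example, a rough sketch of direct usage under the proposed API (the constructor signature and the parameter-dict key below are illustrative only and would be finalized as part of this work):

# instantiate a `DataChecks` class directly, passing data check parameters keyed by data check name
data_check_parameters = {'Invalid Target Data Check': {'problem_type': 'binary'}}
data_checks = DefaultDataChecks(data_check_parameters)
data_check_results = data_checks.validate(X, y)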

Next steps

  • @angela97lin @freddyaboulton others review the above sketch and sanity-check it
  • @angela97lin will file an issue to track adding a TargetDatatype data check (name TBD), based on our discussion in #960. That data check would require a problem_type parameter to be passed in
  • Whoever picks up this issue should also pick up that TargetDatatype issue at the same time and build this! 🛠️ 😁

@dsherry The plan looks good to me! The only thing I would add is that I'd prefer augmenting the already-existing InvalidTargetDataCheck over creating a new data check, but either approach would work for me. Whoever picks this up, please make sure to check that the target only has two unique values when the problem_type is binary. This was mentioned in the review for #929.

if problem_type == "binary" and len(set(y)) != 2:
    # Warn that we do not have two unique values in y
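
A slightly fuller sketch of how that check might look inside InvalidTargetDataCheck.validate, assuming the existing DataCheckWarning class is reused (the message text and exact names here are illustrative, not final):

# inside `InvalidTargetDataCheck.validate(X, y)`; `self.problem_type` would come from the new parameterization
messages = []
if self.problem_type == "binary":
    num_unique = len(set(y))
    if num_unique != 2:
        # warn rather than error, so callers can decide how to react
        messages.append(DataCheckWarning(
            f"Target has {num_unique} unique values, expected 2 for binary classification",
            self.name))
return messages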

You know what, @angela97lin @freddyaboulton let's use this issue to track both a) updating automl and the data checks API to support parameterization and b) updating InvalidTargetDataCheck to validate the target and raise intelligent errors, for all the target types we support.

Mentioning because I just filed a bug #970 and on closer look the issue would be fixed by the above. So this will close #970.

@dsherry How timely! That sounds good to me 😊
