Fixes #970.
Per the discussion with @freddyaboulton in #929, it would be nice if we could pass along extra information to DataChecks. This would require updating the DataCheck API and considering how it interacts with AutoML, since we do not instantiate DataChecks ourselves and only pass a DataChecks class as a parameter to search().
@angela97lin could you please describe the use-case for this?
@dsherry Sure! In #929, @freddyaboulton and I were discussing how the InvalidTargetDataCheck could be even more useful if it were aware of the type of problem it was handling. For example, if we knew that our problem was binary classification but the input to the data check had more than two classes, we could throw a warning/error. Hence, it'd be nice to be able to pass parameters, or just more information, to the data check. Unfortunately, this doesn't work well with our current design, where we pass around classes and not instances.
Alternatively, we could create data check classes for each problem type, such as BinaryClassificationInvalidTargetDataCheck, but this could get pretty hairy too when determining what DefaultDataChecks should include (or should this too be broken down into DefaultBinaryClassificationDataChecks?).
Just discussed with @angela97lin @freddyaboulton
We like the idea of mirroring the pattern we use for component_graph in pipelines:
- AutoML search takes a list of DataCheck subclasses (or the same but inside a DataChecks class), not instances.
- That input is standardized to a DataChecks class.
- A data_check_parameters dict, similar to our pipeline parameters, contains optional configuration for one or more data checks.
- Anyone using DataChecks directly can follow a similar pattern.
- data_check_parameters should default to None so people don't need to create it if it's not required. But if a required arg is missing from a data check (like problem_type for some), that should result in an initialization error.
Here's a sketch of how this could look in automl search:
# Today this helper standardizes the input to a list of `DataCheck` instances and wraps that in a `DataChecks` instance.
# After this work, it would standardize the input to a `DataChecks` class instead:
# if `data_checks` is already a `DataChecks` class, do nothing; else if `data_checks` is a list of `DataCheck` classes,
# define an `AutoMLDataChecks` class to wrap them and return that.
data_checks_class = self._validate_data_checks(data_checks)
# next we create the `DataChecks` instance by passing in data checks parameters
data_check_parameters = {'Target Datatype Data Check': {'problem_type': self.problem_type}}
data_checks = data_checks_class(data_check_parameters)
data_check_results = data_checks.validate(X, y)
Direct usage would look similar.
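To make the proposal concrete, here is a minimal, self-contained sketch of the pattern described above. None of this is the actual evalml implementation: the class bodies, the "Invalid Target Data Check" name key, and the choice of ValueError for the initialization error are all assumptions made for illustration.

```python
class DataCheck:
    """Hypothetical base class: subclasses set a name and implement validate()."""
    name = "Data Check"

    def validate(self, X, y=None):
        raise NotImplementedError


class InvalidTargetDataCheck(DataCheck):
    """Hypothetical parameterized data check (not the real evalml class)."""
    name = "Invalid Target Data Check"

    def __init__(self, problem_type):
        # problem_type has no default, so omitting it fails at construction time
        self.problem_type = problem_type

    def validate(self, X, y=None):
        messages = []
        if self.problem_type == "binary" and len(set(y)) != 2:
            messages.append("Binary target must have exactly two unique values")
        return messages


class DataChecks:
    """Hypothetical collection that holds DataCheck classes, not instances."""
    data_check_classes = []

    def __init__(self, data_check_parameters=None):
        # Defaults to None so callers only build the dict when a check needs it
        params = data_check_parameters or {}
        self.data_checks = []
        for cls in self.data_check_classes:
            try:
                self.data_checks.append(cls(**params.get(cls.name, {})))
            except TypeError as e:
                # A missing required argument surfaces as an initialization error
                raise ValueError(f"Could not initialize {cls.name}: {e}") from e

    def validate(self, X, y=None):
        messages = []
        for check in self.data_checks:
            messages.extend(check.validate(X, y))
        return messages


class AutoMLDataChecks(DataChecks):
    data_check_classes = [InvalidTargetDataCheck]


# Direct usage: pass parameters keyed by data check name, then validate
checks = AutoMLDataChecks({"Invalid Target Data Check": {"problem_type": "binary"}})
results = checks.validate(X=[[1], [2], [3]], y=[0, 1, 2])
```

In this sketch, calling AutoMLDataChecks() with no parameters raises, because InvalidTargetDataCheck cannot be constructed without problem_type. That is the initialization-error behavior the bullets above ask for.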
Next steps
- A TargetDatatype data check (name TBD), based on our discussion on #960. That data check would require a problem_type parameter to be passed in.
- Pick up the TargetDatatype issue at the same time and build this! 🛠️ 😁
@dsherry The plan looks good to me! The only thing I would add is that I prefer to augment the already existing InvalidTargetDataCheck over creating a new data check, but either approach would work for me. Whoever picks this up, please make sure to check that the target only has two unique values when the problem_type is binary. This was mentioned in the review for #929.
import warnings
if problem_type == "binary" and len(set(y)) != 2:
    # Warn that we do not have two unique values in y
    warnings.warn("Binary target does not have exactly two unique values")
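As a runnable illustration of that check, here is a small sketch using only the standard library. The helper name check_binary_target and the warning text are hypothetical, and a real data check would probably return structured warnings or errors rather than call warnings.warn.

```python
import warnings


def check_binary_target(y, problem_type):
    """Warn when a binary problem's target lacks exactly two unique values.

    Hypothetical helper for illustration; returns the unique values it found.
    """
    unique_values = set(y)
    if problem_type == "binary" and len(unique_values) != 2:
        warnings.warn(
            f"Binary problem expects 2 unique target values, found {len(unique_values)}"
        )
    return unique_values


# Record warnings so the behavior can be inspected instead of printed to stderr
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_binary_target([0, 1, 2, 1], "binary")  # three classes: warns
    check_binary_target([0, 1, 1, 0], "binary")  # two classes: silent
```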
You know what, @angela97lin @freddyaboulton let's use this issue to track both a) updating automl and the data checks API to support parameterization and b) updating InvalidTargetDataCheck to validate the target and raise intelligent errors, for all the target types we support.
Mentioning because I just filed a bug #970 and on closer look the issue would be fixed by the above. So this will close #970.
@dsherry How timely! That sounds good to me 😊