Evalml: Data checks: JSON-friendly message fmt, include a type enum and affected column names

Created on 12 Nov 2020  ·  9 Comments  ·  Source: alteryx/evalml

This will make it easy to programmatically identify which feature or features have a particular error or warning.

enhancement

All 9 comments

Slight expansion: to enable other programmatic formatting & internationalization, it would be nice if EvalML returned the un-formatted information in addition to the message that it returns right now. Something like:

{
  "message": "Warning: too many null values present in column 'foobar'",
  "check": "NullCheck",
  "code": "TOO_MANY_NULLS",
  "detail": {
    "level": "warning",
    "columns": ["foobar"]
  }
}

This would make it possible to perform language-independent formatting on the results by tying the code back to internationalization strings. Having metadata available in the detail also allows that information to be optionally formatted back into the presented messages in a structured fashion.

Additionally, separating the check from the actual code that is returned would let us design checks that can have sets of differing recommendations/errors/warnings being produced from a single check instance.
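
A minimal sketch of the code-to-string lookup described above. The `TEMPLATES` table, the `render` helper, and the German string are assumptions made for illustration; none of them is part of EvalML. The result dict follows the example in this comment.

```python
# Hypothetical translation table keyed by locale, then by message code.
TEMPLATES = {
    "en": {"TOO_MANY_NULLS": "Too many null values present in column(s): {columns}"},
    "de": {"TOO_MANY_NULLS": "Zu viele fehlende Werte in Spalte(n): {columns}"},
}


def render(result, locale="en"):
    """Format a structured check result, falling back to its built-in message."""
    template = TEMPLATES.get(locale, {}).get(result["code"])
    if template is None:
        # No translation for this locale/code pair: use the message as returned.
        return result["message"]
    columns = ", ".join(result["detail"]["columns"])
    return template.format(columns=columns)


result = {
    "message": "Warning: too many null values present in column 'foobar'",
    "check": "NullCheck",
    "code": "TOO_MANY_NULLS",
    "detail": {"level": "warning", "columns": ["foobar"]},
}

print(render(result, "de"))
```

Because the lookup is keyed on `code` rather than on the message text, the English message can change without breaking consumers, which is why the code needs to be stable.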

@BoopBoopBeepBoop thanks. I like that proposal.

Would you expect "code" to be an enum, or just a string defined somewhere inside the data check?

One tweak I may suggest is to keep the structure flat:

{
  "message": "Warning: too many null values present in column 'foobar'",
  "code": "TOO_MANY_NULLS",
  "level": "warning",
  "columns": ["foobar"]
}

Here we're talking in terms of JSON, but setting that aside I think each of these fields should be attached to the DataCheckWarning/DataCheckError object.

That's fair - I had added those elements there as in my head detail was "essentially a map". However, those aren't great examples, because they're probably present for all checks ...

Would there be any situations where our messages should include additional metadata that is essentially present on a "per-message" basis? Examples I've seen of this before would be calling out specific values as examples, formatting dynamic lower/upper bounds in messages, or similar. That is usually where you benefit from having a detail (or pick a different name) map.

Would you expect "code" to be an enum, or just a string defined somewhere inside the data check?

I would just expect it to be stable - where it lives is 🤷

Would you expect "code" to be an enum, or just a string defined somewhere inside the data check?

I would just expect it to be stable - where it lives is 🤷

If it's all the same, I think I'd prefer an enum. We're currently defining our own strings to determine which checks to run as part of the job, and it would be nice to have an enum consistent with EvalML to leverage there instead of creating our own.

Alongside this and maybe toward what @BoopBoopBeepBoop was mentioning with the detail, it might be helpful to know which type of error was thrown. For example, in the ID column check, EvalML does a check on whether the column name has "id" in it or if the values are N% unique. We may want to filter out the former or represent them differently, so this level of granularity may be useful, either through more granular enums or in something similar to a detail structure described above.
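
To illustrate the "more granular enums" option, here is a rough sketch. The `IDColumnsCode` enum, its member names, the `without_codes` helper, and the sample warnings are all hypothetical; the thread does not say how EvalML would name or split these codes.

```python
from enum import Enum


class IDColumnsCode(Enum):
    """Hypothetical sub-codes for the two ways the ID check can fire."""
    NAME_CONTAINS_ID = "name_contains_id"
    VALUES_MOSTLY_UNIQUE = "values_mostly_unique"


# Made-up warnings, shaped like the flat format discussed above.
id_warnings = [
    {"message": "Column 'user_id' looks like an ID column by name",
     "code": IDColumnsCode.NAME_CONTAINS_ID.name,
     "level": "warning",
     "columns": ["user_id"]},
    {"message": "Column 'session' has almost entirely unique values",
     "code": IDColumnsCode.VALUES_MOSTLY_UNIQUE.name,
     "level": "warning",
     "columns": ["session"]},
]


def without_codes(messages, ignored_codes):
    """Return the messages whose code is not among ignored_codes."""
    ignored = {code.name for code in ignored_codes}
    return [m for m in messages if m["code"] not in ignored]


# Filter out the name-based matches and keep the uniqueness-based ones.
kept = without_codes(id_warnings, [IDColumnsCode.NAME_CONTAINS_ID])
print([m["columns"] for m in kept])
```

The same filtering could be done with a single code plus a sub-type field inside a detail map; the enum variant is shown only because it keeps the consumer-side check to one comparison.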

Just discussed with @dsherry @freddyaboulton: I had put up https://github.com/alteryx/evalml/pull/1444 to update our data checks API to return a dictionary instead of a list of warnings and errors. With this issue in mind, I'll update my PR to return a DataCheckResults where errors and warnings are attributes, as well as add a to_json method for the class which will return the JSON formatted messages as @BoopBoopBeepBoop has suggested here, with minimal information--for now, just the message and level.

This issue will remain open to track adding more fields (such as the "code") to the JSON output.

Thanks @angela97lin !

So here's the latest proposal:

  • Update DataCheck.validate and DataChecks.validate to both return the following format:
{'errors': [...], 'warnings': [...]}

where each entry in the above is of the format

{
  "message": "Warning: too many null values present in column 'foobar'",
  "code": "TOO_MANY_NULLS",
  "level": "warning",
  "columns": ["foobar"]
}
  • Define a DataCheckMessageCode enum for populating the code field above.
  • Delete the DataCheckMessage, DataCheckError and DataCheckWarning classes in favor of the dict format above
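
A sketch of what the proposal above could look like in code. This is not EvalML's implementation: `validate` here is a toy stand-in that takes a plain `{column: fraction_null}` dict instead of a dataset, the `0.95` threshold is an assumed value, and only the `TOO_MANY_NULLS` member name comes from this thread.

```python
import json
from enum import Enum


class DataCheckMessageCode(Enum):
    # Only TOO_MANY_NULLS appears in this thread; the value string is illustrative.
    TOO_MANY_NULLS = "too_many_nulls"


def validate(null_fractions, threshold=0.95):
    """Toy stand-in for DataCheck.validate.

    Takes {column_name: fraction_of_null_values} rather than a real dataset.
    """
    results = {"errors": [], "warnings": []}
    for column, fraction in null_fractions.items():
        if fraction > threshold:
            results["warnings"].append({
                "message": f"Warning: too many null values present in column '{column}'",
                "code": DataCheckMessageCode.TOO_MANY_NULLS.name,
                "level": "warning",
                "columns": [column],
            })
    return results


results = validate({"foobar": 0.99, "baz": 0.10})
print(json.dumps(results, indent=2))
```

Storing the enum member's `name` (a plain string) in the dict keeps the result directly serializable with `json.dumps`, while callers can still compare against `DataCheckMessageCode.TOO_MANY_NULLS.name`.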

Any objections / comments before we build this? @tyler3991 @Cmancuso @BoopBoopBeepBoop @angela97lin @freddyaboulton

(@angela97lin @freddyaboulton I started typing out a version of this proposal with a DataCheckResults class, and then decided we should simply return the JSON-ified dict from above, rather than defining another class. Functionally equivalent, but not using the class is simpler. All the results class would be doing is holding the same info, not adding other functionality to it. LMK if you think that's a bad idea.)

Talked to @angela97lin about this RE PR #1444. The above plan is unchanged, except she may choose to keep the DataCheckMessage classes around internally for convenience / to make it harder for people to define invalid message entries. But validate will return a JSON-able dict of lists of dicts, as defined above.

(@angela97lin if that doesn't match what we just discussed please correct me!)
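
One way the "keep the classes internally, return dicts externally" idea could be sketched. The field list, the validation rules, and the `to_results` helper are assumptions for illustration, not the classes as they exist in EvalML.

```python
from dataclasses import dataclass, field

VALID_LEVELS = ("error", "warning")


@dataclass
class DataCheckMessage:
    """Hypothetical internal helper; fields mirror the dict format above."""
    message: str
    code: str
    level: str
    columns: list = field(default_factory=list)

    def __post_init__(self):
        # Reject malformed entries before they reach the caller.
        if self.level not in VALID_LEVELS:
            raise ValueError(f"level must be one of {VALID_LEVELS}, got {self.level!r}")
        if not self.message:
            raise ValueError("message must be non-empty")

    def to_dict(self):
        """Return the plain, JSON-able form of this message."""
        return {
            "message": self.message,
            "code": self.code,
            "level": self.level,
            "columns": list(self.columns),
        }


def to_results(messages):
    """Group messages into the {'errors': [...], 'warnings': [...]} shape."""
    results = {"errors": [], "warnings": []}
    for msg in messages:
        results[msg.level + "s"].append(msg.to_dict())
    return results


msg = DataCheckMessage(
    "Warning: too many null values present in column 'foobar'",
    "TOO_MANY_NULLS",
    "warning",
    ["foobar"],
)
results = to_results([msg])
```

Data check authors would construct `DataCheckMessage` objects and get validation for free, while callers of `validate` would only ever see plain dicts.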

Tweaking this with @angela97lin . Latest:

{
  "message": "Warning: too many null values present in column 'foobar'",
  "code": "TOO_MANY_NULLS",
  "data_check_name": "HighlyNullDataCheck",
  "level": "warning",
  "details": {
    "columns": ["foobar"]
  }
}

where the "details" key can hold any info the data check wants to return. And it'll be omitted if there's nothing to be included.

We need to update the following data checks to include the column(s) which failed the check: highly null, ID, and target leakage. Bonus if we also update the other data checks :)

This is basically what @BoopBoopBeepBoop proposed! ✨ @Cmancuso FYI
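
A small sketch of how the final format might be produced and consumed, assuming entries are still grouped under `errors` and `warnings` as in the earlier proposal. `make_message` and `affected_columns` are hypothetical helper names, not EvalML functions.

```python
def make_message(message, code, data_check_name, level, details=None):
    """Build one entry; the "details" key is omitted when there is nothing to report."""
    entry = {
        "message": message,
        "code": code,
        "data_check_name": data_check_name,
        "level": level,
    }
    if details:
        entry["details"] = details
    return entry


def affected_columns(results, code):
    """Collect the columns reported by every entry with the given code."""
    columns = []
    for entry in results["errors"] + results["warnings"]:
        if entry["code"] == code:
            # "details" may be absent, so fall back to an empty dict.
            columns.extend(entry.get("details", {}).get("columns", []))
    return columns


results = {
    "errors": [],
    "warnings": [
        make_message(
            "Warning: too many null values present in column 'foobar'",
            "TOO_MANY_NULLS",
            "HighlyNullDataCheck",
            "warning",
            details={"columns": ["foobar"]},
        ),
    ],
}

print(affected_columns(results, "TOO_MANY_NULLS"))
```

Since `details` is optional, consumers should read it defensively (as with `entry.get("details", {})` here) rather than index into it directly.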
