Evalml: For some objectives where baseline was 0, "pct better than baseline" is nan

Created on 20 Nov 2020  ·  9 Comments  ·  Source: alteryx/evalml

The percent_better_than_baseline_all_objectives scores for a non-baseline pipeline come back as:

{'F1': nan,
 'MCC Binary': nan,
 'Log Loss Binary': 93.29789549298991,
 'AUC': 58.36492736629537,
 'Precision': nan,
 'Balanced Accuracy Binary': 63.46659876071641,
 'Accuracy Binary': 12.876088314169193}

I've created a Jupyter notebook that reproduces this problem in evalml, and attached it and the associated datafile to a thread in Slack.

enhancement


All 9 comments

Reproducer

import evalml
import pandas as pd
X = pd.read_csv('~/Downloads/fraud_500_data.csv').drop(['id', 'expiration_date'], axis=1)
y = X.pop('fraud')
automl = evalml.automl.AutoMLSearch(problem_type="binary", objective="f1")
automl.search(X, y)
# note that all percent_better_than_baseline values are nan in the rankings table
print(automl.rankings)
# can also check the scores of any pipeline other than the baseline pipeline, which should have id 0
print(automl.results['pipeline_results'][1]['percent_better_than_baseline_all_objectives'])

Dataset is here

@dsherry @rpeck This is expected behavior: the baseline pipeline gets a score of 0 on the objectives with NaN (F1, MCCBinary, Precision). There have been discussions about making division by 0 return either infinity or None in this method, but we've never decided those are better than NaN: if the baseline gets the worst possible score on an objective, then comparing "percent better" on that objective doesn't do much good, and that can be conveyed equally well with None, NaN, or infinity.

That being said, there may be other reasons to pick one of these options over NaN!
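For illustration, here is a minimal sketch of how a purely relative "percent better than baseline" runs into this. This is not evalml's actual code, just the arithmetic being discussed:

import numpy as np

def relative_pct_better(score, baseline_score, greater_is_better=True):
    # Hypothetical sketch (not evalml's implementation) of a relative
    # "percent better than baseline" computation. A baseline score of 0
    # makes the ratio undefined, which is why F1, MCC Binary, and Precision
    # come back as NaN above when the mode-predicting baseline scores 0.
    if baseline_score == 0:
        return np.nan
    difference = score - baseline_score if greater_is_better else baseline_score - score
    return difference / abs(baseline_score) * 100

print(relative_pct_better(0.2, 0.0))  # nan: baseline scored 0 on this objective
print(relative_pct_better(0.2, 0.1))  # 100.0: twice the baseline score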

@freddyaboulton Ah, makes sense! I'll change the test to skip over any objective where the baseline is 0. Thanks!

Thank you @freddyaboulton! @rpeck sorry I didn't catch this when you were asking me about it yesterday.

Leaving this issue open to discuss: should we change the behavior in this case?

@freddyaboulton so F1, MCCBinary and Precision are all metrics where greater is better and are bounded in the range [-1, 1] (corr) or [0, 1]. Could we alter the pct improvement impl to compute the absolute difference from 0 and use that as the pct improvement? And if that's what we're doing currently, I wouldn't expect a baseline of 0 to produce nan pct improvement for those metrics.

@dsherry We proposed computing absolute difference for objectives bounded by [0, 1] in the design phase but we decided having two different computations would be confusing. That being said, we should maybe reconsider that given that the baseline pipeline is almost designed to score 0 on those objectives lol. Worth noting that when we first made that decision, we were only computing the percent better for the primary objective (which is not one of these bounded objectives except for regression).

Even if we do compute the absolute difference, we may want to reconsider the NaN/None/inf division-by-0 behavior. One interesting case to consider is R2, since in most cases it's [0, 1] but it's technically (-inf, 1]. So computing the absolute difference may not be mathematically sound, but since it's the default objective for regression, we should expect to see lots of baselines scoring 0.

So to summarize, there are two independent changes we can make, leading to four possible outcomes (a worked example follows the list):

  1. Do not compute absolute difference for objectives bounded in [0, 1]; division by 0 is NaN. (Current behavior.)
  2. Do not compute absolute difference for objectives bounded in [0, 1]; division by 0 is inf.
  3. Compute absolute difference for objectives bounded in [0, 1]; division by 0 is NaN.
  4. Compute absolute difference for objectives bounded in [0, 1]; division by 0 is inf.

Although I prefer returning NaN when we divide by 0, the gut reaction of users when they see NaN has been to suppose something broke in automl. I think returning inf would make it clearer that nothing broke and that the pipeline is in fact better than the baseline.

That leaves options 2 and 4.

I think having two different computations for "percent better" will make it harder to communicate to users what's actually being computed for each pipeline. That being said, our baseline pipelines are designed to score 0 for a lot of objectives (R2, F1, MCC) especially in imbalanced problems (we just predict the mode). That makes the "percent better" feature not very useful for most realistic problems since all pipelines will be "infinitely" better than the baseline.

I think I'm leaning 55% for option 4 and 45% for option 2 but I'd like to hear other viewpoints before making that change!

In standup today we decided it's time to update the "pct better than baseline" behavior. We're going with options 2 and 4 above (a rough sketch follows the list):

  • Use relative difference for objectives without bounds (MSE, log loss, etc)
  • Use absolute difference for objectives with [0, 1] bounds (AUC, R2, etc)
  • We'll have to handle edge cases like Pearson correlation ([-1, 1])
  • Return inf rather than nan if there's a divide-by-0 error
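A minimal sketch of that behavior, assuming each objective can report whether it is bounded in [0, 1] (the function name and the bounded_in_zero_one flag are invented here, not evalml's API):

import numpy as np

def pct_better_than_baseline(score, baseline_score, greater_is_better=True,
                             bounded_in_zero_one=False):
    # bounded_in_zero_one is a made-up flag standing in for "this objective is
    # bounded in [0, 1]" (AUC, F1, R2 in the common case, etc.).
    difference = score - baseline_score if greater_is_better else baseline_score - score
    if bounded_in_zero_one:
        # Absolute difference, reported in percentage points: F1 going 0.0 -> 0.2 is +20.
        return difference * 100
    if baseline_score == 0:
        # Relative difference is undefined; return inf rather than nan so it's
        # clear nothing broke and the pipeline really is better than the baseline.
        return np.inf if difference > 0 else (-np.inf if difference < 0 else 0.0)
    # Relative difference for unbounded objectives (MSE, log loss, ...).
    return difference / abs(baseline_score) * 100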

@freddyaboulton does this match what we discussed?

Like. :-)

Further: I agree with the decision. IMO, if a metric is [usually, at least] 0..1, then going from 0 to 0.2 _feels_ like a 20% improvement, even though mathematically it isn't. In a way, this reminds me of all of those formulas that take the log of a quantity, but they add 1 first so that they don't take the log of 0. 🙂
