Evalml: Re-enable binary classification threshold tuning by default

Created on 15 Apr 2020  ·  17 Comments  ·  Source: alteryx/evalml

We added this feature on the #346 feature branch, then backed it out in #606 because it was recomputing predict and slowing down automl.

We should re-enable this by default. To do so, we'd have to cache the prediction output, which is currently computed in score. The long-term solution is to memoize predictions with a cache (#466), but in the short term we should be able to do something.
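As a rough sketch of that short-term idea (not evalml's actual API; `tune_threshold` and the usage below are hypothetical), the search could compute predicted probabilities once and reuse the same array for both scoring and threshold selection:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_true, y_proba, thresholds=None):
    """Pick the threshold that maximizes F1 on held-out predictions.

    y_proba is computed once by the caller and reused here, so predict
    never has to be re-run just for threshold selection.
    """
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)
    scores = [f1_score(y_true, y_proba >= t) for t in thresholds]
    return thresholds[int(np.argmax(scores))]

# Hypothetical usage: cache the probabilities computed during score(),
# then hand the cached array to threshold selection.
# y_proba = pipeline.predict_proba(X_holdout)[:, 1]  # computed once
# pipeline.threshold = tune_threshold(y_holdout, y_proba)
```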

This also relates to #579, which tracks cleaning up the duplicate code between the pipeline classes' score methods.

enhancement

All 17 comments

I'd like to take a crack at this next week. I've been researching a couple of different methods for caching and have tested some of them locally.

We shouldn't do this until we have a perf test MVP

Now that we have the perf tests MVP, we should do this! This came up as part of #1024.

@angela97lin thank you! Yes, definitely.

Next step is to generate a before vs after performance comparison on some of our binary classification problems.

Additional Considerations

  • Log loss (the default objective for binary classification) and AUC should not be changed at all by this, because they're threshold-agnostic. But other metrics like F1 should definitely improve. It would be nice to look at a few (see the sketch after this list).
  • Fit time will take a hit. The question is, how bad of a hit? I'd expect no greater than a 10-20% increase.
  • We could experiment with sweeping the size of the threshold selection split. This could improve holdout accuracy by preventing overfitting/underfitting. Increasing the threshold tuning split size would also shrink the training split, which speeds up fit.
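To make the first bullet concrete, here's a small illustrative check (toy data, not from the repo) showing that log loss and AUC are computed from the probabilities and ignore the threshold, while F1 moves with it:

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
y_proba = np.array([0.1, 0.3, 0.45, 0.55, 0.6, 0.8, 0.35, 0.2, 0.9, 0.5])

# Threshold-agnostic: computed directly from the probabilities.
print(log_loss(y_true, y_proba))       # same no matter the threshold
print(roc_auc_score(y_true, y_proba))  # same no matter the threshold

# Threshold-sensitive: F1 depends on where we cut the probabilities.
for t in (0.3, 0.5, 0.7):
    print(t, f1_score(y_true, y_proba >= t))  # ~0.77, 0.80, 0.57
```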

Future work

  • We don't currently have any safeguards for this around data size. This applies to the training set in general though, so we should file a separate issue.

In the original writeup in April, I said

we'd have to cache the prediction output, which is currently computed in score.

I believe that no longer applies and can be ignored. That comment was left over from before we refactored score. Plus, we do the threshold optimization on a separate split, so there's nothing to cache. @freddyaboulton FYI

@dsherry @angela97lin I put together the first few sections of the analysis doc here. Can you let me know what you think (only read up to the Experiments section - everything else is still a placeholder)?

@freddyaboulton I just left some comments. We should definitely look at log loss, which should show there's no change at least in the first batch. However I think we should also try optimizing for F1 or something else which is sensitive to threshold, so that we can see the effect of enabling tuning.

@freddyaboulton sorry, I got confused by the plots which were left over from the template, and I didn't see your comment about only reading the first part 🤦‍♂️ I like what you have

@freddyaboulton FYI since you posted a doc, I moved this issue to In Progress

@dsherry @angela97lin I finished my analysis on the "datasets_small_0.yaml" file.

In short, performance actually decreased after tuning the threshold - could it be because we are not using a stratified split to tune the threshold?

@freddyaboulton ooh, yes, that could be.

I reviewed your doc and left comments. I like the new charts and stats. We should find ways of adding those back into looking_glass/analysis/ so we can reuse them. Not pressing though.

Some options that come to mind off the top of my head:

  • Use stratified split for the threshold optimization split
  • Enforce a minimum number of rows for the threshold optimization split. If that's unattainable, we could warn and skip setting a threshold, or we could error.
  • For smaller datasets, use the entire training data as the threshold optimization split, and risk overfitting

I think we should try switching to stratified sampling first and see what that does.

Another thing to try would be to switch the split size from 80% training / 20% threshold optimization to 50% / 50%. I kinda doubt this would do well, but it's easy to try and would be interesting to see.
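For what it's worth, here's a minimal sketch of what the stratified variant could look like, using scikit-learn directly; `threshold_tuning_size`, the minimum-row guard, and all the names are assumptions for illustration, not evalml's actual parameters:

```python
import warnings
from sklearn.model_selection import train_test_split

MIN_THRESHOLD_TUNING_ROWS = 100  # assumed floor, not an evalml constant

def split_for_threshold_tuning(X, y, threshold_tuning_size=0.2, random_state=0):
    """Carve a stratified threshold-tuning split out of the training data.

    threshold_tuning_size=0.2 mirrors the current 80/20 split; 0.5 gives
    the 50/50 variant suggested above.
    """
    X_train, X_thresh, y_train, y_thresh = train_test_split(
        X, y,
        test_size=threshold_tuning_size,
        stratify=y,  # keep the class balance in both splits
        random_state=random_state,
    )
    if len(y_thresh) < MIN_THRESHOLD_TUNING_ROWS:
        warnings.warn("Threshold tuning split is very small; skipping "
                      "threshold optimization.")
        return X, y, None, None  # fall back: no threshold tuning
    return X_train, y_train, X_thresh, y_thresh
```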

Since @jeremyliweishih is picking up #1049 , @freddyaboulton you may want to hand this off to him. I'll let you two figure that out :)

@freddyaboulton you're not working on this, right? Can @jeremyliweishih take it?

@jeremyliweishih @dsherry Please take it! The initial analysis showed that simply enabling tuning doesn't improve scores. Using a different data splitting strategy might help!

Moving back to Dev Backlog and will follow through with this after more data splitting work.

@bchen1116 and I discussed, and we feel this is necessary for #973
