Evalml: Re-enable binary classification threshold tuning by default

Created on 15 Apr 2020  ·  17 Comments  ·  Source: alteryx/evalml

We added this feature on the #346 feature branch, then backed it out in #606 because it was recomputing predict and slowing down automl.

We should re-enable this by default. To do so, we'd have to cache the prediction output, which is currently computed in score. The long-term solution is to memoize predictions with a cache (#466), but in the short term we should be able to do something.
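As a rough sketch of that short-term idea (not evalml's actual API; `tune_threshold` and the usage below are hypothetical), the search could compute predicted probabilities once and reuse the same array for both scoring and threshold selection:

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_true, y_proba, thresholds=None):
    """Pick the threshold that maximizes F1 on held-out predictions.

    y_proba is computed once by the caller and reused here, so predict
    never has to be re-run just for threshold selection.
    """
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)
    scores = [f1_score(y_true, y_proba >= t) for t in thresholds]
    return thresholds[int(np.argmax(scores))]

# Hypothetical usage: cache the probabilities computed during score(),
# then hand the cached array to threshold selection.
# y_proba = pipeline.predict_proba(X_holdout)[:, 1]  # computed once
# pipeline.threshold = tune_threshold(y_holdout, y_proba)
```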

This also relates to #579, which tracks cleaning up the duplicate code between the pipeline classes' score methods.

enhancement

All 17 comments

I'd like to take a crack at this next week. I've been researching a couple of different methods for caching and have tested some of them locally.

We shouldn't do this until we have a perf test MVP

Now that we have the perf tests MVP, we should do this! This came up as part of #1024.

@angela97lin thank you! Yes, definitely.

Next step is to generate a before vs after performance comparison on some of our binary classification problems.

Additional Considerations

  • Log loss (the default objective for binary classification) and AUC should not be changed at all by this, because they're threshold-agnostic. But other metrics like F1 should definitely improve. It would be nice to look at a few (see the sketch after this list).
  • Fit time will take a hit. The question is, how bad of a hit? I'd expect no greater than a 10-20% increase.
  • We could experiment with sweeping the size of the threshold selection split. This could improve holdout accuracy by preventing overfitting/underfitting. Increasing the threshold tuning split size would also shrink the training split, which speeds up fit.
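To make the first bullet concrete, here's a small illustrative check (toy data, not from the repo) showing that log loss and AUC are computed from the probabilities and ignore the threshold, while F1 moves with it:

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
y_proba = np.array([0.1, 0.3, 0.45, 0.55, 0.6, 0.8, 0.35, 0.2, 0.9, 0.5])

# Threshold-agnostic: computed directly from the probabilities.
print(log_loss(y_true, y_proba))       # same no matter the threshold
print(roc_auc_score(y_true, y_proba))  # same no matter the threshold

# Threshold-sensitive: F1 depends on where we cut the probabilities.
for t in (0.3, 0.5, 0.7):
    print(t, f1_score(y_true, y_proba >= t))  # ~0.77, 0.80, 0.57
```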

Future work

  • We don't currently have any safeguards for this around data size. This applies to the training set in general though, so we should file a separate issue.

In the original writeup in April, I said

we'd have to cache the prediction output, which is currently computed in score.

I believe that no longer applies and can be ignored. That comment was left over from before we refactored score. Plus, we do the threshold optimization on a separate split, so there's nothing to cache. @freddyaboulton FYI

@dsherry @angela97lin I put together the first few sections of the analysis doc here. Can you let me know what you think (only read up to the Experiments section - everything else is still a placeholder)?

@freddyaboulton I just left some comments. We should definitely look at log loss, which should show there's no change at least in the first batch. However I think we should also try optimizing for F1 or something else which is sensitive to threshold, so that we can see the effect of enabling tuning.

@freddyaboulton sorry, I got confused by the plots which were left over from the template, and I didn't see your comment about only reading the first part 🤦‍♂️ I like what you have

@freddyaboulton FYI since you posted a doc, I moved this issue to In Progress

@dsherry @angela97lin I finished my analysis on the "datasets_small_0.yaml" file.

In short, performance actually decreased after tuning the threshold - could it be because we are not using a stratified split to tune the threshold?

@freddyaboulton ooh, yes, that could be.

I reviewed your doc and left comments. I like the new charts and stats. We should find ways of adding those back into looking_glass/analysis/ so we can reuse them. Not pressing though.

Some options that come to mind off the top of my head:

  • Use stratified split for the threshold optimization split
  • Enforce a minimum number of rows for the threshold optimization split. If that's unattainable, we could warn and skip setting a threshold, or we could error.
  • For smaller datasets, use the entire training data as the threshold optimization split, and risk overfitting

I think we should try switching to stratified sampling first and see what that does.

Another thing to try would be to switch the split size from 80% training / 20% threshold optimization to 50% / 50%. I kinda doubt this would do well, but it's easy to try and would be interesting to see.
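For what it's worth, here's a minimal sketch of what the stratified variant could look like, using scikit-learn directly; `threshold_tuning_size`, the minimum-row guard, and all the names are assumptions for illustration, not evalml's actual parameters:

```python
import warnings
from sklearn.model_selection import train_test_split

MIN_THRESHOLD_TUNING_ROWS = 100  # assumed floor, not an evalml constant

def split_for_threshold_tuning(X, y, threshold_tuning_size=0.2, random_state=0):
    """Carve a stratified threshold-tuning split out of the training data.

    threshold_tuning_size=0.2 mirrors the current 80/20 split; 0.5 gives
    the 50/50 variant suggested above.
    """
    X_train, X_thresh, y_train, y_thresh = train_test_split(
        X, y,
        test_size=threshold_tuning_size,
        stratify=y,  # keep the class balance in both splits
        random_state=random_state,
    )
    if len(y_thresh) < MIN_THRESHOLD_TUNING_ROWS:
        warnings.warn("Threshold tuning split is very small; skipping "
                      "threshold optimization.")
        return X, y, None, None  # fall back: no threshold tuning
    return X_train, y_train, X_thresh, y_thresh
```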

Since @jeremyliweishih is picking up #1049 , @freddyaboulton you may want to hand this off to him. I'll let you two figure that out :)

@freddyaboulton you're not working on this, right? Can @jeremyliweishih take it?

@jeremyliweishih @dsherry Please take it! The initial analysis showed that simply enabling tuning doesn't improve scores. Using a different data splitting strategy might help!

Moving back to Dev Backlog and will follow through with this after more data splitting work.

@bchen1116 and I discussed, and we feel this is necessary for #973
