We added this feature on the #346 feature branch, then backed it out in #606 because it was recomputing predict and slowing down automl.
We should re-enable this by default. In order to do so, we'd have to cache the prediction output, which is currently computed in score. The long-term solution is to memoize predictions with a cache (#466), but in the short term a simpler workaround should be enough.
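For the short-term option, here's a minimal sketch of memoizing prediction output keyed by a hash of the input frame, so score wouldn't need to recompute predict. The `PredictionCacheMixin` name, the hashing scheme, and the `predict_fn` hook are illustrative assumptions, not the existing pipeline API:

```python
import hashlib

import pandas as pd


def _hash_features(X: pd.DataFrame) -> str:
    """Cheap content hash so repeated calls on the same data hit the cache."""
    row_hashes = pd.util.hash_pandas_object(X, index=True).values
    return hashlib.md5(row_hashes.tobytes()).hexdigest()


class PredictionCacheMixin:
    """Hypothetical mixin: memoize predict() output so score() and threshold
    tuning don't recompute predictions on the same data."""

    def __init__(self):
        self._prediction_cache = {}

    def cached_predict(self, X, predict_fn):
        key = _hash_features(X)
        if key not in self._prediction_cache:
            self._prediction_cache[key] = predict_fn(X)
        return self._prediction_cache[key]
```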
This also relates to #579, which tracks cleaning up the duplicate code between the pipeline classes' score methods.
I'd like to take a crack at this next week. I've been researching a couple different methods for doing caching and have tested some stuff locally.
We shouldn't do this until we have a perf test MVP.
Now that we have the perf tests MVP, we should do this! This came up as part of #1024.
@angela97lin thank you! Yes, definitely.
Next step is to generate a before vs after performance comparison on some of our binary classification problems.
Additional Considerations
Future work
In the original writeup in April, I said
we'd have to cache the prediction output, which is currently computed in score.
I believe that doesn't apply anymore, so we can ignore it. That comment was left over from before we refactored score. Plus we do the threshold optimization on a separate split, so there's nothing to cache. @freddyaboulton FYI
@dsherry @angela97lin I put together the first few sections of the analysis doc here. Can you let me know what you think (only read up to the Experiments section - everything else is still a placeholder)?
@freddyaboulton I just left some comments. We should definitely look at log loss, which should show no change, at least in the first batch. However, I think we should also try optimizing for F1 or something else that is sensitive to the threshold, so we can see the effect of enabling tuning.
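To make the objective-sensitivity point concrete, here's a minimal sketch of what threshold tuning does for F1, using synthetic data and a plain sklearn classifier rather than our pipelines (everything here is illustrative, not the actual automl code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem standing in for one of our datasets.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

# Hold out a separate split for threshold optimization (illustrative 80/20).
X_train, X_thresh, y_train, y_thresh = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Sweep candidate thresholds over the positive-class probabilities and keep
# the one that maximizes F1. Log loss is computed from the probabilities
# directly, so it would not move no matter which threshold we pick.
proba = clf.predict_proba(X_thresh)[:, 1]
thresholds = np.linspace(0.01, 0.99, 99)
scores = [f1_score(y_thresh, proba >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]
print(f"best F1 threshold: {best_threshold:.2f}")
```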
@freddyaboulton sorry, I got confused by the plots which were left over from the template, and I didn't see your comment about only reading the first part 🤦‍♂️ I like what you have.
@freddyaboulton FYI since you posted a doc, I moved this issue to In Progress
@dsherry @angela97lin I finished my analysis on the "datasets_small_0.yaml" file.
In short, performance actually decreased after tuning the threshold - could it be because we are not using a stratified split to tune the threshold?
@freddyaboulton ooh, yes, that could be.
I reviewed your doc and left comments. I like the new charts and stats. We should find ways of adding those back into looking_glass/analysis/ so we can reuse them. Not pressing, though.
Some options which come to mind off the top of my head:
I think we should try switching to stratified sampling first and see what that does.
Another thing to try would be to switch the split size from 80% training / 20% threshold optimization to 50% training / 50% threshold optimization. I kinda doubt this would do well, but it's easy to try and would be interesting to see; a rough sketch of both options is below.
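For reference, here's roughly what the two splitting variants above look like if we assume a plain sklearn-style holdout for threshold optimization (X and y stand for the features and labels; the actual automl splitting code may differ):

```python
from sklearn.model_selection import train_test_split

# Option 1: keep the 80/20 split but stratify it, so the class balance in the
# threshold-optimization holdout matches the full dataset.
X_train, X_thresh, y_train, y_thresh = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Option 2: 50/50 split, giving the threshold search more data at the cost of
# a smaller training set (shown here combined with stratification, though it
# could also be tried on its own).
X_train, X_thresh, y_train, y_thresh = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
```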
Since @jeremyliweishih is picking up #1049 , @freddyaboulton you may want to hand this off to him. I'll let you two figure that out :)
@freddyaboulton you're not working on this, right? Can @jeremyliweishih take it?
@jeremyliweishih @dsherry Please take it! The initial analysis showed that simply enabling tuning doesn't improve scores. Using a different data splitting strategy might help!
Moving back to Dev Backlog and will follow through with this after more data splitting work.
@bchen1116 and I discussed, and we feel this is necessary for #973