Evalml: Have automl auto-fit the best pipeline on entire training data

Created on 11 Dec 2020  ·  3 Comments  ·  Source: alteryx/evalml

Background
Today, to get a trained pipeline out of automl, you need to call fit on the pipeline yourself, because automl always returns untrained copies:

automl.search(X_train, y_train)
best_pipeline = automl.best_pipeline
best_pipeline.fit(X_train, y_train)
best_pipeline.score(X_test, y_test, objectives=['MSE'])

pipeline = automl.get_pipeline(42)
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test, objectives=['MSE'])

Challenge
We'd like to make it as easy as possible for people to run automl, select a pipeline, and use that pipeline to iterate and debug, generate insights, and deploy to production.

Proposal
In the short-term (i.e. this issue): have the best_pipeline accessor return a trained pipeline:

automl.search(X_train, y_train)
best_pipeline = automl.best_pipeline
best_pipeline.score(X_test, y_test, objectives=['MSE'])

If automl search hasn't run yet, that accessor should raise an error.

My recommendation for how to implement this is to update automl search to fit the best pipeline at the end and save a reference to that trained pipeline.
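A minimal sketch of that approach, using a simplified stand-in class (the names `DummyPipeline`, `_best_pipeline`, and the `ValueError` message are illustrative assumptions, not evalml's actual internals):

```python
class DummyPipeline:
    """Stand-in for an evalml pipeline (illustrative only)."""
    def __init__(self):
        self.is_fitted = False

    def fit(self, X, y):
        self.is_fitted = True
        return self


class AutoMLSearch:
    """Simplified sketch of the proposal; not evalml's real implementation."""
    def __init__(self):
        self._best_pipeline = None  # set by search()

    def search(self, X_train, y_train):
        # ... run automl, rank candidate pipelines (elided) ...
        winner = DummyPipeline()
        # Fit the winner on the full training data and keep a reference,
        # so that best_pipeline returns a trained pipeline.
        winner.fit(X_train, y_train)
        self._best_pipeline = winner

    @property
    def best_pipeline(self):
        # Error if search hasn't been run yet, per the proposal.
        if self._best_pipeline is None:
            raise ValueError("Run search() before accessing best_pipeline.")
        return self._best_pipeline
```

With this shape, the user-facing snippet in the proposal works unchanged: search fits the winner once, and the accessor hands back the trained reference.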

Don't forget to update the user guide!

There are also implications for perf testing: we should update looking glass to record the automl search time and the time to fit the best pipeline separately, because they're independent operations.

Future
Long-term, I'd like us to create an abstraction for holding a reference to the data outside of the call to search. This would allow us to do things like have get_pipeline return trained pipelines as well, without us having to train all the pipelines during the call to automl search.

enhancement

All 3 comments

So the plan is to add arguments for X_test and y_test to the AutoMLSearch.search API? Or will it fit on the X and y passed to search?

should we also add a flag train_best_pipeline to search() or __init__ to allow the user to turn this off? I think that would be nice, since fitting could add significant extra time and memory for an action the user might not want to happen

@freddyaboulton I think it should fit on the entire training data which was provided to search

@kmax12 good point, agreed, we can add a train_best_pipeline flag, default True. In that case, TBD what the API should do if it's False. My instinct would be to simply have best_pipeline return an untrained pipeline, but if anyone has a better idea I'm all ears. @bchen1116 FYI
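One way the flag could behave, sketched with the same kind of simplified stand-in class (the flag name comes from the comment above; everything else here, including `DummyPipeline`, is an illustrative assumption rather than evalml's real API):

```python
class DummyPipeline:
    """Stand-in for an evalml pipeline (illustrative only)."""
    def __init__(self):
        self.is_fitted = False

    def fit(self, X, y):
        self.is_fitted = True
        return self


class AutoMLSearch:
    """Sketch of the proposed train_best_pipeline flag (illustrative)."""
    def __init__(self, train_best_pipeline=True):
        self.train_best_pipeline = train_best_pipeline
        self._best_pipeline = None

    def search(self, X_train, y_train):
        self._best_pipeline = DummyPipeline()  # selection logic elided
        if self.train_best_pipeline:
            # Default behavior: fit on the full training data.
            self._best_pipeline.fit(X_train, y_train)
        # When the flag is False, best_pipeline simply returns the
        # untrained copy, following the instinct described above.

    @property
    def best_pipeline(self):
        if self._best_pipeline is None:
            raise ValueError("Run search() first.")
        return self._best_pipeline
```

Users who opt out skip the extra fit time and memory, at the cost of calling fit themselves before scoring, exactly as in today's workflow.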
