Background
Today, in order to get a trained pipeline out of automl, you need to call `fit` on the pipeline yourself, because automl always returns untrained copies of the pipeline:

```python
automl.search(X_train, y_train)

best_pipeline = automl.best_pipeline
best_pipeline.fit(X_train, y_train)
best_pipeline.score(X_test, y_test, objectives=['MSE'])

pipeline = automl.get_pipeline(42)
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test, objectives=['MSE'])
```
Challenge
We'd like to make it as easy as possible for people to run automl, select a pipeline, and use that pipeline to iterate and debug, generate insights, and deploy to production.
Proposal
In the short term (i.e. this issue): have the `best_pipeline` accessor return a trained pipeline:

```python
automl.search(X_train, y_train)
best_pipeline = automl.best_pipeline
best_pipeline.score(X_test, y_test, objectives=['MSE'])
```
If automl search hasn't run yet, that accessor should raise an error.
My recommendation for implementing this: update automl search to fit the best pipeline at the end of search, and save a reference to that trained pipeline.
Don't forget to update the user guide!
There are also implications for perf testing: we should update looking glass to record the automl search time and the time to fit the best pipeline separately, because they're independent operations.
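A minimal sketch of the proposed behavior, assuming hypothetical internals: `DummyPipeline` and `_find_best` are stand-ins for the real pipeline classes and ranking logic, which this issue doesn't spell out.

```python
class DummyPipeline:
    """Stand-in for a real pipeline; only tracks whether fit() ran."""
    def __init__(self):
        self.is_fitted = False

    def fit(self, X, y):
        self.is_fitted = True
        return self


class AutoMLSearch:
    def __init__(self):
        self._best_pipeline = None  # trained best pipeline, set by search()

    def _find_best(self, X, y):
        # Stand-in for the existing ranking logic that picks the best
        # untrained pipeline found during search.
        return DummyPipeline()

    def search(self, X, y):
        # Fit the best pipeline at the end of search and save a
        # reference to the trained pipeline.
        self._best_pipeline = self._find_best(X, y).fit(X, y)

    @property
    def best_pipeline(self):
        # Error if search hasn't run yet.
        if self._best_pipeline is None:
            raise ValueError("Run search() before accessing best_pipeline.")
        return self._best_pipeline
```

With this shape, `automl.best_pipeline` is immediately scoreable after `search()`, and accessing it before `search()` raises.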
Future
Long-term, I'd like us to create an abstraction for holding a reference to the data outside of the call to `search`. This would allow us to do things like have `get_pipeline` return trained pipelines as well, without us having to train all the pipelines during the call to automl search.
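One way the future abstraction could look, as a sketch under stated assumptions: `TrainingDataRef`, `DummyPipeline`, and the `_results` dict are all hypothetical names, since the issue only describes the idea, not the design.

```python
class DummyPipeline:
    """Stand-in for a real pipeline; only tracks whether fit() ran."""
    def __init__(self, pipeline_id):
        self.id = pipeline_id
        self.is_fitted = False

    def fit(self, X, y):
        self.is_fitted = True
        return self


class TrainingDataRef:
    """Hypothetical abstraction holding a reference to the search data."""
    def __init__(self, X, y):
        self.X, self.y = X, y


class AutoMLSearch:
    def __init__(self):
        self._data = None
        self._results = {}

    def search(self, X, y):
        # Keep a reference to the data outside the call to search.
        self._data = TrainingDataRef(X, y)
        # Stand-in for the real search loop: record untrained pipelines.
        self._results = {i: DummyPipeline(i) for i in range(3)}

    def get_pipeline(self, pipeline_id):
        # Train on demand using the stored data reference, so nothing
        # has to be trained up front during search itself.
        return self._results[pipeline_id].fit(self._data.X, self._data.y)
```

The point of the sketch: only the pipeline the user asks for ever gets trained, and the others stay untrained.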
So the plan is to add arguments for `X_test` and `y_test` to the `AutoMLSearch.search` API? Or will it fit on the `X` and `y` passed to `search`?
Should we also add a flag `train_best_pipeline` to `search()` or `__init__` to allow the user to turn this off? I think that would be nice, since fitting the best pipeline could add significant extra time and memory for an action the user might not want to happen.
@freddyaboulton I think it should fit on the entire training data which was provided to `search`.
@kmax12 Good point, agreed. We can add a `train_best_pipeline` flag, default `True`. In that case, TBD what the API should do if it's `False`. My instinct would be to simply have `best_pipeline` return an untrained pipeline, but if anyone has a better idea, I'm all ears. @bchen1116 FYI
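A sketch of how the flag discussed above might behave, assuming hypothetical internals (`DummyPipeline` and the `_best` attribute are stand-ins, not the real implementation):

```python
class DummyPipeline:
    """Stand-in for a real pipeline; only tracks whether fit() ran."""
    def __init__(self):
        self.is_fitted = False

    def fit(self, X, y):
        self.is_fitted = True
        return self


class AutoMLSearch:
    def __init__(self, train_best_pipeline=True):
        self.train_best_pipeline = train_best_pipeline
        self._best = None

    def search(self, X, y):
        self._best = DummyPipeline()  # stand-in for ranking logic
        # Only pay the extra fit time/memory when the flag is on.
        if self.train_best_pipeline:
            self._best.fit(X, y)

    @property
    def best_pipeline(self):
        if self._best is None:
            raise ValueError("Run search() before accessing best_pipeline.")
        # With the flag off, this is simply an untrained pipeline,
        # matching today's behavior.
        return self._best
```

With `train_best_pipeline=False` the accessor still works, it just returns an untrained pipeline the user must `fit` themselves.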