#1732 updates our stacked ensembling pipeline to use the same data splitter that is used in AutoML. However, @rpeck noted that we may not want to do this. We continued with #1732 because we believed it was still an improvement over our current approach (the scikit-learn default).
This issue tracks long-term updates we may want to make to our data splitter for stacking in AutoML.
Update: while continuing work on #1732, we ran into a conundrum with the interaction between stacking and AutoML that made us revisit whether it really was a good idea to use the same data split for the stacked ensemble as we use in AutoML. We decided no, and that, as Raymond had pointed out, we probably want to use a separate CV split for our ensemble. (@dsherry also mentioned a good nugget of info: using CV for ensembling helps the metalearner avoid placing excessive importance on the more complex models--please correct me if I've paraphrased incorrectly 😂 ).
Rather than continuing that work, we should use this issue to discuss updating AutoML for stacking. Specifically, we should create a separate CV split for the stacked ensembler. This would be similar to what we currently have in place for binary-threshold tuning.
Plan: if stacking is enabled, we'll create a separate split which can be passed to stacked ensembling for CV
It could be neat to look into supporting the use of out-of-sample predictions (the validation splits from the original CV) as the data passed to stacking. However, I suggest we start with the simpler approach of just creating a separate split when stacking is enabled.
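To make the simpler approach concrete, here is a minimal sketch of carving off a separate split for the ensembler before the normal AutoML CV runs. The variable names and split sizes below are illustrative assumptions, not the actual pipeline API:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the user's training set.
X = list(range(100))
y = [i % 2 for i in range(100)]

# Hypothetical split: hold out 20% of the data for stacked ensembling's CV;
# the remaining 80% feeds the normal AutoML search, mirroring the separate
# split we already create for binary-threshold tuning.
X_automl, X_ensemble, y_automl, y_ensemble = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(len(X_automl), len(X_ensemble))  # 80 20
```

The ensembler would then run its own CV entirely within `X_ensemble`, so its metalearner never sees data the base pipelines were tuned on.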
RE our discussion, some supporting evidence for why we should hold out a separate split which stacked ensembling can use to perform CV:
@rpeck FYI, after some spinning we are following your suggestion 😆
@dsherry @rpeck @angela97lin I started looking at this issue, but it seems like sklearn's StackingClassifier and StackingRegressor classes already use internal cross-validation while training the model in order to prevent overfitting. This looks like the same problem we are trying to solve with this issue, so it may already be resolved. I don't think we'll need a separate CV fold for training/validating the stacked ensembling methods, but what do you all think?
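For reference, a small sketch of the internal CV behavior described above: sklearn's StackingClassifier fits each base estimator on the full training data, but trains the final (meta) estimator on out-of-fold predictions generated via the `cv` parameter. The dataset and estimator choices here are just illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

clf = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=10, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,  # internal 5-fold CV produces the metalearner's training data
)
clf.fit(X, y)
print(clf.score(X, y))
```

Because the metalearner only ever sees out-of-fold predictions, it is never trained on predictions the base estimators made for points they were fit on, which is exactly the overfitting concern this issue raises.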
After discussion with @dsherry, here's the plan we want to proceed with.
Plan discussed with @bchen1116: this issue tracks:
Separate performance enhancement: better support for small data. Don't create a separate split for ensembling. Use out-of-sample pipeline predictions from the normal CV (from across all the CV folds) to train the ensembler. #1898
Another separate performance enhancement: train the pipelines and the metalearner on different data. #1897
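The out-of-sample-predictions idea in the first enhancement can be sketched with `cross_val_predict`, which stitches together each fold's validation-split predictions into one full-length array. This is an assumed shape for the approach, not the actual #1898 implementation, and the single stand-in pipeline is hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=120, random_state=0)

# Stand-in for the pipelines AutoML would have trained during its search.
pipelines = [LogisticRegression(max_iter=1000)]

# Each column holds one pipeline's out-of-fold predictions across all CV
# folds, so every row is a prediction made on data that pipeline never saw.
meta_X = np.column_stack(
    [
        cross_val_predict(p, X, y, cv=3, method="predict_proba")[:, 1]
        for p in pipelines
    ]
)

# Train the metalearner on the out-of-sample predictions: no extra split
# is needed, which is what makes this attractive for small datasets.
metalearner = LogisticRegression().fit(meta_X, y)
print(meta_X.shape)  # (120, 1)
```

Compared with holding out a dedicated ensembling split, this reuses every row of training data, at the cost of rerunning (or caching) each pipeline's fold-level predictions.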