Evalml: AutoML: use separate CV split for ensembling

Created on 26 Jan 2021  ·  4 Comments  ·  Source: alteryx/evalml

#1732 updates our stacked ensembling pipeline to use the same data splitter that is used in AutoML. However, @rpeck noted that we may not want to do this. We continued with #1732 because we believed it was still an improvement over our current approach (the scikit-learn default).

This issue tracks long-term updates we may want to make to our data splitter for stacking in AutoML.


Update: while continuing work on #1732, we ran into a conundrum with the interaction between stacking and AutoML that made us revisit whether it really was a good idea to use the same data split for the stacking ensemble as we use in AutoML. We decided it was not, and that, as Raymond had pointed out, we probably want to use a separate CV split for our ensemble. (@dsherry also mentioned a good nugget of info: using ensembling lets the model avoid putting excessive importance on the more complex models, and CV probably helps with that--please correct me if I've paraphrased incorrectly 😂 ).

Rather than continuing that work there, we should use this issue to discuss updating AutoML for stacking: specifically, we should create a separate CV split for the stacked ensemble. This would be similar to what we currently have in place for binary-threshold tuning.

enhancement performance

All 4 comments

Plan: if stacking is enabled, we'll create a separate split which can be passed to stacked ensembling for CV

It could be neat to look into supporting the use of out-of-sample predictions (the validation splits from the original CV) as the data passed to stacking. However, I suggest we start with the simpler approach of just creating a separate split if stacking is enabled.
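
Below is a minimal, hypothetical sketch of what the separate split could look like, using scikit-learn's `train_test_split` on toy data (the `ensemble_holdout_size` fraction and variable names are illustrative only, not the actual evalml implementation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for the user's training data.
X, y = make_classification(n_samples=1000, random_state=0)

# Hypothetical fraction reserved for the ensembling split (not an actual evalml parameter).
ensemble_holdout_size = 0.2

X_automl, X_ensemble, y_automl, y_ensemble = train_test_split(
    X, y, test_size=ensemble_holdout_size, random_state=0
)

# X_automl / y_automl: fed to the existing AutoML data splitter to train and
# score the individual pipelines.
# X_ensemble / y_ensemble: passed to the stacked ensemble pipeline, which can
# run its own CV when fitting the metalearner.
```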

RE our discussion, some supporting evidence for why we should hold out a separate split which stacked ensembling can use to perform CV:

  • "By using the cross-validated [predictions,] stacking avoids giving unfairly high weight to models with higher complexity." AKA overfitting
  • "The most common approach to preparing the training dataset for the meta-model is via k-fold cross-validation of the base models, where the out-of-fold predictions are used as the basis for the training dataset for the meta-model. The training data for the meta-model may also include the inputs to the base models, e.g. input elements of the training data. This can provide an additional context to the meta-model as to how to best combine the predictions from the meta-model. Once the training dataset is prepared for the meta-model, the meta-model can be trained in isolation on this dataset, and the base-models can be trained on the entire original training dataset." -- blog post
  • "It is important that the meta-learner is trained on a separate dataset to the examples used to train the level 0 models to avoid overfitting." -- another blog post
  • Original paper abstract which discusses how stacked ensembling can be viewed as a generalization of cross-validation
  • I also found this to be a good read.
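
As a rough illustration of the out-of-fold approach described in the second quote above, here's a hedged sketch using scikit-learn's `cross_val_predict` (the base models and metalearner here are arbitrary placeholders, not evalml pipelines):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)
base_models = [RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)]

# Out-of-fold predictions: each row is predicted by a copy of the model that
# never saw it during training, so the metalearner isn't fit on leaked predictions.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# Train the metalearner on the out-of-fold predictions...
metalearner = LogisticRegression().fit(meta_features, y)

# ...then refit the base models on the entire original training data.
for model in base_models:
    model.fit(X, y)
```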

@rpeck FYI, after some spinning we are following your suggestion 😆

@dsherry @rpeck @angela97lin I started looking at this issue, but it seems like sklearn's StackingClassifier and StackingRegressor classes already use internal cross-validation while training the model in order to prevent overfitting. This looks like the same problem we are trying to solve with this issue, so it may already be handled. I don't think we'll need a separate CV split for training/validating the stacked ensembling methods, but what do you all think?
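
For reference, a small self-contained example of that internal cross-validation: the `cv` argument of scikit-learn's `StackingClassifier` controls how the out-of-fold predictions used to train the final estimator are generated (the estimators chosen here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions from this CV train the final_estimator
)

# The base estimators are refit on the full training data; only the
# final_estimator is trained on the cross-validated predictions.
stack.fit(X, y)
```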


After discussion with @dsherry, here's the idea we want to proceed with

Plan discussed with @bchen1116: this issue tracks:

  • Create a separate split for training the metalearner for ensemble pipelines
  • Continue to use the sklearn implementation for stacked ensembling

Separate performance enhancement: better support for small data. Don't create a separate split for ensembling. Use out-of-sample pipeline predictions from the normal CV (from across all the CV folds) to train the ensembler. #1898

Another separate performance enhancement: train the pipelines and the metalearner on different data. #1897
