Evalml: Stacked ensemble performing poorly

Created on 6 Apr 2021 · 11 comments · Source: alteryx/evalml

Steps to reproduce:

  1. Load Happiness dataset into evalml
  2. Run long enough to include ensembling
  3. Observe that the baseline regressor is ranked higher than the stacked ensemble regressor.
    Happiness Data Full Set.csv.zip
Labels: bug, performance

All 11 comments

@dancuarini I tried to reproduce this locally but wasn't able to; this could be because of additional steps taken before running AutoMLSearch (e.g. data split size, dropping columns). Let's talk about the problem configuration!

Here's what I tried to run locally:

import evalml
import pandas as pd
import woodwork as ww
from evalml.automl import AutoMLSearch
from evalml.automl.callbacks import raise_error_callback

happiness_data_set = pd.read_csv("Happiness Data Full Set.csv")
y = happiness_data_set['Happiness']
X = happiness_data_set.drop(['Happiness'], axis=1)
# display(X.head())

X = ww.DataTable(X)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.2, random_seed=0)
# print(X.types)

automl = AutoMLSearch(X, y, problem_type="regression", objective="MAE", error_callback=raise_error_callback, max_batches=20, ensembling=True)
automl.search()

This results in the following rankings:

[image: rankings table]

Current progress: discussed with @dancuarini about not being able to repro locally, will keep in touch with @Cmancuso about repro-ing and next steps.

@angela97lin wait, are you sure you couldn't repro this? Here the stacked ensembler shows up in the middle of the rankings--I'd expect it to be at the top!

Thanks for sharing the reproducer :)

@dsherry While it is a little suspicious that the stacked ensembler isn't at the top, the original issue was that the stacked ensembler was performing so poorly that it was ranked below the baseline regressor!

@angela97lin ah yes understood! I sent you some notes.

I think any evidence that our ensembles aren't always close to the top is a problem.

Dug into this a bit more. I think there are some potential reasons why the ensembler performs poorly with this data set:

  1. The dataset is really small, and our current data splitting strategy means that the ensembler is trained and validated on a very small subset of data. Right now, if we want to train a stacked ensembler, we split off some data (identified by ensembling_indices) for the ensembler to train on. This is to prevent overfitting the ensembler by training the metalearner on the same data that the input pipelines were already trained on. We then do one CV split, further splitting the data selected by ensembling_indices. For this dataset of 128 rows, roughly 20% (about 25 rows) is reserved for ensembling, so the ensembler ends up training and validating on just 17 and 8 rows, respectively. I filed #2144 to discuss whether we want to do this additional CV split.

  2. Our ensembler is currently constructed by taking the best pipeline found for each model family and using those as the input pipelines for the stacked ensembler. However, if some of the input pipelines perform quite poorly, the stacked ensembler may not perform as well as a high-performing individual pipeline.

For example, this is the final rankings table:
[image: final rankings table]

We notice that the stacked ensemble lands right in the middle of the rankings--if we simplify and say that the stacked ensemble averages the predictions of its input pipelines, this makes sense. To test this hypothesis, I used only the model families that performed better than the stacked ensembler, rather than all of the model families, and the resulting ensemble scores much better than any individual pipeline. This leads me to believe that the poorly performing individual pipelines dragged down the stacked ensembler.
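As a toy illustration of that averaging intuition (synthetic numbers, not the happiness data, and a deliberately naive "average the predictions" ensemble):

import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
y_true = rng.uniform(4, 8, size=100)            # happiness-like target values
pred_strong = y_true + rng.normal(0, 0.2, 100)  # good pipeline: small errors
pred_weak = y_true + rng.normal(1.0, 0.8, 100)  # poor pipeline: large, biased errors
pred_avg = (pred_strong + pred_weak) / 2        # naive "ensemble" = simple average

for name, pred in [("strong", pred_strong), ("weak", pred_weak), ("average", pred_avg)]:
    print(name, round(mean_absolute_error(y_true, pred), 3))
# The averaged predictions score between the strong and weak pipelines,
# mirroring the ensemble sitting in the middle of the rankings table.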

Here's the repro code for this:

From above:

import evalml
import pandas as pd
import woodwork as ww
from evalml.automl import AutoMLSearch
from evalml.automl.callbacks import raise_error_callback

happiness_data_set = pd.read_csv("Happiness Data Full Set.csv")
y = happiness_data_set['Happiness']
X = happiness_data_set.drop(['Happiness'], axis=1)

X = ww.DataTable(X)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.25, random_seed=0)

automl = AutoMLSearch(X, y, problem_type="regression", objective="MAE", error_callback=raise_error_callback, max_batches=10, ensembling=True)
automl.search()

import woodwork as ww
from evalml.automl.engine import train_and_score_pipeline
from evalml.automl.engine.engine_base import JobLogger
from evalml.model_family import ModelFamily
from evalml.pipelines.utils import _make_stacked_ensemble_pipeline  # internal helper used by ensembling

# Get the best pipeline per model family that was fed into the ensemble,
# but keep only the families that ranked better than the stacked ensemble.
input_info = automl._automl_algorithm._best_pipeline_info
trimmed = {
    ModelFamily.RANDOM_FOREST: input_info[ModelFamily.RANDOM_FOREST],
    ModelFamily.XGBOOST: input_info[ModelFamily.XGBOOST],
    ModelFamily.EXTRA_TREES: input_info[ModelFamily.EXTRA_TREES],
}

# Rebuild each input pipeline with the parameters found during search.
input_pipelines = []
for pipeline_dict in trimmed.values():
    pipeline_class = pipeline_dict['pipeline_class']
    pipeline_params = pipeline_dict['parameters']
    input_pipelines.append(pipeline_class(
        parameters=automl._automl_algorithm._transform_parameters(pipeline_class, pipeline_params),
        random_seed=automl._automl_algorithm.random_seed))

# Build the stacked ensemble from just these three pipelines and score it on the
# ensembling split, the same way AutoMLSearch does.
ensemble_pipeline = _make_stacked_ensemble_pipeline(input_pipelines, "regression")
X_train = X.iloc[automl.ensembling_indices]
y_train = ww.DataColumn(y.iloc[automl.ensembling_indices])
train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())

By using just these three model families, we get an MAE of ~0.22, which is much better than any individual pipeline.

#output of train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())
{'scores': {'cv_data': [{'all_objective_scores': OrderedDict([('MAE',
                  0.22281276417465426),
                 ('ExpVariance', 0.9578811127332543),
                 ('MaxError', 0.3858477236606914),
                 ('MedianAE', 0.2790362808260225),
                 ('MSE', 0.0642654425375983),
                 ('R2', 0.9152119239698017),
                 ('Root Mean Squared Error', 0.2535062968401343),
                 ('# Training', 17),
                 ('# Validation', 9)]),
    'mean_cv_score': 0.22281276417465426,
    'binary_classification_threshold': None}],
  'training_time': 9.944366216659546,
  'cv_scores': 0    0.222813
  dtype: float64,
  'cv_score_mean': 0.22281276417465426},
 'pipeline': TemplatedPipeline(parameters={'Stacked Ensemble Regressor':{'input_pipelines': [GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Random Forest Regressor':{'n_estimators': 184, 'max_depth': 25, 'n_jobs': -1},}), GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'XGBoost Regressor':{'eta': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100},}), GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Extra Trees Regressor':{'n_estimators': 100, 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1},})], 'final_estimator': None, 'cv': None, 'n_jobs': -1},}),

This makes me wonder whether we need to rethink what input pipelines we should feed to our stacked ensembler.

  3. The metalearner we're using (LinearRegressor) is not the best. I tested this via the stacking_test branch I created, where I updated the default metalearner to RidgeCV (scikit-learn's default for stacking, which we don't currently have in EvalML), and the ensembler performs much better (a rough standalone comparison is sketched after this list):
    [image: rankings with the RidgeCV metalearner]
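For reference, a minimal scikit-learn-only sketch of that metalearner comparison. This is not evalml's pipeline: it assumes a couple of stand-in base regressors and numeric features only, and just illustrates swapping the final_estimator.

import pandas as pd
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score

df = pd.read_csv("Happiness Data Full Set.csv")
y_all = df["Happiness"]
X_all = df.drop(columns=["Happiness"]).select_dtypes(include="number")  # numeric columns only, for simplicity

# Stand-in base estimators (evalml's actual input pipelines also include imputation/encoding).
base = [("rf", RandomForestRegressor(random_state=0)),
        ("et", ExtraTreesRegressor(random_state=0))]

for name, meta in [("LinearRegression", LinearRegression()), ("RidgeCV (sklearn default)", RidgeCV())]:
    stack = StackingRegressor(estimators=base, final_estimator=meta)
    mae = -cross_val_score(stack, X_all, y_all, scoring="neg_mean_absolute_error", cv=3).mean()
    print(f"{name}: MAE {mae:.3f}")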

Next steps after discussion with @dsherry:

Try #1 and #3 (using Elastic Net) on other datasets, run perf tests, see if we can get better performance overall.

@angela97lin Your points about the splitting, for tiny datasets, are right on target. Eventually, we need to handle tiny datasets really differently than bigger ones, e.g. by only using high-fold-count xval on the entire dataset, even LOOCV, and making sure we construct the folds differently for the ensemble metalearner training.
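As an aside, a minimal sketch of what high-fold-count evaluation on the full 128-row dataset could look like (plain scikit-learn with a stand-in model and numeric features only, not an evalml pipeline):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_score

df = pd.read_csv("Happiness Data Full Set.csv")
y_all = df["Happiness"]
X_all = df.drop(columns=["Happiness"]).select_dtypes(include="number")

# Leave-one-out CV: every row is used for training in all but one fold,
# so no separate ensembling split has to be carved out of the tiny dataset.
mae = -cross_val_score(RandomForestRegressor(random_state=0), X_all, y_all,
                       scoring="neg_mean_absolute_error", cv=LeaveOneOut()).mean()
print(f"LOOCV MAE: {mae:.3f}")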

I also agree that the metalearner needs to use strong regularization. I used Elastic Net in H2O-3 StackedEnsemble, and only remember one time that the ensemble came in second in the leaderboard. Every other time I tested, it was first. The regularization should never allow poor models to bring down the performance of the ensemble.

And this was feeding the entire leaderboard of even 50 models into the metalearner. :-)

Just posting some extra updates on this:

Tested locally using all of the regression datasets. Results can be found here or just the charts here.

From this:

  • Agreed @rpeck! We should update the metalearner to use strong regularization for sure. ElasticNetCV seemed to perform better than our LinearRegressor on many datasets. This is tracked in https://github.com/alteryx/evalml/issues/1739
  • @dsherry and I re-discussed our data splitting strategy. Right now, we split off data for the ensemble, under the assumption that we want the metalearner to be trained on these ensembling indices. With the scikit-learn implementation, when we train our StackedEnsembler on this ensembling-indices split, we end up training both the input pipelines and the metalearner on this small set of data, which is likely why we are not performing well. While the parameters for our input pipelines come from tuning on the rest of the data, the pipelines themselves are not fitted. In the long term, rolling our own implementation could allow us to pass in trained pipelines to the ensembler, in which case we would get the behavior we want; for now, that is not the case.

Next step: Test this hypothesis by building the ensembler manually: train the input pipelines on 80% of the data, create cross-validated predictions on the data set aside for ensembling, and train the metalearner on those out-predictions (a simplified sketch follows below).
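A simplified scikit-learn-only sketch of that idea, using stand-in base models and only the out-of-fold-predictions part of the scheme (not evalml's exact ensembling split):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict, train_test_split

df = pd.read_csv("Happiness Data Full Set.csv")
y_all = df["Happiness"]
X_all = df.drop(columns=["Happiness"]).select_dtypes(include="number")
X_tr, X_ho, y_tr, y_ho = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

base_models = [RandomForestRegressor(random_state=0), ExtraTreesRegressor(random_state=0)]

# Metalearner features: cross-validated (out-of-fold) predictions on the training data,
# so the metalearner never sees predictions from a model trained on the same rows.
oof = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in base_models])
metalearner = ElasticNetCV().fit(oof, y_tr)

# At predict time, the base models are refit on all of the training data.
holdout_preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_ho) for m in base_models])
print("Stack MAE on holdout:", round(mean_absolute_error(y_ho, metalearner.predict(holdout_preds)), 3))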

Results from experimentation look good: https://alteryx.quip.com/4hEyAaTBZDap/Ensembling-Performance-Using-More-Data

Next steps:

After some digging around, we believe that the issue is not with how the ensemble performs, but rather with how we report the ensemble's performance. Currently, we make a separate ensembling split of 20% of the data, then do another train-validation split within it, and report the ensemble's score on that validation data. This means that in some cases the ensemble score is calculated on a very small number of rows (as with the happiness dataset above).

By removing the ensemble indices split and using our old method of calculating the CV training score for the ensemble (give it all the data, train and validate on one fold), we see that the ensemble is ranked higher in almost all cases and comes up as #1 in many more cases. Meanwhile, the validation score is the same or slightly better.
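In terms of the repro code above, a rough sketch of the re-scoring (whether this matches the final evalml implementation isn't shown here) is simply to hand the ensemble all of the training data instead of the reserved split:

# Score the ensemble like any other pipeline: train and validate on the full
# training data via the automl config's data splitter, with no reserved
# ensembling indices.
train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X, ww.DataColumn(y), JobLogger())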

Note that since we don't do any hyperparameter tuning, the input pipelines are not already fitted, and the ensemble only gets the out-predictions of the input pipelines as input, overfitting is not an issue. We can revisit implementing our own ensemble and updating the splitting strategy then, but for now we're able to see improvements by just changing the data split strategy while keeping scikit-learn's implementation.

Note that this will cause an increase in fit time when ensembling is enabled: all pipelines see more data (no reserved ensemble indices), and the ensemble is trained on more data. I think this is fine.

Results tabulated here: https://alteryx.quip.com/jI2mArnWZfTU/Ensembling-vs-Best-Pipeline-Validation-Scores#MKWACADlCDt
