Evalml: BalancedClassificationDataCVSplit produces different splits each time it's called

Created on 16 Mar 2021  ·  3 Comments  ·  Source: alteryx/evalml

Repro

import joblib
from evalml.demos import load_fraud
from evalml.preprocessing.data_splitters import BalancedClassificationDataCVSplit

splitter = BalancedClassificationDataCVSplit(n_splits=3, random_seed=0, shuffle=True)

X, y = load_fraud(5000)
X = X.to_dataframe()
y = y.to_series().astype("int")

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('75f1b95d7ce307ac6c793055330969aa', '8c89fe1a592c50a700b6d5cbb02dba8b')
('f8c849bbfbed37c13f66c5c742e237cb', '9c4879fb550fded8be9ac03e95a1bf95')
('cdc21f0d6bbf45459c9695258f7f04dc', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('bf462b82af243c552ac48acad2dfd748', '8c89fe1a592c50a700b6d5cbb02dba8b')
('b8341b536c63c7957c099b05e315f49c', '9c4879fb550fded8be9ac03e95a1bf95')
('780e74673b601790037fc0b17dde56fe', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('385f6c538568ad3a33cf84f61d94144c', '8c89fe1a592c50a700b6d5cbb02dba8b')
('8db65d0a3bdf87ae0f135b9766a260dd', '9c4879fb550fded8be9ac03e95a1bf95')
('2a7293fc1308b8a572091d7c76d20205', '5b575765bbe176e732b8eb4dc1bf2822')

This is different from the behavior of the sklearn splitter:

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

I think this is problematic for two reasons:

  1. Since BalancedClassificationDataCVSplit is the default splitter in automl, our pipelines are evaluated on different splits.
  2. Since split modifies the state of the data splitter, we'll get different results between the sequential and parallel engines.
bug
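The stateful behavior can be sketched with a toy splitter (the class names and RNG details here are hypothetical illustrations, not evalml's actual implementation): one splitter shares a single RNG across calls, the other rebuilds its RNG from the seed on every call, matching the sklearn contract.

```python
import numpy as np

class StatefulSplitter:
    """Hypothetical sketch of the reported behavior: one RNG is shared
    across calls, so each split() call advances it and returns a
    different shuffle."""
    def __init__(self, random_seed=0):
        self._rng = np.random.default_rng(random_seed)

    def split(self, n):
        return self._rng.permutation(n)

class StatelessSplitter:
    """Sketch of the sklearn-style contract: the RNG is rebuilt from the
    seed on every call, so repeated calls return identical shuffles."""
    def __init__(self, random_seed=0):
        self.random_seed = random_seed

    def split(self, n):
        return np.random.default_rng(self.random_seed).permutation(n)

stateful = StatefulSplitter(random_seed=0)
a, b = stateful.split(1000), stateful.split(1000)
print(np.array_equal(a, b))  # consecutive calls disagree

stateless = StatelessSplitter(random_seed=0)
c, d = stateless.split(1000), stateless.split(1000)
print(np.array_equal(c, d))  # → True
```

The second splitter reproduces sklearn's behavior: identical output on every call with the same seed.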

All 3 comments

Thanks for pointing this out.

Personally, this behavior doesn't bother me. As long as every time we initialize with a certain seed, we get the same sequence of output after that point, we're good. I'd be concerned if we were not respecting the random seed; but that's not what this issue tracks.

My recommendation: do nothing. As such, closing.

@freddyaboulton if you disagree about this behavior, let's duke it out, I mean talk 😅

@dsherry I think this is worth changing for two reasons:

  1. It introduces variance to automl search because different pipelines are evaluated on different data. This makes the rankings table slightly misleading because the scores are not computed on the same data.
  2. It's bad for parallel automl search

    Let me elaborate on 2. With the current behavior, the sequential engine is expected to modify the state of the data splitter throughout search. In parallel evalml, we pickle the data splitter and send it to workers to compute the split. Since the workers get a copy of the splitter, they don't modify the state of the original data splitter.

This introduces a difference in behavior between the sequential and parallel engines because the splits would not match, depending on the order in which pipelines are evaluated! This means that the same pipeline/parameter combo would get different results in the sequential engine and parallel engine, and I think that's undesirable.

In my opinion, point 1 is reason enough to fix this because all of our pipelines should be evaluated on the same data if we want to be able to compare them meaningfully. But as we move towards parallel evalml, I think it's important we make sure that modifying global state is not part of our expected behavior.
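The sequential/parallel divergence can be demonstrated with a toy stateful splitter (hypothetical names, not evalml internals): a pickled copy freezes the RNG state it had at pickling time, so a worker replays an earlier split while the sequential engine keeps advancing the original.

```python
import pickle
import numpy as np

class StatefulSplitter:
    # Hypothetical stand-in for a splitter whose split() mutates shared RNG state.
    def __init__(self, random_seed=0):
        self._rng = np.random.default_rng(random_seed)

    def split(self, n):
        return self._rng.permutation(n)

splitter = StatefulSplitter(random_seed=0)
worker_copy = pickle.loads(pickle.dumps(splitter))  # snapshot shipped to a worker

first = splitter.split(1000)   # sequential engine: advances the original's RNG
second = splitter.split(1000)

# The worker's copy still holds the pre-split RNG state, so it replays
# the first split instead of producing the second one:
replay = worker_copy.split(1000)
print(np.array_equal(replay, first))   # → True
print(np.array_equal(replay, second))  # second split differs from the replay
```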

The plan moving forward:

  1. Fix this issue by modifying the BalancedClassificationDataCVSplit
  2. Longer term we'd like to write tests that verify we don't feed different splits to different pipelines in automl search.
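As a sketch of the regression test in step 2, one could assert that repeated `split` calls hash identically. Shown here with sklearn's StratifiedKFold, which already honors this contract; the data and variable names are illustrative, and the evalml fix would make BalancedClassificationDataCVSplit pass the same check.

```python
import joblib
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 50 rows, perfectly balanced binary target.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

splitter = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

# Hash each fold twice; a deterministic splitter must produce
# identical hashes on every pass.
first_pass = [joblib.hash((train, test)) for train, test in splitter.split(X, y)]
second_pass = [joblib.hash((train, test)) for train, test in splitter.split(X, y)]
print(first_pass == second_pass)  # → True
```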

Thanks for the discussion everyone!
