Evalml: BalancedClassificationDataCVSplit produces different splits each time it's called

Created on 16 Mar 2021  ·  3 Comments  ·  Source: alteryx/evalml

Repro

import joblib
from evalml.demos import load_fraud
from evalml.preprocessing.data_splitters import BalancedClassificationDataCVSplit

splitter = BalancedClassificationDataCVSplit(n_splits=3, random_seed=0, shuffle=True)

X, y = load_fraud(5000)
X = X.to_dataframe()
y = y.to_series().astype("int")

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('75f1b95d7ce307ac6c793055330969aa', '8c89fe1a592c50a700b6d5cbb02dba8b')
('f8c849bbfbed37c13f66c5c742e237cb', '9c4879fb550fded8be9ac03e95a1bf95')
('cdc21f0d6bbf45459c9695258f7f04dc', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('bf462b82af243c552ac48acad2dfd748', '8c89fe1a592c50a700b6d5cbb02dba8b')
('b8341b536c63c7957c099b05e315f49c', '9c4879fb550fded8be9ac03e95a1bf95')
('780e74673b601790037fc0b17dde56fe', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('385f6c538568ad3a33cf84f61d94144c', '8c89fe1a592c50a700b6d5cbb02dba8b')
('8db65d0a3bdf87ae0f135b9766a260dd', '9c4879fb550fded8be9ac03e95a1bf95')
('2a7293fc1308b8a572091d7c76d20205', '5b575765bbe176e732b8eb4dc1bf2822')

This is different from the behavior of the sklearn splitter:

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

I think this is problematic for two reasons:

  1. Since BalancedClassificationDataCVSplit is the default splitter in automl, our pipelines are evaluated on different splits.
  2. Since split modifies the state of the data splitter, we'll get different results between the sequential and parallel engines.
bug
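The stateful behavior can be sketched with a toy splitter (the class names and RNG details here are hypothetical illustrations, not evalml's actual implementation): one splitter shares a single RNG across calls, the other rebuilds its RNG from the seed on every call, matching the sklearn contract.

```python
import numpy as np

class StatefulSplitter:
    """Hypothetical sketch of the reported behavior: one RNG is shared
    across calls, so each split() call advances it and returns a
    different shuffle."""
    def __init__(self, random_seed=0):
        self._rng = np.random.default_rng(random_seed)

    def split(self, n):
        return self._rng.permutation(n)

class StatelessSplitter:
    """Sketch of the sklearn-style contract: the RNG is rebuilt from the
    seed on every call, so repeated calls return identical shuffles."""
    def __init__(self, random_seed=0):
        self.random_seed = random_seed

    def split(self, n):
        return np.random.default_rng(self.random_seed).permutation(n)

stateful = StatefulSplitter(random_seed=0)
a, b = stateful.split(1000), stateful.split(1000)
print(np.array_equal(a, b))  # consecutive calls disagree

stateless = StatelessSplitter(random_seed=0)
c, d = stateless.split(1000), stateless.split(1000)
print(np.array_equal(c, d))  # → True
```

The second splitter reproduces sklearn's behavior: identical output on every call with the same seed.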

All 3 comments

Thanks for pointing this out.

Personally, this behavior doesn't bother me. As long as every time we initialize with a certain seed, we get the same sequence of output after that point, we're good. I'd be concerned if we were not respecting the random seed; but that's not what this issue tracks.

My recommendation: do nothing. As such, closing.

@freddyaboulton if you disagree about this behavior, let's duke it out, I mean talk 😅

@dsherry I think this is worth changing for two reasons:

  1. It introduces variance to automl search because different pipelines are evaluated on different data. This makes the rankings table slightly misleading because the scores are not computed on the same data.
  2. It's bad for parallel automl search

    Let me elaborate on 2. With the current behavior, the sequential engine is expected to modify the state of the data splitter throughout search. In parallel evalml, we pickle the data splitter and send it to workers to compute the split. Since the workers get a copy of the splitter, they don't modify the state of the original data splitter.

This introduces a difference in behavior between the sequential and parallel engines because the splits would not match, depending on the order in which pipelines are evaluated! This means that the same pipeline/parameter combo would get different results in the sequential engine and parallel engine, and I think that's undesirable.

In my opinion, point 1 is reason enough to fix this because all of our pipelines should be evaluated on the same data if we want to be able to compare them meaningfully. But as we move towards parallel evalml, I think it's important we make sure that modifying global state is not part of our expected behavior.
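The sequential/parallel divergence can be demonstrated with a toy stateful splitter (hypothetical names, not evalml internals): a pickled copy freezes the RNG state it had at pickling time, so a worker replays an earlier split while the sequential engine keeps advancing the original.

```python
import pickle
import numpy as np

class StatefulSplitter:
    # Hypothetical stand-in for a splitter whose split() mutates shared RNG state.
    def __init__(self, random_seed=0):
        self._rng = np.random.default_rng(random_seed)

    def split(self, n):
        return self._rng.permutation(n)

splitter = StatefulSplitter(random_seed=0)
worker_copy = pickle.loads(pickle.dumps(splitter))  # snapshot shipped to a worker

first = splitter.split(1000)   # sequential engine: advances the original's RNG
second = splitter.split(1000)

# The worker's copy still holds the pre-split RNG state, so it replays
# the first split instead of producing the second one:
replay = worker_copy.split(1000)
print(np.array_equal(replay, first))   # → True
print(np.array_equal(replay, second))  # second split differs from the replay
```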

The plan moving forward:

  1. Fix this issue by modifying the BalancedClassificationDataCVSplit
  2. Longer term we'd like to write tests that verify we don't feed different splits to different pipelines in automl search.
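As a sketch of the regression test in step 2, one could assert that repeated `split` calls hash identically. Shown here with sklearn's StratifiedKFold, which already honors this contract; the data and variable names are illustrative, and the evalml fix would make BalancedClassificationDataCVSplit pass the same check.

```python
import joblib
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 50 rows, perfectly balanced binary target.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

splitter = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

# Hash each fold twice; a deterministic splitter must produce
# identical hashes on every pass.
first_pass = [joblib.hash((train, test)) for train, test in splitter.split(X, y)]
second_pass = [joblib.hash((train, test)) for train, test in splitter.split(X, y)]
print(first_pass == second_pass)  # → True
```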

Thanks for the discussion everyone!
