Scikit-learn: Stratified GroupKFold

Created on 11 Apr 2019  ·  48 Comments  ·  Source: scikit-learn/scikit-learn

Description

Currently sklearn does not have a stratified group k-fold feature. We can use either stratification or group k-fold, but not both. It would be good to have both.

I would like to implement it, if we decide to have it.
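
For illustration, each existing splitter covers only half of this:

from sklearn.model_selection import GroupKFold, StratifiedKFold

StratifiedKFold(n_splits=5)  # preserves class proportions; ignores groups
GroupKFold(n_splits=5)       # keeps each group in one fold; ignores class balance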

All 48 comments

@TomDLT @NicolasHug What do you think?

Might be interesting in theory, but I'm not sure how useful it'd be in practice. We can certainly keep the issue open and see how many people request this feature.

Do you assume that each group is in a single class?

See also #9413

@jnothman Yes, I had a similar thing in mind. However, I see that the pull request is still open. I meant that a group will not be repeated across folds: if we have IDs as groups, then the same ID will not occur across multiple folds.

I understand this is relevant to the use of RFECV.
Currently this defaults to using a StratifiedKFold cv. Its fit() also takes groups=.
However, it appears that groups is not respected when executing fit(), and no warning is raised (might be considered a bug).
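
A quick sketch one could use to check this (has_group_leakage is just an illustrative name, not an sklearn API):

import numpy as np

def has_group_leakage(cv, X, y, groups):
    # True if any group id appears on both sides of some split.
    groups = np.asarray(groups)
    return any(set(groups[tr]) & set(groups[tt])
               for tr, tt in cv.split(X, y, groups=groups))

Running it with the splitter in question and getting True would confirm that groups were ignored.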

Grouping AND stratification are useful for quite imbalanced datasets with inter-record dependency
(in my case, the same individual has multiple records, but there are still a large number of groups (= people) relative to the number of splits; I imagine there would be practical problems as the number of unique groups in the minority class gets anywhere near the number of splits).

So: +1!

This would definitely be useful. For instance, working with highly imbalanced time-series medical data, keeping patients separate while (approximately) balancing the imbalanced classes in each fold.

I have also found that StratifiedKFold takes groups as a parameter but doesn't split according to them; this should probably be flagged.

Another good use of this feature would be financial data, which is usually very imbalanced. In my case, I have a highly imbalanced dataset with several records for the same entity (just at different points in time). We want to do a GroupKFold to avoid leakage, but also stratify, since due to the high imbalance we could end up with folds containing very few or no positives.

also see #14524 I think?

Another use case for Stratified GroupShuffleSplit and GroupKFold is biological "repeated measures" designs, where you have multiple samples per subject or other parent biological unit. Also, in many real-world datasets in biology there is class imbalance. Each group of samples has the same class, so it's important to stratify and keep groups together.

Description

Currently sklearn does not have a stratified group k-fold feature. We can use either stratification or group k-fold, but not both. It would be good to have both.

I would like to implement it, if we decide to have it.

Hi, I think it would be quite useful for medicine ML. Is it implemented already?

@amueller Do you think we should implement this, given that people are interested in this?

I'm very interested too... it would be really useful in spectroscopy when you have several replicate measurements for each of your samples; they really need to stay in the same fold during cross-validation. And if you have several unbalanced classes that you are trying to classify, you really want to use the stratify feature too. Therefore I vote for it too! Sorry, I'm not good enough to participate in the development, but those who take part in it can be sure it will be used :-)
Thumbs up for the whole team. Thanks!

Please look at the referenced issues and PRs in this thread, as work has at least been attempted on StratifiedGroupKFold. I've already done a StratifiedGroupShuffleSplit #15239, which just needs tests but which I've already used for my own work quite a bit.

I think we should implement it, but I still don't know what we actually want. @hermidalc has a restriction that members of the same group must be of the same class. That's not the general case, right?

It would be good if people that are interested could describe their use-case and what they really want out of this.

There are #15239 #14524 and #9413 which I remember all having different semantics.

@amueller totally agree with you, I spent a few hours today looking for something between the different versions available (#15239 #14524 and #9413) but couldn't really understand if any of these would fit my need. So here is my use case, if it can help:
I have 1000 samples. Each sample has been measured 3 times with a NIR spectrometer, so each sample has 3 replicates that I want to keep together all the way...
These 1000 samples belong to 6 different classes with very different numbers of samples in each:
class 1: 400 samples
class 2: 300 samples
class 3: 100 samples
class 4: 100 samples
class 5: 70 samples
class 6: 30 samples
I want to build a classifier for each class: class 1 vs. all other classes, then class 2 vs. all other classes, etc.
To maximize the accuracy of each of my classifiers, it is important to have samples of all 6 classes represented in each fold: my classes are not so different, so always having all 6 classes in each fold really helps to draw an accurate decision boundary.

This is why I believe a stratified (all 6 classes always represented in each fold) group (the 3 replicate measurements of each sample always kept together) k-fold seems to be very much what I am looking for here.
Any opinion?
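
A hedged sketch of that setup (illustrative names; StratifiedGroupKFold here refers to the implementations shared later in this thread, e.g. saved in a split.py):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

from split import StratifiedGroupKFold  # the class posted later in this thread

# X: spectra, y: integer class labels 0..5, groups: sample ids (the 3
# replicates of a physical sample share one id)
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for k in range(6):
    y_bin = (np.asarray(y) == k).astype(int)  # class k vs. all other classes
    scores = cross_val_score(LogisticRegression(), X, y_bin, groups=groups, cv=cv)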

My use case, and why I wrote up StratifiedGroupShuffleSplit, is to support repeated measures designs (https://en.wikipedia.org/wiki/Repeated_measures_design). In my use cases, members of the same group must be of the same class.

@fcoppey For you, the samples within a group always have the same class, right?

@hermidalc I'm not very familiar with the terminology, but from Wikipedia it sounds like "repeated measures design" doesn't mean the same group must be within the same class, as it says: "A crossover trial has a repeated measures design in which each patient is assigned to a sequence of two or more treatments, of which one may be a standard treatment or a placebo."
Relating this to an ML setting, you could either try to predict from measurements whether an individual just received treatment or placebo, or you could try to predict an outcome given the treatment.
For either of those, the class for the same individual could change, right?

Irrespective of the name, it sounds to me like you both have the same use case, while I was thinking about a case similar to what's described in the crossover study. Or maybe a bit simpler: you could have a patient become sick over time (or get better), so the outcome for a patient could change.

Actually the wikipedia article you link to explicitly says "Longitudinal analysis—Repeated measure designs allow researchers to monitor how participants change over time, both long- and short-term situations.", so I think that means that changing the class is included.
If there's another word that means that the measurement is done under the same conditions then we could use that word?

@amueller yes you’re right, I realized I miswrote above: I meant to say that the restriction holds in my use cases of this design, not for this design in general.

There can be many quite elaborate types of repeated measures designs, though in the two types for which I've needed StratifiedGroupShuffleSplit, the within-group same-class restriction holds (longitudinal sampling before and after treatment when predicting treatment response; multiple pre-treatment samples per subject at different body locations when predicting treatment response).

I needed something right away that works, so I wanted to put it out there for others to use and to get something started in sklearn. Plus, if I'm not mistaken, it's more complicated to design the stratification logic when within-group class labels can be different.

@amueller yes, always. They are replicates of the same measurement, in order to include the intra-variability of the device in the prediction.

@hermidalc yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that from the name it's somewhat clear what it does, and we should think about whether these two versions should live in the same class.

It should be quite easy to make StratifiedKFold do this. There are two options: ensure that each fold contains a similar number of samples, or ensure that each fold contains a similar number of groups.
The second one is trivial to do (by just pretending each group is a single point and passing it to StratifiedKFold). That's what you do in your PR, it looks like.

GroupKFold, I think, heuristically trades the two off by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach.
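
A minimal sketch of that "each group as a single point" idea, assuming every sample in a group shares one class label (stratified_group_kfold is an illustrative helper, not an sklearn API):

import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_group_kfold(y, groups, n_splits=5):
    groups = np.asarray(groups)
    unique_groups, first_idx = np.unique(groups, return_index=True)
    group_y = np.asarray(y)[first_idx]  # one class label per group
    for tr_g, tt_g in StratifiedKFold(n_splits=n_splits).split(unique_groups, group_y):
        # Expand the group-level folds back to sample indices.
        yield (np.flatnonzero(np.isin(groups, unique_groups[tr_g])),
               np.flatnonzero(np.isin(groups, unique_groups[tt_g])))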

Should we also add GroupStratifiedKFold in the same PR? Or leave that for later?
The other PRs have slightly different goals. It would be good if someone could write up what the different use-cases are (I probably don't have the time right now).

+1 for separately handling the group constraint where all samples have the same class.

@hermidalc yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that from the name it's somewhat clear what it does, and we should think about whether these two versions should live in the same class.

I'm not totally understanding this: a StratifiedGroupShuffleSplit and StratifiedGroupKFold where members of each group can be of different classes should have exactly the same split behavior when the user specifies all group members to be of the same class. We can just improve the internals later and existing behavior will stay the same?

The second one is trivial to do (by just pretending each group is a single point and passing it to StratifiedKFold). That's what you do in your PR, it looks like.

GroupKFold, I think, heuristically trades the two off by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach.

Should we also add GroupStratifiedKFold in the same PR? Or leave that for later?
The other PRs have slightly different goals. It would be good if someone could write up what the different use-cases are (I probably don't have the time right now).

I will add StratifiedGroupKFold using the "each group as a single sample" approach I used.

It would be good if people that are interested could describe their use-case and what they really want out of this.

Very common use-case in medicine and biology when you have repeated measures.
An example: Assume you want to classify a disease, e.g. Alzheimer's disease (AD) vs. healthy controls from MR images. For the same subject, you might have several scans (from follow-up sessions or longitudinal data). Let's assume you have a total of 1000 subjects, 200 of them being diagnosed with AD (imbalanced classes). Most subjects have one scan, but for some of them 2 or 3 images are available. When training/testing the classifier, you want to make sure that images from the same subject are always in the same fold to avoid data leakage.
It's best to use StratifiedGroupKFold for this: stratify to account for class imbalance but with the group constraint that a subject must not appear in different folds.
NB: It would be nice to make it repeatable.

Below is an example implementation, inspired by a Kaggle kernel.

import numpy as np
from collections import Counter, defaultdict
from sklearn.utils import check_random_state

class RepeatedStratifiedGroupKFold():

    def __init__(self, n_splits=5, n_repeats=1, random_state=None):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state

    # Implementation based on this kaggle kernel:
    #    https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def split(self, X, y=None, groups=None):
        k = self.n_splits
        def eval_y_counts_per_fold(y_counts, fold):
            # Tentatively add this group's label counts to `fold`, then measure
            # how uneven the per-label fractions across folds would become.
            y_counts_per_fold[fold] += y_counts
            std_per_label = []
            for label in range(labels_num):
                label_std = np.std(
                    [y_counts_per_fold[i][label] / y_distr[label] for i in range(k)]
                )
                std_per_label.append(label_std)
            y_counts_per_fold[fold] -= y_counts
            return np.mean(std_per_label)

        rnd = check_random_state(self.random_state)
        for repeat in range(self.n_repeats):
            labels_num = np.max(y) + 1
            y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
            y_distr = Counter()
            for label, g in zip(y, groups):
                y_counts_per_group[g][label] += 1
                y_distr[label] += 1

            y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
            groups_per_fold = defaultdict(set)

            groups_and_y_counts = list(y_counts_per_group.items())
            rnd.shuffle(groups_and_y_counts)

            # Assign the most label-imbalanced groups first; each goes to the
            # fold where it least disturbs stratification across folds.
            for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
                best_fold = None
                min_eval = None
                for i in range(k):
                    fold_eval = eval_y_counts_per_fold(y_counts, i)
                    if min_eval is None or fold_eval < min_eval:
                        min_eval = fold_eval
                        best_fold = i
                y_counts_per_fold[best_fold] += y_counts
                groups_per_fold[best_fold].add(g)

            all_groups = set(groups)
            for i in range(k):
                train_groups = all_groups - groups_per_fold[i]
                test_groups = groups_per_fold[i]

                train_indices = [i for i, g in enumerate(groups) if g in train_groups]
                test_indices = [i for i, g in enumerate(groups) if g in test_groups]

                yield train_indices, test_indices

Comparing RepeatedStratifiedKFold (samples of the same group might appear in both folds) with RepeatedStratifiedGroupKFold:

import matplotlib.pyplot as plt
from sklearn import model_selection

def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=plt.cm.coolwarm,
                   vmin=-.2, vmax=1.2)

    ax.scatter(range(len(X)), [ii + 1.5] * len(X), c=y, marker='_',
               lw=lw, cmap=plt.cm.Paired)
    ax.scatter(range(len(X)), [ii + 2.5] * len(X), c=group, marker='_',
               lw=lw, cmap=plt.cm.tab20c)

    yticklabels = list(range(n_splits)) + ['class', 'group']
    ax.set(yticks=np.arange(n_splits+2) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits+2.2, -.2], xlim=[0, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)


# demonstration
np.random.seed(1338)
n_splits = 4
n_repeats = 5


# Generate the class/group data
n_points = 100
X = np.random.randn(n_points, 10)

percentiles_classes = [.4, .6]
y = np.hstack([[ii] * int(n_points * perc) for ii, perc in enumerate(percentiles_classes)])

# Evenly spaced groups
g = np.hstack([[ii] * 5 for ii in range(20)])


fig, ax = plt.subplots(1, 2, figsize=(14, 4))

cv_nogrp = model_selection.RepeatedStratifiedKFold(n_splits=n_splits,
                                                   n_repeats=n_repeats,
                                                   random_state=1338)
cv_grp = RepeatedStratifiedGroupKFold(n_splits=n_splits,
                                      n_repeats=n_repeats,
                                      random_state=1338)

plot_cv_indices(cv_nogrp, X, y, g, ax[0], n_splits * n_repeats)
plot_cv_indices(cv_grp, X, y, g, ax[1], n_splits * n_repeats)

plt.show()

[Figure: RepeatedStratifiedGroupKFold demo — fold assignments with and without the group constraint]

+1 for StratifiedGroupKFold. I am trying to detect falls of seniors using sensor data from a smart watch. Since we don't have much fall data, we do simulations with different watches, which yield different classes. I also do augmentations on the data before training: from each data point I create 9 points, and these form a group. It is important that a group is not in both train and test, as explained.
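
A minimal sketch of how one might build the groups for such an augmentation setup (counts are illustrative):

import numpy as np

n_original = 100                              # number of original data points
groups = np.repeat(np.arange(n_original), 9)  # the 9 augmented copies of each
                                              # original point share a group id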

I would like to be able to use StratifiedGroupKFold as well. I am looking at a dataset for predicting financial crises, where the years before, during, and after each crisis form their own group. During training and cross-validation, members of each group should not leak between the folds.

Is there any way to generalize this to the multilabel scenario (MultilabelStratifiedGroupKFold)?

+1 for this. We're analyzing user accounts for spam, so we want to group by user, but also stratify because spam is relatively low-incidence. For our use case, any user who spams once is flagged as a spammer in all data, so a group member will always have the same label.

Thanks for providing a classic use case to frame the documentation, @philip-iv!

I added a StratifiedGroupKFold implementation to the same PR #15239 as StratifiedGroupShuffleSplit.

Though as you can see in the PR, the logic for both is much simpler than https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-557802602, because mine only attempts to preserve the percentage of groups for each class (not the percentage of samples), so that I can leverage the existing StratifiedKFold and StratifiedShuffleSplit code by passing them unique group information. But both implementations do produce folds where each group's samples stay together in the same fold.

Though I would vote for more sophisticated methods based on https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-557802602

Here are full-fledged versions of StratifiedGroupKFold and RepeatedStratifiedGroupKFold using the code @mrunibe provided, which I further simplified and where I changed a couple of things. These classes also follow the design of other sklearn CV classes of the same type.

from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


class RepeatedStratifiedGroupKFold(_RepeatedSplits):
    """Repeated Stratified K-Fold cross validator.

    Repeats Stratified K-Fold with non-overlapping groups n times with
    different randomization in each repetition.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    n_repeats : int, default=10
        Number of times cross-validator needs to be repeated.

    random_state : int or RandomState instance, default=None
        Controls the generation of the random states for each repetition.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import RepeatedStratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
    ...                                   random_state=36851234)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
    TRAIN: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
     TEST: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
    TRAIN: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]
     TEST: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
    TRAIN: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
     TEST: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]

    Notes
    -----
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting `random_state`
    to an integer.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        super().__init__(StratifiedGroupKFold, n_splits=n_splits,
                         n_repeats=n_repeats, random_state=random_state)

@hermidalc I'm quite confused of what we've resolved when looking back at this from time to time. (Unfortunately my time is not what it used to be!) Can you give me an idea of what you'd recommend be included in scikit-learn?

@hermidalc I'm quite confused of what we've resolved when looking back at this from time to time. (Unfortunately my time is not what it used to be!) Can you give me an idea of what you'd recommend be included in scikit-learn?

I've wanted to do a better implementation than what I did in #15239. The implementation in that PR works, but it stratifies on the groups to make the logic straightforward, and this isn't ideal.

So what I did above (thanks to @mrunibe and the kaggle kernel from jakubwasikowski) is a better implementation of StratifiedGroupKFold that stratifies on the samples. I want to port the same logic to do a better StratifiedGroupShuffleSplit and then it will be ready. I will put the new code in #15239 to replace the older implementation.

I apologize regarding my unfinished PRs; I'm doing my PhD so I never have time!

Thank you @hermidalc and @mrunibe for providing the implementation. I have also been looking for a StratifiedGroupKFold method for dealing with medical data that has strong class imbalance and a greatly varying number of samples per subject. GroupKFold by itself creates training data subsets containing only one class.

I want to port the same logic to do a better StratifiedGroupShuffleSplit and then it will be ready.

We could certainly consider merging StratifiedGroupKFold before StratifiedGroupShuffleSplit is ready.

I apologize regarding my unfinished PRs; I'm doing my PhD so I never have time!

Let us know if you want support completing it!

And good luck with your PhD work

Here are full-fledged versions of StratifiedGroupKFold and RepeatedStratifiedGroupKFold using the code @mrunibe provided, which I further simplified and where I changed a couple of things. These classes also follow the design of other sklearn CV classes of the same type.

Is it possible to try this out? I tried to cut and paste it with some of the various dependencies, but it was never-ending. I would love to give this class a try in my project; just trying to see if there is a way available now to do that.

@hermidalc Hope your PhD work has succeeded!
I'm looking forward to seeing this implementation done as well, since my PhD work in Geosciences needs this stratification feature with group control. I've spent some hours implementing this idea of splitting manually in my project, but I gave up finishing it for the same reason... PhD progress. So, I can totally understand how PhD work can torture a person's time. LOL No pressure. For now, I use GroupShuffleSplit as an alternative.

Cheers

@bfeeny @dispink it's very easy to use the two classes I wrote above. Create a file, e.g. split.py, with the following. Then in your user code, if the script is in the same directory as split.py, you simply do from split import StratifiedGroupKFold, RepeatedStratifiedGroupKFold

from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


class RepeatedStratifiedGroupKFold(_RepeatedSplits):
    """Repeated Stratified K-Fold cross validator.

    Repeats Stratified K-Fold with non-overlapping groups n times with
    different randomization in each repetition.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    n_repeats : int, default=10
        Number of times cross-validator needs to be repeated.

    random_state : int or RandomState instance, default=None
        Controls the generation of the random states for each repetition.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import RepeatedStratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
    ...                                   random_state=36851234)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
    TRAIN: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
     TEST: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
    TRAIN: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]
     TEST: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
    TRAIN: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
     TEST: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]

    Notes
    -----
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting `random_state`
    to an integer.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        super().__init__(StratifiedGroupKFold, n_splits=n_splits,
                         n_repeats=n_repeats, random_state=random_state)

@hermidalc Thank you for the positive reply!
I quickly adopted it as you described. However, I only get splits that have data in either the training or the test set. As far as I understand the code description, there is no parameter to specify the proportion between training and test sets, right?
I know it's a conflict between stratification, group control and dataset proportions... That's why I gave up continuing... But maybe we can still find a compromise to work around it.
[screenshot]

Sincerely

@hermidalc Thank you for the positive reply!
I quickly adopted it as you described. However, I only get splits that have data in either the training or the test set. As far as I understand the code description, there is no parameter to specify the proportion between training and test sets, right?
I know it's a conflict between stratification, group control and dataset proportions... That's why I gave up continuing... But maybe we can still find a compromise to work around it.

To test, I made the split.py and then ran this example in ipython, and it works. I've been using these custom CV iterators in my work for a long time and they have no issues on my side. BTW, I'm using scikit-learn 0.22.2, not 0.23.x, so I'm not sure if that is the cause of the issue. Could you please try to run the example below and see if you can reproduce it? If you can, then it might be something with the y and groups in your work.

In [6]: import numpy as np 
   ...: from split import StratifiedGroupKFold 
   ...:  
   ...: X = np.ones((17, 2)) 
   ...: y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]) 
   ...: cv = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=777) 
   ...: for train_idxs, test_idxs in cv.split(X, y, groups): 
   ...:     print("TRAIN:", groups[train_idxs]) 
   ...:     print("      ", y[train_idxs]) 
   ...:     print(" TEST:", groups[test_idxs]) 
   ...:     print("      ", y[test_idxs]) 
   ...:                                                                                                                                                                                                    
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]

There does seem to be regular interest in this feature, @hermidalc, and we could likely find someone to finish it off if you didn't mind.

@hermidalc 'You have to make sure that every sample in the same group has the same class label.' Obviously that's the problem: my samples in the same group don't share the same class. Mmm... it seems to be another branch of development.
Thank you very much anyway.

@hermidalc 'You have to make sure that every sample in the same group has the same class label.' Obviously that's the problem: my samples in the same group don't share the same class. Mmm... it seems to be another branch of development.
Thank you very much anyway.

Yes, this has been discussed in various threads here. It's another, more complex use case that is useful, but many like myself don't currently need it; we needed something which keeps groups together yet stratifies on the samples. The requirement of the code above is that all the samples in each group belong to the same class.

Actually @dispink I was wrong, this algorithm does not require that all members of a group belong to the same class. For example:

In [2]: X = np.ones((17, 2)) 
   ...: y =      np.array([0, 2, 1, 1, 2, 0, 0, 1, 2, 1, 1, 1, 0, 2, 0, 1, 0]) 
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]) 
   ...: cv = StratifiedGroupKFold(n_splits=3) 
   ...: for train_idxs, test_idxs in cv.split(X, y, groups): 
   ...:     print("TRAIN:", groups[train_idxs]) 
   ...:     print("      ", y[train_idxs]) 
   ...:     print(" TEST:", groups[test_idxs]) 
   ...:     print("      ", y[test_idxs]) 
   ...:                                                                                                                                                                                                    
TRAIN: [1 1 2 2 3 3 3 4 8 8]
       [0 2 1 1 2 0 0 1 1 0]
 TEST: [5 5 5 5 6 6 7]
       [2 1 1 1 0 2 0]
TRAIN: [1 1 4 5 5 5 5 6 6 7 8 8]
       [0 2 1 2 1 1 1 0 2 0 1 0]
 TEST: [2 2 3 3 3]
       [1 1 2 0 0]
TRAIN: [2 2 3 3 3 5 5 5 5 6 6 7]
       [1 1 2 0 0 2 1 1 1 0 2 0]
 TEST: [1 1 4 8 8]
       [0 2 1 1 0]

So I'm not quite sure what is going on with your data, since even with your screenshots one cannot truly see what your data layout is and what might be happening. I would suggest you first reproduce the examples I showed here to make sure it's not a scikit-learn version issue (since I'm using 0.22.2), and if you can reproduce them, then I would suggest you start from small parts of your data and test it. Using ~104k samples is difficult to troubleshoot.

@hermidalc Thank you for the reply!
I actually can reproduce the result above, so I'm troubleshooting with smaller data now.

+1

Anyone mind if I pick this issue up?
It seems that #15239, together with https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-600894432, already have an implementation, and only unit tests are left to do.
