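Scikit-learn: Stratified GroupKFold

Created on 11 Apr 2019 · 48 comments · Source: scikit-learn/scikit-learn

Description

Currently sklearn does not have a stratified group k-fold feature. We can use either stratification or group k-fold, but it would be good to have both.

I would like to implement it, if we decide to have it.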

aditya1702

👍58

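Most helpful comment

> It would be good if people who are interested could describe their use case and what they really want out of this.

A very common use case in medicine and biology is when you have repeated measures.
An example: suppose you want to classify a disease, e.g. Alzheimer's disease (AD) vs. healthy controls, from MR images. For the same subject you may have several scans (from follow-up sessions or longitudinal data). Let's assume you have a total of 1000 subjects, 200 of them diagnosed with AD (unbalanced classes). Most subjects have one scan, but for some of them 2 or 3 images are available. When training/testing the classifier, you want to make sure that images from the same subject are always in the same fold, to avoid data leakage.
It is best to use a StratifiedGroupKFold for this: stratify to account for the class imbalance, but with the group constraint that a subject must not appear in different folds.
NB: It would be nice to make it repeatable.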
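The group constraint described above can be checked mechanically. The sketch below is only an illustration: the helper name `leaking_groups` and the toy arrays are made up for this example, while StratifiedKFold and GroupKFold are existing scikit-learn splitters.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold


def leaking_groups(cv, X, y, groups):
    """Return the group ids that occur in both train and test of any split."""
    groups = np.asarray(groups)
    leaked = set()
    for train_idx, test_idx in cv.split(X, y, groups):
        leaked |= set(groups[train_idx].tolist()) & set(groups[test_idx].tolist())
    return leaked


# Toy data: subjects 0 and 2 have three scans each, subjects 1 and 3 have one.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
groups = np.array([0, 0, 0, 1, 2, 2, 2, 3])
X = np.zeros((len(y), 1))

leaks_stratified = leaking_groups(StratifiedKFold(n_splits=2), X, y, groups)
leaks_grouped = leaking_groups(GroupKFold(n_splits=2), X, y, groups)
print("StratifiedKFold leaks groups:", sorted(leaks_stratified))
print("GroupKFold leaks groups:", sorted(leaks_grouped))
```

With two stratified folds, the three scans of a subject cannot all land in the same half of their class, so plain stratification has to split that subject across train and test.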

Below is an example implementation, inspired by kaggle-kernel.

import numpy as np
from collections import Counter, defaultdict
from sklearn.utils import check_random_state

class RepeatedStratifiedGroupKFold():

    def __init__(self, n_splits=5, n_repeats=1, random_state=None):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state

    # Implementation based on this kaggle kernel:
    #    https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def split(self, X, y=None, groups=None):
        k = self.n_splits
        def eval_y_counts_per_fold(y_counts, fold):
            y_counts_per_fold[fold] += y_counts
            std_per_label = []
            for label in range(labels_num):
                label_std = np.std(
                    [y_counts_per_fold[i][label] / y_distr[label] for i in range(k)]
                )
                std_per_label.append(label_std)
            y_counts_per_fold[fold] -= y_counts
            return np.mean(std_per_label)

        rnd = check_random_state(self.random_state)
        for repeat in range(self.n_repeats):
            labels_num = np.max(y) + 1
            y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
            y_distr = Counter()
            for label, g in zip(y, groups):
                y_counts_per_group[g][label] += 1
                y_distr[label] += 1

            y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
            groups_per_fold = defaultdict(set)

            groups_and_y_counts = list(y_counts_per_group.items())
            rnd.shuffle(groups_and_y_counts)

            for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
                best_fold = None
                min_eval = None
                for i in range(k):
                    fold_eval = eval_y_counts_per_fold(y_counts, i)
                    if min_eval is None or fold_eval < min_eval:
                        min_eval = fold_eval
                        best_fold = i
                y_counts_per_fold[best_fold] += y_counts
                groups_per_fold[best_fold].add(g)

            all_groups = set(groups)
            for i in range(k):
                train_groups = all_groups - groups_per_fold[i]
                test_groups = groups_per_fold[i]

                train_indices = [idx for idx, g in enumerate(groups) if g in train_groups]
                test_indices = [idx for idx, g in enumerate(groups) if g in test_groups]

                yield train_indices, test_indices
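
The criterion that `eval_y_counts_per_fold` minimises can be looked at in isolation. The following sketch (the function name `imbalance_score` and the example counts are illustrative and not part of the posted class) computes, for every label, the standard deviation across folds of each fold's share of that label, then averages over labels. Lower values indicate more evenly stratified folds.

```python
import numpy as np


def imbalance_score(counts_per_fold, total_per_label):
    """Mean over labels of the std across folds of each fold's share of a label."""
    counts = np.asarray(counts_per_fold, dtype=float)  # shape (n_folds, n_labels)
    totals = np.asarray(total_per_label, dtype=float)  # shape (n_labels,)
    shares = counts / totals  # fraction of each label held by each fold
    return float(np.mean(np.std(shares, axis=0)))


totals = [4, 6]
# Every fold holds half of each label.
balanced = imbalance_score([[2, 3], [2, 3]], totals)
# Each label sits entirely in a single fold.
unbalanced = imbalance_score([[4, 0], [0, 6]], totals)
print(balanced, unbalanced)
```

The greedy loop in `split` tries each fold for the next group and keeps the one that yields the lowest such score.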

Comparison of RepeatedStratifiedKFold (samples of the same group can appear in both folds) with RepeatedStratifiedGroupKFold:

import matplotlib.pyplot as plt
from sklearn import model_selection

def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=plt.cm.coolwarm,
                   vmin=-.2, vmax=1.2)

    ax.scatter(range(len(X)), [ii + 1.5] * len(X), c=y, marker='_',
               lw=lw, cmap=plt.cm.Paired)
    ax.scatter(range(len(X)), [ii + 2.5] * len(X), c=group, marker='_',
               lw=lw, cmap=plt.cm.tab20c)

    yticklabels = list(range(n_splits)) + ['class', 'group']
    ax.set(yticks=np.arange(n_splits+2) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits+2.2, -.2], xlim=[0, 100])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)


# demonstration
np.random.seed(1338)
n_splits = 4
n_repeats=5


# Generate the class/group data
n_points = 100
X = np.random.randn(n_points, 10)

percentiles_classes = [.4, .6]
y = np.hstack([[ii] * int(n_points * perc) for ii, perc in enumerate(percentiles_classes)])

# Evenly spaced groups
g = np.hstack([[ii] * 5 for ii in range(20)])


fig, ax = plt.subplots(1,2, figsize=(14,4))

cv_nogrp = model_selection.RepeatedStratifiedKFold(n_splits=n_splits,
                                                   n_repeats=n_repeats,
                                                   random_state=1338)
cv_grp = RepeatedStratifiedGroupKFold(n_splits=n_splits,
                                      n_repeats=n_repeats,
                                      random_state=1338)

plot_cv_indices(cv_nogrp, X, y, g, ax[0], n_splits * n_repeats)
plot_cv_indices(cv_grp, X, y, g, ax[1], n_splits * n_repeats)

plt.show()

[Figure: RepeatedStratifiedGroupKFold_demo]

mrunibe on 23 Nov 2019

👍24

All 48 comments

@TomDLT @NicolasHug What do you think?

aditya1702 on 12 Apr 2019

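It might be interesting in theory, but I'm not sure how useful it would be in practice. We can certainly keep the issue open and see how many people request this feature.

NicolasHug on 13 Apr 2019

Do you assume that each group is in a single class?

jnothman on 15 Apr 2019

See also #9413

jnothman on 15 Apr 2019

@jnothman Yes, I had a similar thing in mind. However, I see that the pull request is still open. I meant that a group will not be repeated across folds. If we have IDs as groups, then the same ID will not occur across multiple folds.

aditya1702 on 15 Apr 2019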

👍5

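I understand this is relevant to the use of RFECV.
Currently it defaults to using a StratifiedKFold cv. Its fit() also takes groups=.
However, it appears that groups are not respected when executing fit(). There is no warning (which might be considered a bug).

Grouping and stratification are useful for quite imbalanced datasets with inter-record dependency
(in my case, the same individual has multiple records, but there is still a large number of groups = people relative to the number of splits; I imagine there would be practical problems as the number of unique groups of the minority class gets down near the number of splits).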
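One way to make sure the groups are used, sketched here under the assumption of a reasonably recent scikit-learn, is to pass a group-aware splitter such as GroupKFold explicitly as `cv` and hand `groups` to `fit()`. The synthetic data and the LogisticRegression estimator below are placeholders chosen purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic data: 12 individuals (groups) with 5 records each.
groups = np.repeat(np.arange(12), 5)
y = np.tile([0, 1], 30)  # both classes occur in every group
X = rng.normal(size=(60, 6))
X[:, 0] += 2.0 * y  # make the first feature informative

# An explicit group-aware splitter replaces the default stratified k-fold.
selector = RFECV(LogisticRegression(), cv=GroupKFold(n_splits=3))
selector.fit(X, y, groups=groups)

print("selected features:", selector.support_)
```

This keeps each individual's records on one side of every split, but it gives up stratification, which is the gap this issue is about.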

So: +1!

arc12 on 29 Apr 2019

👍1

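This would definitely be useful. For instance, when working with highly imbalanced time-series medical data, you want to keep patients separate while (approximately) balancing the imbalanced class in each fold.

I have also found that StratifiedKFold takes groups as a parameter but does not group according to them, which should probably be flagged up.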
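A quick way to see this behaviour is to compare the folds produced with and without a `groups` argument. This is a small illustrative sketch with made-up arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
groups = np.array([0, 0, 0, 1, 2, 2, 2, 3])
X = np.zeros((len(y), 1))

cv = StratifiedKFold(n_splits=2)
with_groups = [test.tolist() for _, test in cv.split(X, y, groups)]
without_groups = [test.tolist() for _, test in cv.split(X, y)]

# The groups argument is accepted, but the resulting folds do not change.
same_splits = with_groups == without_groups
print(same_splits)
```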

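jambo6 on 13 May 2019

👍9

Another good use of this feature would be financial data, which is usually very imbalanced. In my case, I have a highly imbalanced dataset with several records for the same entity (just at different points in time). We want to do a GroupKFold to avoid leakage, but also stratify since, due to the high imbalance, we could end up with groups with very few or no positives.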
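The risk can be made concrete with a deliberately extreme toy dataset (the numbers are invented for illustration): when all positives belong to one entity, GroupKFold necessarily produces test folds without any positive.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Six entities with two records each; only entity 0 has positive labels.
groups = np.repeat(np.arange(6), 2)
y = np.zeros(12, dtype=int)
y[groups == 0] = 1
X = np.zeros((12, 1))

positives_per_fold = [
    int(y[test_idx].sum())
    for _, test_idx in GroupKFold(n_splits=3).split(X, y, groups)
]
print("positives in each test fold:", positives_per_fold)
```

No splitter can fix a case this extreme, since the positives cannot be divided without breaking the group; stratification helps when the positives are spread over several groups.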

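guillermo-carrasco on 6 Aug 2019

👍8

See also #14524.

amueller on 8 Aug 2019

👍1

Another use case for stratified GroupShuffleSplit and GroupKFold is biological "repeated measures" designs, where you have multiple samples per subject or other parent biological unit. Many real-world datasets in biology also have class imbalance. Each group of samples has the same class, so it is important to stratify while keeping groups together.

hermidalc on 11 Oct 2019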

👍8

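> Description
> Currently sklearn does not have a stratified group k-fold feature. We can use either stratification or group k-fold, but it would be good to have both.
> I would like to implement it, if we decide to have it.

Hi, I think it would be quite useful for medical ML. Has it been implemented already?

jvel07 on 11 Nov 2019

@amueller Do you think we should implement this, given that people are interested in it?

aditya1702 on 11 Nov 2019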

❤6

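I'm very interested as well. It would be really useful in spectroscopy, where you have several replicate measurements for each of your samples; they really need to stay in the same fold during cross-validation. And if you have several imbalanced classes that you are trying to classify, you also want to use the stratification feature. So I vote for it too! Sorry, I'm not good enough to take part in the development, but those who do can be sure it will be used :-)
Thumbs up to the whole team. Thank you!

fcoppey on 12 Nov 2019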

👍3

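Please look at the issues and PRs referenced in this thread, as work has at least been attempted on StratifiedGroupKFold. I have already done a StratifiedGroupShuffleSplit in #15239, which just needs tests, but I have already used it in my own work quite a bit.

hermidalc on 12 Nov 2019

I think we should implement it, but I still don't know what we actually want. @hermidalc has the restriction that members of the same group must be of the same class. That's not the general case, right?

It would be good if people who are interested could describe their use case and what they really want out of this.

There are #15239, #14524 and #9413, which I remember all having different semantics.

amueller on 12 Nov 2019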

👍1

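@amueller I totally agree with you. I was looking for something today among the different versions available (#15239, #14524 and #9413), but I couldn't really understand whether any of them would fit my need. So here is my use case, if it can help:
I have 1000 samples. Each sample has been measured 3 times with a NIR spectrometer, so each sample has 3 replicates that I want to keep together all along.
These 1000 samples belong to 6 different classes, with very different numbers of samples in each:
Class 1: 400 samples
Class 2: 300 samples
Class 3: 100 samples
Class 4: 100 samples
Class 5: 70 samples
Class 6: 30 samples
I want to build a classifier for each class: class 1 vs. all other classes, then class 2 vs. all other classes, and so on.
To maximise the accuracy of each classifier, it is important that samples of all 6 classes are represented in each fold. Because my classes are not that different from one another, always having all 6 classes represented in each fold really helps to draw an accurate border.

This is why I believe a stratified (my 6 classes always represented in each fold) group (always keeping the 3 replicate measures of each sample together) k-fold is what I am looking for here.
Any opinion?

fcoppey on 12 Nov 2019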

👍3

My use case, and the reason I wrote StratifiedGroupShuffleSplit, is to support repeated measures designs (https://en.wikipedia.org/wiki/Repeated_measures_design). In my use cases, members of the same group must be of the same class.

hermidalc on 12 Nov 2019

👍3

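@fcoppey So for you, the samples within a group always have the same class, right?

@hermidalc I'm not very familiar with the terminology, but from Wikipedia it does not sound like "repeated measures design" means that the same group must be within the same class, as it says: "A crossover trial has a repeated measures design in which each patient is assigned to a sequence of two or more treatments, of which one may be a standard treatment or a placebo."
Relating this to an ML setting, you could either try to predict from measurements whether an individual just received the treatment or the placebo, or you could try to predict the outcome given the treatment.
In either case, the class for the same individual could change, right?

Regardless of naming, it sounds to me like you both have the same use case, whereas I was thinking of a case similar to the one described in the crossover trial. Or maybe something a bit simpler: a patient can get sick over time (or get better), so the outcome for a patient could change.

amueller on 13 Nov 2019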

👍1

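Actually, the Wikipedia article you link to explicitly says "Longitudinal analysis: repeated measure designs allow researchers to monitor how participants change over time, both long- and short-term situations.", so I think this means that changing the class is included.
If there is another word that means the measurement is done under the same conditions, could we use that word?

amueller on 13 Nov 2019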

👍1

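@amueller Yes, you are right. I realised that I miswrote above: I meant that this holds in my use cases of this design, not in the use case in general.

There can be many quite elaborate types of repeated measures designs. In the two types for which I needed StratifiedGroupShuffleSplit, the same-class-within-group restriction holds (longitudinal sampling before and after treatment when predicting treatment response, and multiple pre-treatment samples per subject at different body locations when predicting treatment response).

I needed something right away that worked, so I wanted to put it up for others to use and to get something started in sklearn. Also, if I'm not mistaken, the stratification logic is more complicated to design when within-group class labels can differ.

hermidalc on 13 Nov 2019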

@amueller Yes, always. They are replicates of the same measurement, so that the intra-variability of the device is included in the prediction.

fcoppey on 13 Nov 2019

👍2

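@hermidalc Yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that from the name it's somewhat clear what it does, and we should think about whether these two versions should live in the same class.

It should be quite easy to make StratifiedKFold do this. There are two options: either make sure that each fold contains a similar number of samples, or make sure that each fold contains a similar number of groups.
The second one is trivial to do (just pretend that each group is a single point and pass it to StratifiedKFold). That's what you do in your PR, it looks like.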
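A minimal sketch of that second option, assuming every group contains a single class. The function name `stratified_folds_by_group` is hypothetical and this is not the code from the PR. The data mirrors the spectroscopy use case described earlier in the thread (1000 samples in 6 classes, 3 replicates each).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def stratified_folds_by_group(y, groups, n_splits=5):
    """Yield (train_idx, test_idx), stratifying over groups instead of samples.

    Assumes that all samples of a group share the same class.
    """
    y = np.asarray(y)
    groups = np.asarray(groups)
    # One representative row per group; its label stands for the whole group.
    unique_groups, first_idx = np.unique(groups, return_index=True)
    group_y = y[first_idx]
    skf = StratifiedKFold(n_splits=n_splits)
    for _, test_g in skf.split(np.zeros((len(unique_groups), 1)), group_y):
        test_mask = np.isin(groups, unique_groups[test_g])
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


class_sizes = [400, 300, 100, 100, 70, 30]
sample_class = np.repeat(np.arange(6), class_sizes)
y = np.repeat(sample_class, 3)           # 3 replicates per sample
groups = np.repeat(np.arange(1000), 3)   # the sample id is the group

folds = list(stratified_folds_by_group(y, groups, n_splits=5))
for train_idx, test_idx in folds:
    print(len(test_idx), np.unique(y[test_idx]))
```

Because stratification happens on groups, folds are balanced in the number of groups per class; when group sizes vary a lot, the number of samples per fold can still be uneven.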

GroupKFold, I think, heuristically trades off the two by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach.
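The heuristic mentioned here can be sketched in a few lines. This is an illustration of the idea as described in the comment (largest group first, always into the currently smallest fold), with a hypothetical function name and made-up group sizes; it is not scikit-learn's own code.

```python
def assign_groups_to_folds(group_sizes, n_splits):
    """Greedy balancing: largest group first, always into the lightest fold."""
    fold_sizes = [0] * n_splits
    fold_of_group = {}
    # sorted() is stable, so equally sized groups keep their original order.
    for group, size in sorted(group_sizes.items(), key=lambda item: -item[1]):
        lightest = fold_sizes.index(min(fold_sizes))
        fold_of_group[group] = lightest
        fold_sizes[lightest] += size
    return fold_of_group, fold_sizes


group_sizes = {"a": 5, "b": 4, "c": 3, "d": 3, "e": 2, "f": 1}
fold_of_group, fold_sizes = assign_groups_to_folds(group_sizes, n_splits=3)
print(fold_of_group, fold_sizes)
```

The class labels play no role in this assignment, which is why it does not carry over directly to the stratified case.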

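Should we also add GroupStratifiedKFold in the same PR? Or leave that for later?
The other PRs have slightly different goals. It would be good if someone could write down what the different use cases are (I probably don't have time right now).

amueller on 13 Nov 2019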

👍2

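+1 for separately handling the group constraint where all samples in a group have the same class.

jnothman on 14 Nov 2019

> @hermidalc Yes, this case is much easier. If it's a common need, I'm happy for us to add it. We should just make sure that from the name it's somewhat clear what it does, and we should think about whether these two versions should live in the same class.

I don't totally understand this. A StratifiedGroupShuffleSplit and StratifiedGroupKFold in which the members of each group can be of different classes should have exactly the same split behaviour when the user specifies that all group members are of the same class. Couldn't we just improve the internals later, with the existing behaviour staying the same?

> The second one is trivial to do (just pretend that each group is a single point and pass it to StratifiedKFold). That's what you do in your PR, it looks like.
> GroupKFold, I think, heuristically trades off the two by adding to the smallest fold first. I'm not sure how that would translate to the stratified case, so I'm happy with using your approach.
> Should we also add GroupStratifiedKFold in the same PR? Or leave that for later?
> The other PRs have slightly different goals. It would be good if someone could write down what the different use cases are (I probably don't have time right now).

I will add a StratifiedGroupKFold using the "single sample per group" approach that I used.

hermidalc on 14 Nov 2019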

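+1 for stratifiedGroupKfold. I am trying to detect falls of seniors, taking sensors from a smart watch. Since we don't have much fall data, we run simulations using different watches, which get different classes. I also do augmentation on the data before training on it: from each data point I create 9 points, and this is a group. It is important that a group is not in both train and test, as explained.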
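For augmented data like this, the group ids can be derived directly from the index of the original data point. A small sketch with placeholder sizes, labels and features (none of them taken from the actual project):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

n_original, n_augmented = 20, 9

# Every original recording and its augmented copies share one group id.
groups = np.repeat(np.arange(n_original), n_augmented)
y = np.repeat(np.arange(n_original) % 2, n_augmented)  # placeholder labels
X = np.zeros((len(groups), 3))                          # placeholder features

# Count groups shared between train and test for each split.
overlaps = [
    len(set(groups[train_idx].tolist()) & set(groups[test_idx].tolist()))
    for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups)
]
print(overlaps)
```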

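RachelOwl on 23 Jan 2020

I would like to be able to use StratifiedGroupKFold as well. I am looking at a dataset for predicting financial crises, where the years before, after and during each crisis are their own group. During training and cross-validation, members of each group should not leak between the folds.

limjiayi on 25 Jan 2020

What about the multilabel scenario (Multilabel_stratifiedGroupKfold)?

mohammadmoein on 24 Feb 2020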

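+1 for this. We are analysing user accounts for spam, so we want to group by user, but also stratify because spam has a relatively low incidence. For our use case, any user who spams once is flagged as a spammer in all of the data, so group members will always have the same label.

philip-iv on 4 Mar 2020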

👍1

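Thanks for providing a classic use case to frame the documentation, @philip-iv!

jnothman on 4 Mar 2020

I added a StratifiedGroupKFold implementation to the same PR #15239 as StratifiedGroupShuffleSplit.

As you can see in the PR, the logic for both is much simpler than https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-557802602 because mine only attempts to preserve the percentage of groups for each class (not the percentage of samples), so that I can leverage the existing StratifiedKFold and StratifiedShuffleSplit code by passing it unique group information. Both implementations still produce folds where the samples of each group stay together in the same fold.

That said, I would vote for the more sophisticated method based on https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-557802602.

hermidalc on 18 Mar 2020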

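Here is a full-fledged version of StratifiedGroupKFold and RepeatedStratifiedGroupKFold using the code @mrunibe provided, which I simplified further and changed in a couple of places. These classes also follow the design of the other sklearn CV classes of the same type.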

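from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):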
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


class RepeatedStratifiedGroupKFold(_RepeatedSplits):
    """Repeated Stratified K-Fold cross validator.

    Repeats Stratified K-Fold with non-overlapping groups n times with
    different randomization in each repetition.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    n_repeats : int, default=10
        Number of times cross-validator needs to be repeated.

    random_state : int or RandomState instance, default=None
        Controls the generation of the random states for each repetition.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import RepeatedStratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
    ...                                   random_state=36851234)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
    TRAIN: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
     TEST: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
    TRAIN: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]
     TEST: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
    TRAIN: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
     TEST: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]

    Notes
    -----
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting `random_state`
    to an integer.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        super().__init__(StratifiedGroupKFold, n_splits=n_splits,
                         n_repeats=n_repeats, random_state=random_state)

hermidalc on 18 Mar 2020

👍9

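@hermidalc I'm quite confused about what we've resolved when I look back at this from time to time. (Unfortunately my time is not what it used to be!) Could you give me an idea of what you'd recommend be included in scikit-learn?

jnothman on 19 Apr 2020

> @hermidalc I'm quite confused about what we've resolved when I look back at this from time to time. (Unfortunately my time is not what it used to be!) Could you give me an idea of what you'd recommend be included in scikit-learn?

I wanted to do a better implementation than the one I did in #15239. The implementation in that PR works, but it stratifies on the groups to keep the logic straightforward, which isn't ideal.

So what I did above (thanks to @mrunibe and the kaggle kernel from jakubwasikowski) is a better implementation of StratifiedGroupKFold, which stratifies on the samples. I want to port the same logic to do a better StratifiedGroupShuffleSplit, and then it will be ready. I'll put the new code in #15239 to replace the older implementation.

I apologise for my PRs being unfinished; I'm doing my PhD, so I have little time!

hermidalc on 19 Apr 2020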

👍1

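Thanks to @hermidalc and @mrunibe for providing the implementation. I have also been looking for a StratifiedGroupKFold method to deal with medical data that has strong class imbalance and a widely varying number of samples per subject. GroupKFold on its own creates training data subsets that contain only one class.

s96lam on 10 May 2020

> I want to port the same logic to do a better StratifiedGroupShuffleSplit, and then it will be ready.

We could consider merging StratifiedGroupKFold before StratifiedGroupShuffleSplit is ready.

> I apologise for my PRs being unfinished; I'm doing my PhD, so I have little time!

Let us know if you want support!

jnothman on 10 May 2020

And good luck with your PhD work

jnothman on 10 May 2020

> Here is a full-fledged version of StratifiedGroupKFold and RepeatedStratifiedGroupKFold using the code @mrunibe provided, which I simplified further and changed in a couple of places. These classes also follow the design of the other sklearn CV classes of the same type.

Is it possible to try this? I tried to cut and paste it along with some of the various dependencies, but it never ended. I would love to try this class in my project. I'm just trying to see if there is a way available now to do that.

bfeeny on 5 Jul 2020

@hermidalc Hope your PhD has been a success!
I'm looking forward to seeing this implementation done too, since my PhD work in geoscience needs this stratification feature with group control. I spent some hours implementing this idea of splitting manually in my project, but I gave up on finishing it for the same reason... PhD progress. So I can totally understand how PhD work can torture a person's time. LOL, no pressure. For now I am using GroupShuffleSplit as an alternative.

Cheers

dispink on 9 Jul 2020

@bfeeny @dispink It's very easy to use the two classes I wrote above. Create a file, e.g. split.py, with the following in it. Then in your user code, if the script is in the same directory as split.py, you simply import with from split import StratifiedGroupKFold, RepeatedStratifiedGroupKFold.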

from collections import Counter, defaultdict

import numpy as np

from sklearn.model_selection._split import _BaseKFold, _RepeatedSplits
from sklearn.utils.validation import check_random_state


class StratifiedGroupKFold(_BaseKFold):
    """Stratified K-Folds iterator variant with non-overlapping groups.

    This cross-validation object is a variation of StratifiedKFold that returns
    stratified folds with non-overlapping groups. The folds are made by
    preserving the percentage of samples for each class.

    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).

    The difference between GroupKFold and StratifiedGroupKFold is that
    the former attempts to create balanced folds such that the number of
    distinct groups is approximately the same in each fold, whereas
    StratifiedGroupKFold attempts to create folds which preserve the
    percentage of samples for each class.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    shuffle : bool, default=False
        Whether to shuffle each class's samples before splitting into batches.
        Note that the samples within each split will not be shuffled.

    random_state : int or RandomState instance, default=None
        When `shuffle` is True, `random_state` affects the ordering of the
        indices, which controls the randomness of each fold for each class.
        Otherwise, leave `random_state` as `None`.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import StratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = StratifiedGroupKFold(n_splits=3)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 6 6 7]
           [1 1 1 0 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 8 8]
           [0 0 1 1 1 0 0]
    TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
           [0 0 1 1 1 1 0 0 0 0 0 0]
     TEST: [2 2 6 6 7]
           [1 1 0 0 0]
    TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
           [0 0 1 1 1 1 1 0 0 0 0 0]
     TEST: [4 5 5 5 5]
           [1 0 0 0 0]

    See also
    --------
    StratifiedKFold: Takes class information into account to build folds which
        retain class distributions (for binary or multiclass classification
        tasks).

    GroupKFold: K-fold iterator variant with non-overlapping groups.
    """

    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle,
                         random_state=random_state)

    # Implementation based on this kaggle kernel:
    # https://www.kaggle.com/jakubwasikowski/stratified-group-k-fold-cross-validation
    def _iter_test_indices(self, X, y, groups):
        labels_num = np.max(y) + 1
        y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
        y_distr = Counter()
        for label, group in zip(y, groups):
            y_counts_per_group[group][label] += 1
            y_distr[label] += 1

        y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
        groups_per_fold = defaultdict(set)

        groups_and_y_counts = list(y_counts_per_group.items())
        rng = check_random_state(self.random_state)
        if self.shuffle:
            rng.shuffle(groups_and_y_counts)

        for group, y_counts in sorted(groups_and_y_counts,
                                      key=lambda x: -np.std(x[1])):
            best_fold = None
            min_eval = None
            for i in range(self.n_splits):
                y_counts_per_fold[i] += y_counts
                std_per_label = []
                for label in range(labels_num):
                    std_per_label.append(np.std(
                        [y_counts_per_fold[j][label] / y_distr[label]
                         for j in range(self.n_splits)]))
                y_counts_per_fold[i] -= y_counts
                fold_eval = np.mean(std_per_label)
                if min_eval is None or fold_eval < min_eval:
                    min_eval = fold_eval
                    best_fold = i
            y_counts_per_fold[best_fold] += y_counts
            groups_per_fold[best_fold].add(group)

        for i in range(self.n_splits):
            test_indices = [idx for idx, group in enumerate(groups)
                            if group in groups_per_fold[i]]
            yield test_indices


class RepeatedStratifiedGroupKFold(_RepeatedSplits):
    """Repeated Stratified K-Fold cross validator.

    Repeats Stratified K-Fold with non-overlapping groups n times with
    different randomization in each repetition.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of folds. Must be at least 2.

    n_repeats : int, default=10
        Number of times cross-validator needs to be repeated.

    random_state : int or RandomState instance, default=None
        Controls the generation of the random states for each repetition.
        Pass an int for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import RepeatedStratifiedGroupKFold
    >>> X = np.ones((17, 2))
    >>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
    >>> cv = RepeatedStratifiedGroupKFold(n_splits=2, n_repeats=2,
    ...                                   random_state=36851234)
    >>> for train_idxs, test_idxs in cv.split(X, y, groups):
    ...     print("TRAIN:", groups[train_idxs])
    ...     print("      ", y[train_idxs])
    ...     print(" TEST:", groups[test_idxs])
    ...     print("      ", y[test_idxs])
    TRAIN: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
     TEST: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
    TRAIN: [1 1 3 3 3 6 6 7]
           [0 0 1 1 1 0 0 0]
     TEST: [2 2 4 5 5 5 5 8 8]
           [1 1 1 0 0 0 0 0 0]
    TRAIN: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]
     TEST: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
    TRAIN: [1 1 2 2 5 5 5 5 6 6]
           [0 0 1 1 0 0 0 0 0 0]
     TEST: [3 3 3 4 7 8 8]
           [1 1 1 1 0 0 0]

    Notes
    -----
    Randomized CV splitters may return different results for each call of
    split. You can make the results identical by setting `random_state`
    to an integer.

    See also
    --------
    RepeatedStratifiedKFold: Repeats Stratified K-Fold n times.
    """

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        super().__init__(StratifiedGroupKFold, n_splits=n_splits,
                         n_repeats=n_repeats, random_state=random_state)
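
The core of the listing above is the greedy step that places each group into the fold where it least increases class imbalance. A minimal standalone sketch of that step, using only the standard library, might look like the following. The function name `assign_groups_to_folds` is hypothetical, and the sketch leaves out any shuffling and the generation of test indices, so it is an illustration under those assumptions rather than the proposed scikit-learn implementation.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def assign_groups_to_folds(y, groups, n_splits):
    """Greedily assign whole groups to folds, balancing class fractions.

    Hypothetical helper for illustration; returns one set of groups per fold.
    """
    labels = sorted(set(y))
    y_distr = Counter(y)                      # total samples per class
    counts_per_group = defaultdict(Counter)   # group -> per-class counts
    for label, group in zip(y, groups):
        counts_per_group[group][label] += 1

    counts_per_fold = [Counter() for _ in range(n_splits)]
    groups_per_fold = [set() for _ in range(n_splits)]

    def imbalance(fold, group_counts):
        # Mean over classes of the spread of per-fold class fractions,
        # evaluated as if the group had been added to `fold`.
        spreads = []
        for label in labels:
            fractions = [
                (counts_per_fold[i][label]
                 + (group_counts[label] if i == fold else 0)) / y_distr[label]
                for i in range(n_splits)
            ]
            spreads.append(pstdev(fractions))
        return mean(spreads)

    # Place the most class-skewed groups first, as in the listing above.
    ordered = sorted(
        counts_per_group.items(),
        key=lambda item: -pstdev([item[1][label] for label in labels]),
    )
    for group, group_counts in ordered:
        best = min(range(n_splits), key=lambda f: imbalance(f, group_counts))
        counts_per_fold[best].update(group_counts)
        groups_per_fold[best].add(group)
    return groups_per_fold
```

Because a group is only ever added to one fold, no group can end up on both sides of a split, whatever the class balance turns out to be.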

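hermidalc on July 9, 2020

@hermidalc Thanks for the positive reply!
I quickly adopted it as you described. However, every split I get has data in only the training set or only the test set. As far as I understand the description of the code, there is no parameter to specify the ratio between the training and test sets, right?
I know there is a conflict among stratification, group control, and the dataset ratio... which is why I gave up on continuing... but maybe we can still find a compromise to work around it.

Sincerely

dispink on July 9, 2020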
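
On the ratio question: in a k-fold scheme the test share of each split is not a separate parameter. It follows from `n_splits` and averages 1/n_splits, and because whole groups are kept together, individual folds deviate from that average. A small sketch that computes the shares for the two test folds printed in the first repetition of the docstring example above (the variable names are illustrative):

```python
from collections import Counter

groups = [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]
# Test-fold group assignment copied from the first repetition of the
# docstring example above (n_splits=2).
test_groups_per_fold = [{1, 3, 6, 7}, {2, 4, 5, 8}]

group_sizes = Counter(groups)  # number of samples in each group
test_fractions = [
    sum(group_sizes[g] for g in fold) / len(groups)
    for fold in test_groups_per_fold
]
print([round(f, 3) for f in test_fractions])  # [0.471, 0.529], i.e. 8/17 and 9/17
```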

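To test, I created a split.py and ran this example in ipython, and it works. I have been using these custom CV iterators in my work for a long time and have had no issues with them. By the way, I am using scikit-learn 0.22.2 rather than 0.23.x, so I'm not sure whether that is the cause of the issue. Could you try running the example below and see if you can reproduce it? If you can, then the problem might be something in the y and groups in your work.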

In [6]: import numpy as np 
   ...: from split import StratifiedGroupKFold 
   ...:  
   ...: X = np.ones((17, 2)) 
   ...: y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]) 
   ...: cv = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=777) 
   ...: for train_idxs, test_idxs in cv.split(X, y, groups): 
   ...:     print("TRAIN:", groups[train_idxs]) 
   ...:     print("      ", y[train_idxs]) 
   ...:     print(" TEST:", groups[test_idxs]) 
   ...:     print("      ", y[test_idxs]) 
   ...:
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]

hermidalc on July 9, 2020

There seems to be regular interest in this feature, @hermidalc, and
we could probably find someone to finish it off if you don't mind.

jnothman on July 9, 2020

👍2

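@hermidalc "You have to make sure that each sample in the same group has the same class label." Evidently that is the problem: my samples in the same group do not share the same class. Hmm... that seems to be another branch of development.
Thank you very much anyway.

dispink on July 9, 2020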

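Yes, this has been discussed in various threads here. It is another, more complex use case that would be useful, but many people like me don't currently need it and instead need something that keeps groups together while stratifying on the samples. The requirement of the code above is that all the samples in each group belong to the same class.

Actually, @dispink, I was wrong: this algorithm does not require that all members of a group belong to the same class. For example: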

In [2]: X = np.ones((17, 2)) 
   ...: y =      np.array([0, 2, 1, 1, 2, 0, 0, 1, 2, 1, 1, 1, 0, 2, 0, 1, 0]) 
   ...: groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]) 
   ...: cv = StratifiedGroupKFold(n_splits=3) 
   ...: for train_idxs, test_idxs in cv.split(X, y, groups): 
   ...:     print("TRAIN:", groups[train_idxs]) 
   ...:     print("      ", y[train_idxs]) 
   ...:     print(" TEST:", groups[test_idxs]) 
   ...:     print("      ", y[test_idxs]) 
   ...:
TRAIN: [1 1 2 2 3 3 3 4 8 8]
       [0 2 1 1 2 0 0 1 1 0]
 TEST: [5 5 5 5 6 6 7]
       [2 1 1 1 0 2 0]
TRAIN: [1 1 4 5 5 5 5 6 6 7 8 8]
       [0 2 1 2 1 1 1 0 2 0 1 0]
 TEST: [2 2 3 3 3]
       [1 1 2 0 0]
TRAIN: [2 2 3 3 3 5 5 5 5 6 6 7]
       [1 1 2 0 0 2 1 1 1 0 2 0]
 TEST: [1 1 4 8 8]
       [0 2 1 1 0]

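So I'm not really sure what is happening with your data, because even with your screenshots one cannot actually see what your data layout is or what might be going on. I would suggest first reproducing the example I showed here, to make sure it is not a scikit-learn version issue (since I am using 0.22.2). If you can reproduce it, then test it with your data. Troubleshooting with ~104k samples is hard.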
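
When troubleshooting a splitter like this, it can help to check each split directly: does any group appear on both sides, and how do the class fractions compare? The helper below is a hypothetical sketch for that purpose (the name `check_split` is not part of scikit-learn), applied to the first split of the n_splits=3 run shown earlier in this thread.

```python
from collections import Counter


def check_split(y, groups, train_idxs, test_idxs):
    """Return (leaked_groups, train_class_fractions, test_class_fractions)."""
    train_groups = {groups[i] for i in train_idxs}
    test_groups = {groups[i] for i in test_idxs}
    leaked = train_groups & test_groups  # groups present on both sides

    def fractions(idxs):
        counts = Counter(y[i] for i in idxs)
        total = sum(counts.values())
        return {label: n / total for label, n in counts.items()}

    return leaked, fractions(train_idxs), fractions(test_idxs)


y = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
groups = [1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8]
# First split of the n_splits=3 run: the test groups are 1, 3 and 8.
test_idxs = [i for i, g in enumerate(groups) if g in {1, 3, 8}]
train_idxs = [i for i, g in enumerate(groups) if g not in {1, 3, 8}]

leaked, train_frac, test_frac = check_split(y, groups, train_idxs, test_idxs)
print(leaked)      # an empty set means no group is on both sides
print(train_frac)  # class fractions in the training part
print(test_frac)   # class fractions in the test part
```

An empty side of a split shows up here as an empty fractions dictionary, which makes the symptom described above easy to spot on a small subset.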

hermidalc on July 9, 2020

@hermidalc Thanks for the reply!
I can actually reproduce the result above, so I am now troubleshooting with smaller data.

dispink on July 10, 2020

GustavoGianotti on September 25, 2020

May I take this issue?
It looks like #15239 and https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-600894432 already contain an implementation, so only unit tests remain to be written.

marrodion on October 4, 2020

👍5
