Evalml: Stacked ensemble: use the same CV data splitter as the rest of automl

Created on December 22, 2020 · 4 comments · Source: alteryx/evalml

Problem
The stacked ensemble currently has its own CV settings. IterativeAlgorithm calls the _make_stacked_ensembler util, but the data splitter from the automl search is not threaded through to it.

The data splitter that the stacked ensembler uses by default does not set shuffle=True. This can hurt performance when the input dataset is ordered. It also does not share the automl settings for other parameters such as n_folds, which is not ideal.
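To illustrate why an unshuffled splitter can hurt on ordered data, here is a minimal pure-Python sketch. It is not evalml or scikit-learn code: `kfold_test_indices` is a hypothetical stand-in for a K-fold splitter, and the toy target is assumed to be sorted by class.

```python
import random


def kfold_test_indices(n_samples, n_splits, shuffle=False, seed=0):
    """Hypothetical K-fold stand-in: return one list of test indices per fold."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


# Assumed toy target: 30 samples sorted by class (all 0s, then all 1s).
y = [0] * 15 + [1] * 15


def positive_rate_per_fold(folds):
    """Fraction of positive labels inside each test fold."""
    return [sum(y[i] for i in fold) / len(fold) for fold in folds]


unshuffled = positive_rate_per_fold(kfold_test_indices(len(y), 3))
shuffled = positive_rate_per_fold(kfold_test_indices(len(y), 3, shuffle=True))

print("unshuffled:", unshuffled)  # [0.0, 0.5, 1.0]
print("shuffled:  ", shuffled)    # depends on the seed
```

Without shuffling, the first fold contains only class 0 and the last only class 1, so each fold's class balance is far from the overall 50/50 mix. Shuffling before splitting is one common way to avoid this.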

This discrepancy also prevents us from supporting sklearn 0.24.0. Fixing this issue would let us support that version.

Fix
Let's have automl pass its data splitter to the stacked ensembler via IterativeAlgorithm.
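The proposed fix can be pictured with a small sketch. Every name below (`SketchKFold`, `make_stacked_ensembler_sketch`, `SketchAlgorithm`) is hypothetical and only illustrates the idea of threading a splitter through; none of it is evalml's actual code or API.

```python
class SketchKFold:
    """Hypothetical stand-in for a CV splitter configuration."""

    def __init__(self, n_splits=3, shuffle=False):
        self.n_splits = n_splits
        self.shuffle = shuffle


def make_stacked_ensembler_sketch(input_pipelines, data_splitter=None):
    """Hypothetical helper: prefer the caller's splitter, else fall back to a default."""
    cv = data_splitter if data_splitter is not None else SketchKFold()
    return {"input_pipelines": list(input_pipelines), "cv": cv}


class SketchAlgorithm:
    """Hypothetical search algorithm that receives the splitter from automl."""

    def __init__(self, data_splitter=None):
        self.data_splitter = data_splitter

    def build_ensembler(self, pipelines):
        # Thread the search's splitter through instead of dropping it.
        return make_stacked_ensembler_sketch(
            pipelines, data_splitter=self.data_splitter
        )


automl_splitter = SketchKFold(n_splits=5, shuffle=True)
ensembler = SketchAlgorithm(data_splitter=automl_splitter).build_ensembler(
    ["pipeline_a", "pipeline_b"]
)
print(ensembler["cv"] is automl_splitter)  # True
```

Keeping a fallback default means callers that do not pass a splitter would see no change in behavior.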

bug

Source

dsherry

All 4 comments

@angela97lin does my description make sense? / Was there a reason we chose not to do this when setting up stacking? :)

dsherry on December 22, 2020

@dsherry I think your description makes sense! IIRC, when I was setting up stacking and trying to make it more performant / run faster, I wanted the default to use fewer folds. Hence the line self._default_cv(n_splits=3, random_state=random_state): the default CV is the one specified by scikit-learn, and n_splits is hard-coded to 3.

angela97lin on December 22, 2020

👍1

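I dug into this a bit more and tried to thread the data splitting method used in AutoML into the stacked ensemble components. However, I ran into the following error (after handling the API updates needed to get the TrainingValidationSplit class working):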


estimator = WrappedSKClassifier(pipeline=LogisticRegressionBinaryPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'm...Logistic Regression Classifier':{'penalty': 'l2', 'C': 1.0, 'n_jobs': -1, 'multi_class': 'auto', 'solver': 'lbfgs'},}))
X =            0         1         2         3         4
0   0.965469  0.041236  0.028701  0.659165  0.213375
1   0.043831...978  0.079577
48  0.376344  0.920154  0.314640  0.180086  0.197598
49  0.682661  0.046529  0.400513  0.412513  0.751464
y = array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0])

    @_deprecate_positional_args
    def cross_val_predict(estimator, X, y=None, *, groups=None, cv=None,
                          n_jobs=None, verbose=0, fit_params=None,
                          pre_dispatch='2*n_jobs', method='predict'):
        """Generate cross-validated estimates for each input data point

        The data is split according to the cv parameter. Each sample belongs
        to exactly one test set, and its prediction is computed with an
        estimator fitted on the corresponding training set.

        Passing these predictions into an evaluation metric may not be a valid
        way to measure generalization performance. Results can differ from
        :func:`cross_validate` and :func:`cross_val_score` unless all tests sets
        have equal size and the metric decomposes over samples.

        Read more in the :ref:`User Guide <cross_validation>`.

        Parameters
        ----------
        estimator : estimator object implementing 'fit' and 'predict'
            The object to use to fit the data.

        X : array-like of shape (n_samples, n_features)
            The data to fit. Can be, for example a list, or an array at least 2d.

        y : array-like of shape (n_samples,) or (n_samples, n_outputs), \
                default=None
            The target variable to try to predict in the case of
            supervised learning.

        groups : array-like of shape (n_samples,), default=None
            Group labels for the samples used while splitting the dataset into
            train/test set. Only used in conjunction with a "Group" :term:`cv`
            instance (e.g., :class:`GroupKFold`).

        cv : int, cross-validation generator or an iterable, default=None
            Determines the cross-validation splitting strategy.
            Possible inputs for cv are:

            - None, to use the default 5-fold cross validation,
            - int, to specify the number of folds in a `(Stratified)KFold`,
            - :term:`CV splitter`,
            - An iterable yielding (train, test) splits as arrays of indices.

            For int/None inputs, if the estimator is a classifier and ``y`` is
            either binary or multiclass, :class:`StratifiedKFold` is used. In all
            other cases, :class:`KFold` is used.

            Refer :ref:`User Guide <cross_validation>` for the various
            cross-validation strategies that can be used here.

            .. versionchanged:: 0.22
                ``cv`` default value if None changed from 3-fold to 5-fold.

        n_jobs : int, default=None
            Number of jobs to run in parallel. Training the estimator and
            predicting are parallelized over the cross-validation splits.
            ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
            ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
            for more details.

        verbose : int, default=0
            The verbosity level.

        fit_params : dict, defualt=None
            Parameters to pass to the fit method of the estimator.

        pre_dispatch : int or str, default='2*n_jobs'
            Controls the number of jobs that get dispatched during parallel
            execution. Reducing this number can be useful to avoid an
            explosion of memory consumption when more jobs get dispatched
            than CPUs can process. This parameter can be:

                - None, in which case all the jobs are immediately
                  created and spawned. Use this for lightweight and
                  fast-running jobs, to avoid delays due to on-demand
                  spawning of the jobs

                - An int, giving the exact number of total jobs that are
                  spawned

                - A str, giving an expression as a function of n_jobs,
                  as in '2*n_jobs'

        method : {'predict', 'predict_proba', 'predict_log_proba', \
                  'decision_function'}, default='predict'
            The method to be invoked by `estimator`.

        Returns
        -------
        predictions : ndarray
            This is the result of calling `method`. Shape:

                - When `method` is 'predict' and in special case where `method` is
                  'decision_function' and the target is binary: (n_samples,)
                - When `method` is one of {'predict_proba', 'predict_log_proba',
                  'decision_function'} (unless special case above):
                  (n_samples, n_classes)
                - If `estimator` is :term:`multioutput`, an extra dimension
                  'n_outputs' is added to the end of each shape above.

        See Also
        --------
        cross_val_score : Calculate score for each CV split.
        cross_validate : Calculate one or more scores and timings for each CV
            split.

        Notes
        -----
        In the case that one or more classes are absent in a training portion, a
        default score needs to be assigned to all instances for that class if
        ``method`` produces columns per class, as in {'decision_function',
        'predict_proba', 'predict_log_proba'}.  For ``predict_proba`` this value is
        0.  In order to ensure finite output, we approximate negative infinity by
        the minimum finite float value for the dtype in other cases.

        Examples
        --------
        >>> from sklearn import datasets, linear_model
        >>> from sklearn.model_selection import cross_val_predict
        >>> diabetes = datasets.load_diabetes()
        >>> X = diabetes.data[:150]
        >>> y = diabetes.target[:150]
        >>> lasso = linear_model.Lasso()
        >>> y_pred = cross_val_predict(lasso, X, y, cv=3)
        """
        X, y, groups = indexable(X, y, groups)

        cv = check_cv(cv, y, classifier=is_classifier(estimator))
        splits = list(cv.split(X, y, groups))

        test_indices = np.concatenate([test for _, test in splits])
        if not _check_is_permutation(test_indices, _num_samples(X)):
>           raise ValueError('cross_val_predict only works for partitions')
E           ValueError: cross_val_predict only works for partitions

../venv/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:845: ValueError

This is the error raised when trying to call the following:

clf = StackedEnsembleClassifier(input_pipelines=[logistic_regression_binary_pipeline_class(parameters={})], cv=TrainingValidationSplit())
clf.fit(X, y)

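The reason is that scikit-learn validates that the cv passed in is actually a cross-validation method, meaning its test sets together cover every sample exactly once. A single split such as TrainingValidationSplit does not satisfy this: because there is only one split, some of the data is never part of the test data.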
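The check that fails in the traceback above can be mimicked in a few lines of plain Python. This is a simplified sketch of the idea, not scikit-learn's implementation; `is_partition` and the toy index lists are hypothetical.

```python
def is_partition(test_index_sets, n_samples):
    """True when every sample index appears in exactly one test set."""
    flat = [i for test in test_index_sets for i in test]
    return sorted(flat) == list(range(n_samples))


n_samples = 10

# Three folds: every sample is tested exactly once.
kfold_tests = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# A single train/validation split: only the last 3 samples are ever tested.
single_split_tests = [[7, 8, 9]]

print(is_partition(kfold_tests, n_samples))         # True
print(is_partition(single_split_tests, n_samples))  # False
```

Since cross_val_predict needs one out-of-fold prediction per sample, a splitter whose test sets leave samples out cannot produce a full set of predictions.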

So for now, I think the best plan is to support scikit-learn 0.24 and set shuffle=True on the default cv. We can revisit this later if we think it would be useful. Thoughts, @dsherry?

angela97lin on December 28, 2020

#1593 should no longer be blocked by this issue, since what was needed for 0.24.0 should have been addressed in #1613.

angela97lin on January 20, 2021

👍1
