Evalml: Stacked ensemble performing poorly

Created on April 6, 2021  ·  11 comments  ·  Source: alteryx/evalml

Steps to reproduce:

  1. Load the Happiness data set into evalml
  2. Run long enough to include ensembling
  3. The baseline regressor shows up as higher ranked than the stacked regressor.
    Happiness Data Full Set.csv.zip
bug performance

All 11 comments

@dancuarini I tried to reproduce this locally but couldn't. It could be because of additional steps taken before running AutoMLSearch (e.g. data split size, dropping columns). Let's talk about the problem configuration!

Here's what I tried to run locally:

import evalml
from evalml.automl import AutoMLSearch
import pandas as pd
import woodwork as ww
from evalml.automl.callbacks import raise_error_callback

happiness_data_set = pd.read_csv("Happiness Data Full Set.csv")
y = happiness_data_set['Happiness']
X = happiness_data_set.drop(['Happiness'], axis=1)
# display(X.head())

X = ww.DataTable(X)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.2, random_seed=0)
# print(X.types)

automl = AutoMLSearch(X, y, problem_type="regression", objective="MAE", error_callback=raise_error_callback, max_batches=20, ensembling=True)
automl.search()

This results in the following rankings:

image

Current progress: discussed the issue locally with @dancuarini, and will keep in touch with @Cmancuso.

@angela97lin Wait,

Thanks for sharing the repro :)

@dsherry It's a bit suspicious that the stacked ensembler isn't at the top

@angela97lin Ah yes, got it! I sent you some notes.

I think the evidence that our ensembles aren't always near the top is a problem.

Dug into this a bit more. I think there are a few potential reasons why the ensembler doesn't perform well on this data set:

  1. The data set is really small, and our current data splitting strategy means the ensembler is trained and validated on a very small subset of the data. Right now, to train a stacked ensembler, we split off some data (identified by ensembling_indices) for the ensembler to train on. This is meant to prevent overfitting the ensembler by training the metalearner on the same data the input pipelines were already trained on. We then do one CV split, which further splits the data from ensembling_indices. For this 128-row data set, that means we train and validate on 17 and 8 rows, respectively. I filed #2144 to discuss whether we should do this additional CV split.

  2. Our ensemble is currently constructed by taking the best pipeline of each model family found so far and using those as the input pipelines for the stacked ensemble. However, if some of the input pipelines perform very poorly, the stacked ensembler may not perform as well as a high-performing individual pipeline.

For example, here is the final rankings table:
image

We notice that the stacked ensemble performs right smack in the middle. If we simplify and say that the stacked ensemble averages the predictions of its input pipelines, this makes sense. To test my hypothesis, I decided to use only the model families that performed better than the stacked ensemble, rather than all of the model families, and found that the resulting score is much better than any individual pipeline. This leads me to believe that the poorly performing individual pipelines dragged down the performance of the stacked ensembler.
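The "averaging" simplification above can be illustrated with a toy example. This is only a sketch on made-up numbers, not EvalML's stacking logic: the helper names `mae` and `average_predictions` and the three fake prediction sets are hypothetical, and the poor model is given a constant bias so the effect is easy to see.

```python
def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_predictions(prediction_sets):
    """Row-wise mean of several models' predictions (a stand-in for stacking)."""
    return [sum(row) / len(row) for row in zip(*prediction_sets)]


# Made-up targets and predictions: two good models and one badly biased one.
y_true = [1.0, 2.0, 3.0, 4.0]
good_a = [t + 0.1 for t in y_true]
good_b = [t - 0.1 for t in y_true]
poor = [t + 3.0 for t in y_true]

scores = {
    "good_a": mae(y_true, good_a),
    "good_b": mae(y_true, good_b),
    "poor": mae(y_true, poor),
    "all inputs averaged": mae(y_true, average_predictions([good_a, good_b, poor])),
    "trimmed inputs averaged": mae(y_true, average_predictions([good_a, good_b])),
}

# Lower MAE is better; print the toy "rankings".
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.3f}")
```

With all three inputs, the averaged score lands between the good models and the poor one; with only the two good inputs, it beats every individual model. A learned metalearner can weight inputs unequally, so the real effect may be weaker than in this sketch.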

Here is the repro code for this:

From above:

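import evalml
import pandas as pd
import woodwork as ww
from evalml.automl import AutoMLSearch
from evalml.automl.callbacks import raise_error_callback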
happiness_data_set = pd.read_csv("Happiness Data Full Set.csv")
y = happiness_data_set['Happiness']
X = happiness_data_set.drop(['Happiness'], axis=1)

X = ww.DataTable(X)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.25, random_seed=0)

automl = AutoMLSearch(X, y, problem_type="regression", objective="MAE", error_callback=raise_error_callback, max_batches=10, ensembling=True)
automl.search()

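from evalml.automl.engine import train_and_score_pipeline
from evalml.automl.engine.engine_base import JobLogger
# Private helper used below; this import path is an assumption about the EvalML version in use.
from evalml.pipelines.utils import _make_stacked_ensemble_pipeline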

# Get the pipelines fed into the ensemble but only use the ones better than the stacked ensemble
input_pipelines = []
input_info = automl._automl_algorithm._best_pipeline_info
from evalml.model_family import ModelFamily

trimmed = dict()
trimmed.update({ModelFamily.RANDOM_FOREST: input_info[ModelFamily.RANDOM_FOREST]})
trimmed.update({ModelFamily.XGBOOST: input_info[ModelFamily.XGBOOST]})
trimmed.update({ModelFamily.EXTRA_TREES: input_info[ModelFamily.EXTRA_TREES]})

for pipeline_dict in trimmed.values():
    pipeline_class = pipeline_dict['pipeline_class']
    pipeline_params = pipeline_dict['parameters']
    input_pipelines.append(pipeline_class(parameters=automl._automl_algorithm._transform_parameters(pipeline_class, pipeline_params),
                                                      random_seed=automl._automl_algorithm.random_seed))
ensemble_pipeline = _make_stacked_ensemble_pipeline(input_pipelines, "regression")
X_train = X.iloc[automl.ensembling_indices]
y_train = ww.DataColumn(y.iloc[automl.ensembling_indices])
train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())

Using just these three model families, we get an MAE score of ~0.22, which is much better than any individual pipeline.

#output of train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())
{'scores': {'cv_data': [{'all_objective_scores': OrderedDict([('MAE',
                  0.22281276417465426),
                 ('ExpVariance', 0.9578811127332543),
                 ('MaxError', 0.3858477236606914),
                 ('MedianAE', 0.2790362808260225),
                 ('MSE', 0.0642654425375983),
                 ('R2', 0.9152119239698017),
                 ('Root Mean Squared Error', 0.2535062968401343),
                 ('# Training', 17),
                 ('# Validation', 9)]),
    'mean_cv_score': 0.22281276417465426,
    'binary_classification_threshold': None}],
  'training_time': 9.944366216659546,
  'cv_scores': 0    0.222813
  dtype: float64,
  'cv_score_mean': 0.22281276417465426},
 'pipeline': TemplatedPipeline(parameters={'Stacked Ensemble Regressor':{'input_pipelines': [GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Random Forest Regressor':{'n_estimators': 184, 'max_depth': 25, 'n_jobs': -1},}), GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'XGBoost Regressor':{'eta': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100},}), GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Extra Trees Regressor':{'n_estimators': 100, 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1},})], 'final_estimator': None, 'cv': None, 'n_jobs': -1},}),

This makes me wonder whether we should rethink which input pipelines we feed to the stacked ensembler.

  3. The metalearner we use (LinearRegressor) is not the best. I tested this on a stacking_test branch I created, where I updated the default metalearner to RidgeCV (the scikit-learn default, but not available in EvalML), and the ensembler performs much better.
    image
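As a rough illustration of the metalearner swap described above, the sketch below uses scikit-learn's StackingRegressor directly rather than EvalML's wrapper. The synthetic data and the two base estimators are assumptions chosen for brevity; they are not the pipelines from this issue. Leaving `final_estimator` unset makes scikit-learn fall back to RidgeCV, while passing LinearRegression gives an unregularized metalearner.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem (illustrative only, not the Happiness data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)

# Stand-ins for the input pipelines.
base_estimators = [
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]

# final_estimator is left unset, so scikit-learn uses its default metalearner.
default_stack = StackingRegressor(estimators=base_estimators, cv=3)
default_stack.fit(X, y)

# Explicit unregularized metalearner, for comparison.
linear_stack = StackingRegressor(
    estimators=base_estimators, final_estimator=LinearRegression(), cv=3
)
linear_stack.fit(X, y)

print(type(default_stack.final_estimator_).__name__)
print(type(linear_stack.final_estimator_).__name__)
```

Comparing the two fitted stacks on held-out data is left out here; which metalearner wins depends on the data set, as the performance tests discussed in this thread suggest.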

Next steps after discussing with @dsherry:

Try #1 and #3 (using Elastic Net) on other data sets and run performance tests to see whether we get better performance overall.

@angela97lin Your point about splitting on small data sets is on target. Eventually we need to handle small data sets really differently from larger ones, for example by using only high-fold-count xval on the entire data set, including LOOCV, and by constructing the folds differently for training the ensemble metalearner.

I also agree that the metalearner needs to use strong regularization. I used Elastic Net in H2O-3's StackedEnsemble, and I remember only one time when the ensemble came in second on the leaderboard. Every other time I tested, it was first. Regularization should never allow poor models to drag down the ensemble's performance.

And this was even while feeding the entire leaderboard of 50 models into the metalearner. :-)

Posting some additional updates on this.

Tested locally using all of the regression data sets. The results can be found here, or just the charts here.

From this:

  • Agreed with @rpeck! We should update the metalearner to use strong regularization. ElasticNetCV performed better than our LinearRegressor on many data sets. This is tracked in https://github.com/alteryx/evalml/issues/1739.
  • @dsherry and I revisited our data splitting strategy. Right now, we split off data for ensembling. However, this rests on the assumption that we want the metalearner to be trained on these ensembling indices. With the scikit-learn implementation, when we train the StackedEnsembler on this ensembling-indices split, we end up training both the input pipelines and the metalearner on this small data set. This could be why we're not performing well. The parameters of the input pipelines come from tuning on the other data, but the pipelines themselves are not fitted. In the long run, rolling our own implementation would let us pass trained pipelines to the ensembler, in which case we would get the behavior we want. For now, that is not the case.

Next step: test this hypothesis by using the ensembler manually. Manually train the input pipelines on 80% of the data, generate cross-validated predictions on the data set aside for ensembling, and train the metalearner on those predictions.
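A minimal sketch of that manual procedure, under stated assumptions: plain scikit-learn regressors stand in for the EvalML input pipelines, the data is synthetic, ElasticNetCV is used as the metalearner, and the helpers `stacked_features` and `ensemble_predict` are hypothetical names. For simplicity it uses a single held-out prediction pass on the ensembling split rather than full cross-validation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.2, size=200)

# 80% for the input models, 20% set aside for ensembling.
X_base, X_ens, y_base, y_ens = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stand-ins for the input pipelines: fitted once on the 80% and never refit.
input_models = [
    RandomForestRegressor(n_estimators=50, random_state=0).fit(X_base, y_base),
    DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_base, y_base),
]


def stacked_features(models, X_new):
    """One column of predictions per input model."""
    return np.column_stack([model.predict(X_new) for model in models])


# The metalearner only sees predictions made on rows the input models never saw.
meta_features = stacked_features(input_models, X_ens)
metalearner = ElasticNetCV(cv=3, random_state=0).fit(meta_features, y_ens)


def ensemble_predict(X_new):
    """Predict with the already-fitted input models, then the metalearner."""
    return metalearner.predict(stacked_features(input_models, X_new))


print(meta_features.shape)
print(ensemble_predict(X[:5]))
```

The key difference from fitting a scikit-learn stacking estimator on the ensembling split alone is that the input models here keep what they learned from the larger 80% portion.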

The experiment results look good: https://alteryx.quip.com/4hEyAaTBZDap/Ensembling-Performance-Using-More-Data

Next steps:

After investigating, I believe the problem lies not in how the ensemble performs but in how we report its performance. Currently, we do a separate ensembling split that is 20% of the data, then do another train-validation split, and report the ensemble's score on the validation data. This means that in some cases the ensemble score is calculated using a very small number of rows (as with the happiness data set above).
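The row counts involved can be worked out with a short back-of-the-envelope sketch. The 128 rows and the 20% ensembling split come from this thread; the 3-fold split, the rounding up of the ensembling split, and the helper `kfold_sizes` are assumptions made for illustration, not confirmed EvalML behavior.

```python
import math


def kfold_sizes(n_rows, n_splits):
    """Validation-fold sizes for a plain k-fold split (earlier folds take the remainder)."""
    base, extra = divmod(n_rows, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_rows = 128               # size of the happiness data set, per this thread
ensembling_fraction = 0.2  # share of rows reserved for the ensembler, per this thread
n_splits = 3               # assumption: a 3-fold split of which one fold is scored

# Assumption: the reserved split is rounded up to a whole number of rows.
ensembling_rows = math.ceil(n_rows * ensembling_fraction)
validation_rows = kfold_sizes(ensembling_rows, n_splits)[0]
training_rows = ensembling_rows - validation_rows

print(ensembling_rows, training_rows, validation_rows)  # 26 17 9
```

Under these assumptions the ensemble is fitted on 17 rows and scored on 9, which lines up with the '# Training' and '# Validation' values in the logged output earlier in the thread.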

By removing the ensembling-indices split and using our old way of computing the CV training score for the ensemble (give it all of the data, then train and validate on one fold), we see that the ensemble ranks higher in almost all cases and comes in first in many more of them. Meanwhile, the validation score is the same or slightly better.

Since we don't do any hyperparameter tuning, the input pipelines aren't trained, and the ensemble only takes the input pipelines' predictions as its input, overfitting isn't an issue. We can revisit rolling our own ensemble implementation and update the splitting strategy then, but for now we can see improvements just by changing the data splitting strategy while still using scikit-learn's implementation.

Note that this will increase fit time when ensembling is enabled: all pipelines see more data (no reserved ensembling indices), and the ensemble is trained on more data. I think this is fine.

Results table: https://alteryx.quip.com/jI2mArnWZfTU/Ensembling-vs-Best-Pipeline-Validation-Scores#MKWACADlCDt
