Evalml: BalancedClassificationDataCVSplit produces different splits each time it is called

Created on March 16, 2021  ·  3 comments  ·  Source: alteryx/evalml

Repro

import joblib
from evalml.demos import load_fraud
from evalml.preprocessing.data_splitters import BalancedClassificationDataCVSplit

splitter = BalancedClassificationDataCVSplit(n_splits=3, random_seed=0, shuffle=True)

X, y = load_fraud(5000)
X = X.to_dataframe()
y = y.to_series().astype("int")

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('75f1b95d7ce307ac6c793055330969aa', '8c89fe1a592c50a700b6d5cbb02dba8b')
('f8c849bbfbed37c13f66c5c742e237cb', '9c4879fb550fded8be9ac03e95a1bf95')
('cdc21f0d6bbf45459c9695258f7f04dc', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('bf462b82af243c552ac48acad2dfd748', '8c89fe1a592c50a700b6d5cbb02dba8b')
('b8341b536c63c7957c099b05e315f49c', '9c4879fb550fded8be9ac03e95a1bf95')
('780e74673b601790037fc0b17dde56fe', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in splitter.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('385f6c538568ad3a33cf84f61d94144c', '8c89fe1a592c50a700b6d5cbb02dba8b')
('8db65d0a3bdf87ae0f135b9766a260dd', '9c4879fb550fded8be9ac03e95a1bf95')
('2a7293fc1308b8a572091d7c76d20205', '5b575765bbe176e732b8eb4dc1bf2822')

This differs from the behavior of the sklearn splitters, which return the same splits on every call:

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

for train, test in kfold.split(X, y):
    print((joblib.hash(train), joblib.hash(test)))

# Output
('6c30ee6a11803927024354405389506a', '8c89fe1a592c50a700b6d5cbb02dba8b')
('df0a70e2e6ca783f12461e8c82a26ad4', '9c4879fb550fded8be9ac03e95a1bf95')
('2898e4b3d3621b436641016499f4aafb', '5b575765bbe176e732b8eb4dc1bf2822')

I think this is problematic for two reasons:

  1. BalancedClassificationDataCVSplit is the default splitter in automl, so pipelines end up being evaluated on different splits.
  2. split modifies the state of the data splitter, so results differ between the sequential and parallel engines.
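
To make the mechanism concrete, here is a minimal sketch of how a splitter can end up returning different train indices on each call. It is not evalml's implementation: `ToySplitter`, its `reseed_each_call` flag, and the random undersampling of the train fold are all assumptions, chosen to mimic the repro above, where the test hashes stay fixed and only the train hashes change.

```python
import random


class ToySplitter:
    """Hypothetical splitter: fixed test folds, randomly undersampled train folds."""

    def __init__(self, n_splits=3, random_seed=0, reseed_each_call=False):
        self.n_splits = n_splits
        self.random_seed = random_seed
        self.reseed_each_call = reseed_each_call
        # RNG created once; it keeps advancing unless split() rebuilds it.
        self._rng = random.Random(random_seed)

    def split(self, n_samples):
        if self.reseed_each_call:
            rng = random.Random(self.random_seed)  # fresh RNG, same seed, every call
        else:
            rng = self._rng  # shared RNG whose state carries over between calls
        splits = []
        for k in range(self.n_splits):
            # Test fold is deterministic: every n_splits-th index starting at k.
            test = list(range(k, n_samples, self.n_splits))
            rest = [i for i in range(n_samples) if i % self.n_splits != k]
            # Train fold is a random half of the remaining indices (assumed sampling step).
            train = sorted(rng.sample(rest, len(rest) // 2))
            splits.append((train, test))
        return splits


stateful = ToySplitter(reseed_each_call=False)
stateless = ToySplitter(reseed_each_call=True)

stateful_repeatable = stateful.split(60) == stateful.split(60)
stateless_repeatable = stateless.split(60) == stateless.split(60)
print(stateful_repeatable, stateless_repeatable)
```

With the shared RNG, the second call draws from wherever the first call left off, so the train folds differ. Rebuilding the RNG from the seed inside `split` is one way to make repeated calls agree.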

All 3 comments

Thanks for pointing this out.

Personally, this behavior doesn't bother me. As long as initializing with a given seed always produces the same sequence of outputs from that point on, we're fine. I'd be concerned if we weren't respecting the random seed, but that's not what this issue is tracking.

My recommendation: do nothing. Closing as is.

@freddyaboulton if you disagree about this behavior, let's talk it over.

@dsherry I think this is worth changing for two reasons:

  1. It introduces variance into automl search, because different pipelines are evaluated on different data. The rankings table becomes somewhat misleading because the scores are not computed on the same data.
  2. It's bad for parallel automl search.

    Let me elaborate on 2. With the current behavior, the sequential engine is expected to modify the state of the data splitter throughout the search. In parallel evalml, we pickle the data splitter and send it to the workers to compute the split. Since each worker gets a copy of the splitter, it does not modify the state of the original data splitter.

This introduces a difference in behavior between the sequential and parallel engines, because the splits no longer match depending on the order in which pipelines are evaluated! It means the same pipeline/parameter combo can get different results in the sequential engine and the parallel engine, which I think is undesirable.
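
A small sketch of that divergence, under simplifying assumptions: the splitter is reduced to a bare `random.Random` object, `sample_train` is a hypothetical stand-in for one `split` call, and the pipeline names are made up. Only the pickle round trip mirrors what is described above.

```python
import pickle
import random


def sample_train(rng, n_samples=60):
    """Hypothetical stand-in for one split() call: draw a random train subset."""
    return sorted(rng.sample(range(n_samples), n_samples // 2))


pipelines = ["pipeline_a", "pipeline_b", "pipeline_c"]

# Sequential engine: every pipeline uses the same splitter object,
# so each call advances the shared RNG state.
shared_rng = random.Random(0)
sequential = {name: sample_train(shared_rng) for name in pipelines}

# Parallel engine: each worker unpickles its own copy of the untouched
# original, so every pipeline starts from the same RNG state.
payload = pickle.dumps(random.Random(0))
parallel = {name: sample_train(pickle.loads(payload)) for name in pipelines}

for name in pipelines:
    print(name, sequential[name] == parallel[name])
```

Only the first pipeline sees the same train subset in both modes. Every later pipeline gets a different subset in the sequential run, while the parallel run hands all pipelines the identical one.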

In my opinion, point 1 alone is reason enough to fix this, since all pipelines must be evaluated on the same data for the comparison to be meaningful. But as we move toward parallel evalml, I also think it's important to make sure that modifying global state is not part of our expected behavior.

Plan moving forward:

  1. Fix this issue by modifying BalancedClassificationDataCVSplit.
  2. Long-term, we'd like to write a test that verifies automl search does not give different splits to different pipelines.
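
A rough sketch of what such a check could look like, assuming only that the splitter exposes `split(X, y)` yielding `(train, test)` index pairs. `split_fingerprint`, `splits_are_repeatable`, and `ToySplitter` are hypothetical names for illustration, not evalml APIs. Checking that repeated calls on one splitter agree is used here as a proxy for "every pipeline sees the same splits".

```python
import hashlib
import random


def split_fingerprint(splitter, X, y):
    """Hash every (train, test) pair so two split() calls can be compared."""
    digest = hashlib.sha256()
    for train, test in splitter.split(X, y):
        digest.update(repr((list(train), list(test))).encode())
    return digest.hexdigest()


def splits_are_repeatable(splitter, X, y, n_calls=3):
    """Return True if repeated split() calls on one object give identical splits."""
    fingerprints = {split_fingerprint(splitter, X, y) for _ in range(n_calls)}
    return len(fingerprints) == 1


class ToySplitter:
    """Hypothetical splitter used only to exercise the check."""

    def __init__(self, seed=0, reseed=False):
        self.seed = seed
        self.reseed = reseed
        self._rng = random.Random(seed)

    def split(self, X, y=None):
        # Either rebuild the RNG from the seed or keep using the shared one.
        rng = random.Random(self.seed) if self.reseed else self._rng
        indices = list(range(len(X)))
        rng.shuffle(indices)
        half = len(indices) // 2
        yield sorted(indices[:half]), sorted(indices[half:])


X = list(range(40))
y = [i % 2 for i in X]
drifting_ok = splits_are_repeatable(ToySplitter(reseed=False), X, y)
reseeded_ok = splits_are_repeatable(ToySplitter(reseed=True), X, y)
print(drifting_ok, reseeded_ok)
```

In a real test suite the same helper could be pointed at the splitter automl is configured with, and the boolean turned into an assertion.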

Thanks for the discussion!
