XGBoost: Distributed Computing with Dask

Created on 13 Feb 2017  ·  46 comments  ·  Source: dmlc/xgboost

Hi, I'm an author of Dask, a library for parallel and distributed computing in Python. I'm curious whether there is interest within this community in collaborating on distributing XGBoost on Dask, either for parallel training or for ETL.

There are probably two components of Dask that are relevant to this project:

  1. A general system for parallel and distributed computing, built on arbitrary dynamic task scheduling. The relevant APIs here are probably dask.delayed and concurrent.futures.
  2. A parallel and distributed subset of the Pandas API, dask.dataframe, useful for feature engineering and data pre-processing. It doesn't implement the entire Pandas API, but it comes decently close.
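
For readers unfamiliar with the futures style mentioned in item 1, here is a minimal sketch that uses only the standard library's concurrent.futures. The function `square` is a made-up placeholder. dask.distributed's Client offers a similar submit/result pattern, but nothing below uses Dask itself.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Placeholder task; any picklable function would do in a real cluster."""
    return x * x


# Submit independent tasks, then gather their results in submission order.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(square, i) for i in range(5)]
    results = [future.result() for future in futures]
```

Each `submit` call returns immediately with a future, and `result()` blocks until that task has finished.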

Is there interest in collaborating here?

Most helpful comment

Notebook: https://gist.github.com/19c89d78e34437e061876a9872f4d2df
Short screencast (6 minutes): https://youtu.be/Cc4E-PdDSro

Critical feedback is very welcome. Again, please pardon my ignorance in this field.

All 46 comments

@mrocklin I thought Dask had integration with sklearn. Have you checked whether our sklearn wrapper works with that?

Meaningfully integrating with a distributed system typically has to be done at the per-algorithm level rather than at the library level. There are some ways in which SKLearn and Dask can help each other, but they are not particularly deep.

Dask dataframe would be a good start. In our codebase we have a check for pandas dataframes. That might be where dask dataframe fits in as a start.

So what happens if someone arrives with a multi-terabyte dask dataframe? Do you just convert it to Pandas and proceed? Or is there a way to parallelize XGBoost intelligently across a cluster, pointing at the various pandas dataframes that make up a dask dataframe?

Can users specify a batch size? I would imagine users could benefit via partial_fit.

cc @tqchen, who is more familiar with the distributed part of the code.

The distributed version of xgboost can be hooked into a distributed job launcher; ideally it gets a data partition fed into xgboost and then continues.

@mrocklin I think the most relevant part is the xgboost-spark and xgboost-flink modules, which embed xgboost into the mapPartition function of spark/flink. I guess there would be something similar in Dask.

The requirement from xgboost's side is that XGBoost handles the inter-process connection through rabit, and it needs a tracker (which connects each job) to be started from the client side.

See the relevant code at https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala#L112

Rabit is designed to be embedded into other distributed systems, so I think the adaptation should not be too hard to do on the Python side.

Launching other distributed systems from Dask is typically pretty doable. How do you move data from the hosting distributed system (spark/flink/dask) to xgboost? Or is this for distributed training on small data?

More concretely, I expect to build a system as follows:

  • I start up Rabit servers on all of my dask workers. Dask gives these Rabit servers enough information to find each other.
  • I create some local XGBoost state on every worker that represents the currently training model.
  • I iteratively feed this per-worker object pandas dataframes or numpy arrays.
  • I listen for some signal from XGBoost that tells me to stop.

Does this match your expectations? Is it easy to point me to the relevant Python API?

Yes, see the relevant information at https://github.com/dmlc/xgboost/blob/master/tests/distributed/ for the Python API.

What you will need to do additionally is start a rabit tracker on the driver side (likely the place that drives dask). This is done in the dmlc-submit script in https://github.com/dmlc/dmlc-core

OK, filling out my outline from before:

Before running any XGBoost code we set up a Rabit network.

On the driver/scheduler node we start a rabit tracker:

envs = {'DMLC_NUM_WORKER' : nworker,
        'DMLC_NUM_SERVER' : nserver}

rabit = RabitTracker(hostIP=ip_address, nslave=num_workers)
envs.update(rabit.slave_envs())
rabit.start(args.num_workers)  # manages connections in background thread

We might also go through a similar process to start a PSTracker. Should this be on the same centralized machine, or should it be somewhere else within the network? Should there be a few of these? Should this be user configurable?

Eventually I have my tracker (and pstrackers?) join the rabit network and block.

rabit.join()  # join network

On the worker nodes I need to dump these environment variables (which I'll move through normal dask channels) into the local environment. Then just calling xgboost.rabit.init() should suffice.

import os
os.environ.update(envs)
xgboost.rabit.init()
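
One detail worth checking in the snippet above: os.environ only accepts strings, so integer entries such as a worker count would raise a TypeError in os.environ.update. Here is a minimal sketch of a conversion step. The helper name `stringify_env` and the example values are hypothetical, and the xgboost.rabit.init() call is left out so the block runs without XGBoost installed.

```python
import os


def stringify_env(envs):
    """Return a copy of envs with every key and value converted to str."""
    return {str(key): str(value) for key, value in envs.items()}


# Hypothetical values standing in for what the tracker would provide.
envs = {"DMLC_NUM_WORKER": 4, "DMLC_NUM_SERVER": 0}
safe_envs = stringify_env(envs)

# os.environ.update(envs) would fail here because 4 and 0 are ints.
os.environ.update(safe_envs)
```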

Looking at the Rabit code, it seems that environment variables are the only way to provide this information. Can you verify this? Is there a way to provide tracker host/port information as direct inputs?

Training

Then I convert my numpy arrays / pandas dataframes / scipy sparse arrays into DMatrix objects. This seems relatively straightforward. However, I'm likely to have several batches of data per worker. Is there a clean way to call train several times as more data becomes available? I'm concerned about the comments on these lines:

# Run training, all the features in training API is available.
# Currently, this script only support calling train once for fault recovery purpose.
bst = xgb.train(param, dtrain, num_round, watchlist, early_stopping_rounds=2)

Do we need to wait for all data to arrive before starting training?

Example dataset / problem

Assuming that I have everything above correct, is there a standard distributed training example that people use for demonstration?

There is no need to start a pstracker.

  • The tracker only needs to be started in one place, likely on the scheduler (driver). It does no data-heavy work and only serves to connect the jobs.
  • The env args can be passed as kwargs in rabit.init.
  • Since tree boosting is a batch algorithm, we need to wait for all data to be ingested before starting training.

    • Note, however, that each worker only needs to take a shard (a subset of rows) of the data.

    • Ideally we should use the data iterator interface to pass data into DMatrix in mini-batches, so the entire dataset does not have to sit in memory.

    • This is done via https://github.com/dmlc/xgboost/blob/master/include/xgboost/c_api.h#L117 , which does not yet have a Python wrapper.

    • For a first solution, I would recommend passing the data directly as arrays.

I had some time to play with this this morning. Results here: https://github.com/mrocklin/dask-xgboost

So far it only handles distributed learning of a single in-memory dataset. Some questions came up:

  1. What is the best way to serialize and pass around DMatrix objects?
  2. What is the best way to serialize and return the Booster result?
  3. How do the environment variables listed above map to arguments to rabit.init? What exactly is the expected form of the inputs to rabit.init? Passing the result of slave_envs() to rabit.init obviously won't work because it expects a list. Should we convert each key name into --key, perhaps dropping the DMLC prefix and converting to lower case?
  4. Is there a good way to test correctness? How do we compare two Booster objects? Should we expect distributed training to produce exactly the same result as sequential training?

  • You do not normally serialize DMatrix; it is more like a training-time data holder. I assume the data gets passed and shared by dask (array/dataframe) and is then passed into xgboost.

    • There can be better ways to pass data than directly via in-memory arrays, by exposing a data iterator to xgboost.

  • You can pickle the Booster, as long as xgboost is installed on both sides.
  • Sorry that I did not explain in detail how things get passed. They are passed as rabit.init(['DMLC_KEY1=VALUE1', 'DMLC_KEY2=VALUE2'])
  • Normally the boosters trained in distributed mode and on a single machine are not identical, but here are some things to check:

    • The boosters returned from all the workers should be the same.

    • Look at the prediction validation error; it should be as low as in the single-machine case.
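
To make the argument format described above concrete, here is a minimal sketch that turns a dictionary of tracker settings into a list of 'KEY=VALUE' strings. The helper name `to_rabit_args` and the example values are hypothetical. Whether the strings must be encoded to bytes before being handed to rabit.init depends on the XGBoost version and is not covered here.

```python
def to_rabit_args(envs):
    """Format a mapping of settings as a sorted list of 'KEY=VALUE' strings."""
    return ["{}={}".format(key, value) for key, value in sorted(envs.items())]


# Hypothetical tracker settings; real values would come from the tracker.
envs = {"DMLC_TRACKER_URI": "10.0.0.1", "DMLC_TRACKER_PORT": 9091}
args = to_rabit_args(envs)
```

Sorting the keys keeps the argument list deterministic, which makes it easier to compare what different workers received.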

Two more questions about how this is typically used (I have no experience with XGBoost and only slight experience with machine learning, so please forgive my ignorance).

  1. Is it reasonable to use several workers on the same input data? (Is XGBoost compute bound?)
  2. If we're operating on a larger dataset, do I need to do anything special to tell each XGBoost worker that its data differs from that of its peers?

Which use case is more common?

Each job has to work on a different partition of the data (by rows); they should NOT look at the same input data.

  • If the data is not large enough, the multi-threaded version should do.
  • Each job collects statistics separately on its partition, and the jobs sync with each other.

This normally corresponds to the mapPartition operation in frameworks like spark/flink.

Say my dataset has 8 rows and 4 columns. If we start two workers:

  • worker 0 reads rows 0-3
  • worker 1 reads rows 4-7
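
The row split described above generalizes to any number of rows and workers. Here is a minimal sketch in plain Python. The helper `shard_rows` is hypothetical and only computes index ranges; it does not touch XGBoost or Dask.

```python
def shard_rows(n_rows, n_workers):
    """Split range(n_rows) into n_workers contiguous (start, stop) ranges."""
    base, extra = divmod(n_rows, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional row each.
        stop = start + base + (1 if worker < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


# stop is exclusive, so (0, 4) covers rows 0-3 and (4, 8) covers rows 4-7.
shards = shard_rows(8, 2)
```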

OK, what's up there now is a bit cleaner. It would be nice to have the ability to consume results as they are generated on each worker, but I've worked around that for now. Here's the current solution:

  1. Persist the dask array or dataframe on the cluster and wait for it to finish.
  2. Find where each chunk/partition ended up.
  3. Tell every worker to concatenate and train on exactly those chunks/partitions.
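
Steps 2 and 3 amount to inverting a mapping from partition to worker. Here is a minimal sketch in plain Python, assuming the cluster can report, for each partition key, the list of worker addresses that hold it. The input dictionary, the keys, and the addresses are made up for illustration, and the helper `partitions_by_worker` is hypothetical rather than part of dask-xgboost.

```python
from collections import defaultdict


def partitions_by_worker(who_has):
    """Invert {partition_key: [worker, ...]} into {worker: [partition_key, ...]}.

    Each partition is assigned to the first worker listed for it, so every
    partition is used for training exactly once.
    """
    assignment = defaultdict(list)
    for key, workers in sorted(who_has.items()):
        assignment[workers[0]].append(key)
    return dict(assignment)


who_has = {
    "part-0": ["tcp://10.0.0.1:4000"],
    "part-1": ["tcp://10.0.0.2:4000"],
    "part-2": ["tcp://10.0.0.1:4000", "tcp://10.0.0.2:4000"],
}
plan = partitions_by_worker(who_has)
```

Each worker would then concatenate the partitions listed under its address and train on that local block of rows.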

This solution seems manageable but is not ideal. It would be convenient if xgboost-python could accept results as they arrive. However, I think the next thing to do is to try it out in practice.

I'm going to look around the internet for examples. If anyone has an artificial problem that I can easily generate with the numpy or pandas API, that would be welcome. Until then, here is a trivial example on my laptop with random data:

In [1]: import dask.dataframe as dd

In [2]: df = dd.demo.make_timeseries('2000', '2001', {'x': float, 'y': float, 'z': int}, freq='1s', partition_freq=
   ...: '1D')  # some random time series data

In [3]: df.head()
Out[3]: 
                            x         y     z
2000-01-01 00:00:00  0.778864  0.824796   977
2000-01-01 00:00:01 -0.019888 -0.173454  1023
2000-01-01 00:00:02  0.552826  0.051995  1083
2000-01-01 00:00:03 -0.761811  0.780124   959
2000-01-01 00:00:04 -0.643525  0.679375   980

In [4]: labels = df.z > 1000

In [5]: del df['z']

In [6]: df.head()
Out[6]: 
                            x         y
2000-01-01 00:00:00  0.778864  0.824796
2000-01-01 00:00:01 -0.019888 -0.173454
2000-01-01 00:00:02  0.552826  0.051995
2000-01-01 00:00:03 -0.761811  0.780124
2000-01-01 00:00:04 -0.643525  0.679375

In [7]: labels.head()
Out[7]: 
2000-01-01 00:00:00    False
2000-01-01 00:00:01     True
2000-01-01 00:00:02     True
2000-01-01 00:00:03    False
2000-01-01 00:00:04    False
Name: z, dtype: bool

In [8]: from dask.distributed import Client

In [9]: c = Client()  # creates a local "cluster" on my laptop

In [10]: from dask_xgboost import train
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [11]: param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}  # taken from example

In [12]: bst = train(c, param, df, labels)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'

In [13]: bst
Out[13]: <xgboost.core.Booster at 0x7fbaacfd17b8>

The relevant code is here if anyone wants to take a look: https://github.com/mrocklin/dask-xgboost/blob/master/dask_xgboost/core.py

As I've said, I'm new to XGBoost, so I'm probably missing things.

A typical toy example to try out is in https://github.com/dmlc/xgboost/tree/master/demo/data
It is in libsvm format, though, and needs a bit of parsing to get it into numpy.

Anything bigger (that would actually need a cluster)? Or is there a standard way to generate a dataset of arbitrary size?

Or perhaps a better question is: "What would you (or anyone else reading this issue) like to see here?"

Building predict now. If I move the model back to a worker (through the pickle/unpickle process) and then call bst.predict on some data, I get the following error:

Doing rabit call after Finalize

My assumption was that at this point the model is self-contained and doesn't need to use rabit any more. It seems to work fine on my client machine. Any thoughts on why I might receive this error when calling predict?

Some part of predict still uses rabit, mainly because the predictor still uses the learner, with some initialization routines that are shared with training. Eventually this should be fixed, but this is the case for now.

I think that as long as it works OK for common datasets, it is an interesting starting point.

There are reasons to use a cluster for medium-sized data anyway (the convenience of scheduling in a cluster environment). Some pyspark users might be interested in trying it out if we advertise it a bit.

Testing on datasets that really matter has been hard (e.g. trying out a dataset with 1 billion rows). Kaggle might have some large datasets of around 10 million rows that could be relevant.

This repository shows experiments on the airlines dataset, which I think has tens of millions of rows and tens of columns (thousands after one-hot encoding?). For their benchmark it looks like they took a sample of 100k rows and artificially generated larger datasets from this sample. Presumably we could scale this up if necessary.

Here is an example using this data with pandas and xgboost on a single core. Any recommendations on data prep, parameters, or how to do this properly would be welcome.

In [1]: import pandas as pd

In [2]: df = pd.read_csv('train-0.1m.csv')

In [3]: df.head()
Out[3]: 
  Month DayofMonth DayOfWeek  DepTime UniqueCarrier Origin Dest  Distance  \
0   c-8       c-21       c-7     1934            AA    ATL  DFW       732   
1   c-4       c-20       c-3     1548            US    PIT  MCO       834   
2   c-9        c-2       c-5     1422            XE    RDU  CLE       416   
3  c-11       c-25       c-6     1015            OO    DEN  MEM       872   
4  c-10        c-7       c-6     1828            WN    MDW  OMA       423   

  dep_delayed_15min  
0                 N  
1                 N  
2                 N  
3                 N  
4                 Y  

In [4]: labels = df.dep_delayed_15min == 'Y'

In [5]: del df['dep_delayed_15min']

In [6]: df = pd.get_dummies(df)

In [7]: len(df.columns)
Out[7]: 652

In [8]: import xgboost as xgb
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [9]: dtrain = xgb.DMatrix(df, label=labels)

In [10]: param = {}  # Are there better choices for parameters?  I could use help here

In [11]: bst = xgb.train(param, dtrain)  # or other parameters here?
[17:50:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 124 extra nodes, 0 pruned nodes, max_depth=6
[17:50:30] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[17:50:32] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[17:50:33] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 116 extra nodes, 0 pruned nodes, max_depth=6
[17:50:35] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 112 extra nodes, 0 pruned nodes, max_depth=6
[17:50:36] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 114 extra nodes, 0 pruned nodes, max_depth=6
[17:50:38] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 106 extra nodes, 0 pruned nodes, max_depth=6
[17:50:39] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 116 extra nodes, 0 pruned nodes, max_depth=6
[17:50:41] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 104 extra nodes, 0 pruned nodes, max_depth=6
[17:50:43] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 100 extra nodes, 0 pruned nodes, max_depth=6

In [12]: test = pd.read_csv('test.csv')

In [13]: test.head()
Out[13]: 
  Month DayofMonth DayOfWeek  DepTime UniqueCarrier Origin Dest  Distance  \
0   c-7       c-25       c-3      615            YV    MRY  PHX       598   
1   c-4       c-17       c-2      739            WN    LAS  HOU      1235   
2  c-12        c-2       c-7      651            MQ    GSP  ORD       577   
3   c-3       c-25       c-7     1614            WN    BWI  MHT       377   
4   c-6        c-6       c-3     1505            UA    ORD  STL       258   

  dep_delayed_15min  
0                 N  
1                 N  
2                 N  
3                 N  
4                 Y  

In [14]: test_labels = test.dep_delayed_15min == 'Y'

In [16]: del test['dep_delayed_15min']

In [17]: test = pd.get_dummies(test)

In [18]: len(test.columns)  # oops, looks like the columns don't match up
Out[18]: 670

In [19]: dtest = xgb.DMatrix(test)

In [20]: predictions = bst.predict(dtest)  # this fails because of mismatched columns
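
The mismatch above (652 training columns versus 670 test columns) arises because pd.get_dummies only creates columns for the categories present in each frame. One common remedy, sketched here on made-up toy data rather than the airlines files, is to reindex the test frame to the training columns. This is an illustration under that assumption, not necessarily what the author of this session ended up doing.

```python
import pandas as pd

# Toy stand-ins for the train and test frames; the values are made up.
train = pd.get_dummies(pd.DataFrame({"Origin": ["ATL", "PIT"], "Distance": [732, 834]}))
test = pd.get_dummies(pd.DataFrame({"Origin": ["ATL", "DEN"], "Distance": [598, 1235]}))

# Keep exactly the training columns, in training order: categories unseen in
# training are dropped, and categories missing from the test set are added as 0.
test_aligned = test.reindex(columns=train.columns, fill_value=0)
```

After this step the test frame has the same feature layout the model was trained on, which is what predict needs.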

Anyway, that's one option. The airlines dataset seems to be well known and can be inconveniently large in practice. Again, though, machine learning isn't my specialty, so I don't know whether this is appropriate.

cc @TomAugspurger, who seems like the kind of person who would have thoughts on this.

Regarding Dask and predict, I can always set up rabit again. This feels a bit unclean, though, because it forces evaluation rather than keeping things lazy. But this isn't a serious blocker to use.

Running into some issues with predict. Two questions:

  1. Can I call Booster.predict several times within the same rabit session?
  2. Can I call rabit.init, Booster.predict and rabit.finalize on separate threads?

Currently I create a new tracker and call rabit.init on the main thread of the worker. This works fine. However, when I call Booster.predict in a worker thread (each dask worker maintains a thread pool for computation), I get errors like Doing rabit call after Finalize. Any recommendations?

> Some part of predict still uses rabit, mainly because the predictor still uses the learner, with some initialization routines that are shared with training. Eventually this should be fixed, but this is the case for now.

I'm curious about this. After we serialize-transfer-deserialize the trained model from a worker to my client machine, it seems to work fine on normal data even though there is no rabit network. It seems that a model trained with Rabit can be used to predict on data without Rabit. This also seems like it would be necessary in production. Can you say more about the constraints of using a rabit-trained model here?

> Example dataset / problem
> Assuming that I have everything above correct, is there a standard distributed training example that people use for demonstration?

It would be good to reproduce the results of this experiment:

https://github.com/Microsoft/LightGBM/wiki/Experiments#parallel-experiment

With the new binning + fast hist option in XGBoost (#1950), it should be possible to get similar results.

> A typical toy example to try out is in https://github.com/dmlc/xgboost/tree/master/demo/data
> It is in libsvm format, though, and needs a bit of parsing to get it into numpy.

You might be interested in this PR in sklearn: https://github.com/scikit-learn/scikit-learn/pull/935

@mrocklin There is no constraint on reusing the model, so a model trained with the distributed version can be used in the serial version. It is just that the current predictor (when compiled with rabit) shares some functions with training, which is why the rabit call happened.

Now that you say so, I think we might have a solution to the problem. Simply calling rabit.init (without passing in anything, which makes the predictor think it is the sole worker) before predict should resolve the problem.

Yup, that indeed solves the problem. dask-xgboost now supports predict: https://github.com/mrocklin/dask-xgboost/commit/827a03d96977cda8d104899c9f42f52dac446165

@tqchen Thanks for the workaround!

Here is a workflow with dask.dataframe and xgboost on a small sample of the airlines dataset on my local laptop. Does this look OK to everyone? Are there API elements of XGBoost that I am missing here?

In [1]: import dask.dataframe as dd

In [2]: import dask_xgboost as dxgb

In [3]: df = dd.read_csv('train-0.1m.csv')

In [4]: df.head()
Out[4]: 
  Month DayofMonth DayOfWeek  DepTime UniqueCarrier Origin Dest  Distance  \
0   c-8       c-21       c-7     1934            AA    ATL  DFW       732   
1   c-4       c-20       c-3     1548            US    PIT  MCO       834   
2   c-9        c-2       c-5     1422            XE    RDU  CLE       416   
3  c-11       c-25       c-6     1015            OO    DEN  MEM       872   
4  c-10        c-7       c-6     1828            WN    MDW  OMA       423   

  dep_delayed_15min  
0                 N  
1                 N  
2                 N  
3                 N  
4                 Y  

In [5]: labels = df.dep_delayed_15min == 'Y'

In [6]: del df['dep_delayed_15min']

In [7]: df = df.categorize()

In [8]: df = dd.get_dummies(df)

In [9]: data_train, data_test = df.random_split([0.9, 0.1], random_state=123)

In [10]: labels_train, labels_test = labels.random_split([0.9, 0.1], random_state=123)

In [11]: from dask.distributed import Client

In [12]: client = Client()  # in a large-data situation I probably should have done this before calling categorize above (which requires computation)

In [13]: param = {}  # Are there better choices for parameters?

In [14]: bst = dxgb.train(client, {}, data_train, labels_train)
[14:00:46] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:50] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 122 extra nodes, 0 pruned nodes, max_depth=6
[14:00:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:00:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 114 extra nodes, 0 pruned nodes, max_depth=6
[14:00:59] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:01:01] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:01:04] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 94 extra nodes, 0 pruned nodes, max_depth=6
[14:01:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 102 extra nodes, 0 pruned nodes, max_depth=6

In [15]: bst
Out[15]: <xgboost.core.Booster at 0x7f689803af60>

In [16]: predictions = dxgb.predict(client, bst, data_test)

In [17]: predictions
Out[17]: 
Dask Series Structure:
npartitions=1
None    float32
None        ...
Name: predictions, dtype: float32
Dask Name: _predict_part, 9 tasks

My short-term goal here is to write up a short blog post about this, so that hopefully someone with more XGBoost experience and more time adopts this project and pushes it forward. (I, like everyone else here, am working on a few other projects like this at the same time.)

I'm partial to the airlines dataset only because I already have it in an S3 bucket. I agree, though, that the Criteo dataset would make for a better demonstration at scale.

I'm still not sure what parameters to use or how to judge the result. For parameters I can use the experiment from @szilard here. Is there a good way to judge the predictions? For example, are we looking for predictions > 0.5 to match labels_test?

Perhaps the most common way to evaluate predictive performance for binary classification (especially in research or competition settings) is the area under the ROC curve (AUC), though in real-world applications one should use metrics that are aligned with the "business" value one is trying to get from the model.

> For example, are we looking for predictions > 0.5 to match labels_test?

Yes. If you take the mean of that on a test set, you get the test accuracy. But the dataset is likely to be imbalanced (far more non-clicks than clicks). In that case the ROC AUC score is a better metric.

from sklearn.metrics import roc_auc_score
print(roc_auc_score(labels_test, predictions))

This assumes predictions is a 1D array of positive-class probabilities estimated by the model for each row in the test set.
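
To see why accuracy can mislead on imbalanced data, here is a small standard-library sketch with made-up labels and scores. The helper names are hypothetical, and the pairwise AUC below is a simple quadratic-time illustration of the metric's definition, not a replacement for sklearn's roc_auc_score.

```python
def accuracy_at_threshold(labels, scores, threshold=0.5):
    """Fraction of rows where (score > threshold) agrees with the label."""
    hits = sum((score > threshold) == bool(label) for label, score in zip(labels, scores))
    return hits / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 0, 0, 1]            # imbalanced: one positive in five rows
scores = [0.1, 0.1, 0.1, 0.1, 0.1]  # a model that ignores its input
acc = accuracy_at_threshold(labels, scores)
auc = roc_auc(labels, scores)
```

The constant model scores 80% accuracy simply by always predicting the majority class, while its AUC of 0.5 shows that it cannot rank positives above negatives at all.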

@mrocklin One follow-up question: does Dask allow multi-threaded worker jobs? I know this is not very relevant to Python because of the GIL, but xgboost can run multi-threaded training within each worker while the workers still coordinate with each other in a distributed fashion. We should always set xgboost's nthread argument to the number of cores that worker can use.

The short answer is "yes". Most uses of Dask involve NumPy, Pandas, SKLearn, and other projects that are mostly C and Fortran code wrapped in Python. The GIL doesn't affect these libraries. Some people do use Dask for applications similar to PySpark RDDs (see dask.bag), and they are affected, but this group is in the minority.

So yes, Dask allows multi-threaded tasks. How do we tell XGBoost to use multiple threads? In my experiments so far I'm seeing high CPU utilization without changing any parameters, so maybe everything works well by default?

XGBoost is multi-threaded by default and, if nthread is not set, uses every CPU thread available on the machine (not just those of that worker). This can lead to contention for the same cores when multiple workers are assigned to the same machine.

So it is always a good idea to set the nthread parameter to the maximum number of cores the worker is allowed to use. A good rule of thumb is about 4 threads per worker.
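As a rough illustration of that advice, the sketch below splits a machine's cores evenly among the workers placed on it and passes the result as nthread. The helper threads_per_worker is hypothetical (it is not part of XGBoost or Dask), and the 16-core, 4-worker layout is an assumed example.

```python
import os


def threads_per_worker(workers_on_machine, total_cores=None):
    """Hypothetical helper: split a machine's cores evenly among its workers."""
    if total_cores is None:
        # Fall back to the local core count when none is given.
        total_cores = os.cpu_count() or 1
    # Always give each worker at least one thread.
    return max(1, total_cores // workers_on_machine)


# Assumed layout: a 16-core machine hosting 4 workers.
nthread = threads_per_worker(workers_on_machine=4, total_cores=16)

# XGBoost parameter dict with the thread count set explicitly, so a worker
# does not try to use every core on the machine.
params = {"objective": "binary:logistic", "nthread": nthread}
print(params)
```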

Sure,
https://github.com/mrocklin/dask-xgboost/commit/c22d066b67c78710d5ad99b8620edc55182adc8f

Notebook: https://gist.github.com/19c89d78e34437e061876a9872f4d2df
Short screencast (6 minutes): https://youtu.be/Cc4E-PdDSro

Critical feedback is very welcome. Again, please forgive my ignorance of this field.

@mrocklin Nice demo! I think the runtime performance (and possibly the memory usage) could be greatly improved by using 'tree_method': 'hist', 'grow_policy': 'lossguide' in the param dict.

Thanks, @ogrisel. With those parameters the training time drops from six minutes to one minute. Memory usage seems to stay about the same, though.
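For reference, a param dict with those two settings might look like the sketch below. Only tree_method and grow_policy come from the suggestion above; the objective and nthread values are assumptions chosen for illustration and are not taken from the notebook.

```python
# Baseline parameters (illustrative values, not taken from the notebook).
base_params = {"objective": "binary:logistic", "nthread": 4}

# Add the histogram-based tree method and loss-guided growth policy.
fast_params = {**base_params, "tree_method": "hist", "grow_policy": "lossguide"}
print(fast_params)
```

A dict like this would be passed as the params argument when training, in place of the baseline one.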

OK, coming back to this. Are there any XGBoost operations other than train and predict that we should implement?

@tqchen or @ogrisel, if either of you has time to look over the implementation at https://github.com/mrocklin/dask-xgboost/blob/master/dask_xgboost/core.py I would be grateful. I understand, though, that looking over a foreign codebase isn't always high on the priority list.

If everything is fine, I'll add a bit more to the README and publish to PyPI, and then we can close this issue.

I think train and predict are the only things that need to be distributed. The other operations do not need to be distributed because they do not depend on the dataset.

I've pushed dask-xgboost to PyPI and moved it to https://github.com/dask/dask-xgboost

Thanks to @tqchen and @ogrisel for your help here. Collaboration made this relatively easy.

I'd be happy to help anyone who wants to run benchmarks. Until then, closing.
