Hello, I'm an author of Dask, a library for parallel and distributed computing in Python. I'm curious whether there is interest within this community in collaborating on distributing XGBoost on Dask, either for parallel training or for ETL.
There are probably two components of Dask that are relevant to this project.
Is there interest in collaboration here?
@mrocklin I believe Dask integrates with sklearn. Have you looked at our sklearn wrapper to see whether it works with that?
Meaningful integration with a distributed system usually has to happen at the algorithm level rather than at the library level. There are a few ways in which SKLearn and Dask can help each other, but they do not go very deep.
Dask dataframes would be a good start. In our codebase we check for pandas dataframes. That may be a good place for a dask dataframe to fit in as a starting point.
So what happens if someone arrives with a multi-terabyte dask dataframe? Do you just convert it to pandas and proceed? Or is there a way to parallelize XGBoost intelligently across a cluster, pointing at the various pandas dataframes that make up the dask dataframe?
Can users specify a batch size? I imagine users could benefit through partial_fit.
cc @tqchen, who is more familiar with the distributed part of the code.
The distributed version of xgboost can be hooked into a distributed job launcher. Ideally you feed the data partitions into xgboost and continue from there.
@mrocklin I think the most relevant parts are the xgboost-spark and xgboost-flink modules, which embed xgboost into the mapPartition function of spark/flink. I would guess there is something similar in Dask.
The requirement from the xgboost side is that XGBoost handles the inter-process connections itself through rabit, and that a tracker (which connects the jobs to each other) has to be started from the client side.
See the relevant code at https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala#L112
Rabit is designed to be embedded into other distributed systems, so I don't think adapting it on the Python side will be too hard.
Launching other distributed systems from Dask is usually quite doable. How do you move data from the hosting distributed system (spark/flink/dask) to xgboost? Or is this meant for distributed training on small data?
More concretely, I expect to build a system like the following.
Does this match your expectations? Is it easy to point me to the relevant Python API?
Yes, for the Python API see the relevant material at https://github.com/dmlc/xgboost/blob/master/tests/distributed/
What needs to be done in addition is to start a rabit tracker on the driver side (most likely the place that drives dask). This is done in the dmlc-submit script in https://github.com/dmlc/dmlc-core
OK, filling in my outline:
Start a rabit tracker on the driver/scheduler node
envs = {'DMLC_NUM_WORKER' : nworker,
'DMLC_NUM_SERVER' : nserver}
rabit = RabitTracker(hostIP=ip_address, nslave=num_workers)
envs.update(rabit.slave_envs())
rabit.start(args.num_workers) # manages connections in background thread
We can go through a similar process to start a PSTracker. Should this be on the same centralized machine, or elsewhere on the network? Should there be several of these? Should this be user-configurable?
Eventually I have my tracker (and pstrackers?) join the rabit network and block.
rabit.join() # join network
On the worker nodes I need to dump these environment variables (which I'll move through normal dask channels) into the local environment. After that, just calling xgboost.rabit.init() should suffice.
import os
os.environ.update(envs)
xgboost.rabit.init()
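One detail in the snippet above deserves care: os.environ only accepts string values, so if the dict built on the driver holds integers (a worker count, for example), updating os.environ with it directly raises a TypeError. Below is a minimal sketch of the hand-off, assuming the dict reaches the worker through normal dask channels. apply_tracker_env is a hypothetical helper, not part of xgboost or dask, and the host and port values are made up for illustration.

```python
import os


def apply_tracker_env(envs, environ=None):
    """Copy tracker settings into an environment mapping as strings.

    `envs` is the dict built on the driver (DMLC_* keys); values may be
    ints, so they are converted with str() before being stored.
    """
    target = os.environ if environ is None else environ
    target.update({str(key): str(value) for key, value in envs.items()})
    return target


# Illustrative values only; a real tracker would supply these.
example_envs = {
    "DMLC_NUM_WORKER": 2,
    "DMLC_NUM_SERVER": 0,
    "DMLC_TRACKER_URI": "127.0.0.1",
    "DMLC_TRACKER_PORT": 9091,
}
# A plain dict stands in for os.environ so the example has no side effects.
local_env = apply_tracker_env(example_envs, environ={})
print(local_env["DMLC_NUM_WORKER"])  # stored as the string "2"
```

On a real worker you would call apply_tracker_env(envs) with no second argument, then call the init function.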
Looking at the Rabit code, it seems that environment variables are the only way to provide this information. Can you confirm this? Is there a way to pass the tracker host/port information in directly as inputs?
Next I convert my numpy arrays / pandas dataframes / scipy sparse arrays into DMatrix objects, which seems relatively straightforward. However, I am likely to have multiple batches of data per worker. Is there a clean way to call train several times with more data as it becomes available? I'm concerned about the comments on these lines:
# Run training, all the features in training API is available.
# Currently, this script only support calling train once for fault recovery purpose.
bst = xgb.train(param, dtrain, num_round, watchlist, early_stopping_rounds=2)
Do I need to wait for all of the data to arrive before starting training?
Assuming all of the above is correct, is there a standard distributed training example that people use for demonstrations?
There is no need to start a pstracker.
I had some time to play with this this morning. Results here: https://github.com/mrocklin/dask-xgboost
So far it only handles distributed learning on a single in-memory dataset. Some questions came up:
How does this map onto the arguments of rabit.init? What exactly is the expected form of the inputs to rabit.init? Passing the result of slave_envs() to rabit.init clearly does not work, since it expects a list. Do I need to convert each key name to --key, perhaps dropping the DMLC prefix and converting to lower case? rabit.init(['DMLC_KEY1=VALUE1', 'DMLC_KEY2=VALUE2'])
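The last form in the question, a list of KEY=VALUE strings, is easy to build from a dict such as the one slave_envs() returns. The sketch below shows only that conversion. Whether rabit.init wants str or bytes entries in a given xgboost version is an assumption to verify, and envs_to_rabit_args is a hypothetical helper name.

```python
def envs_to_rabit_args(envs, encode=True):
    """Turn {'DMLC_KEY': value} into ['DMLC_KEY=value', ...].

    With encode=True each entry is returned as UTF-8 bytes, in case the
    init call expects bytes; set encode=False for plain strings.
    Keys are sorted so the output order is deterministic.
    """
    args = ["%s=%s" % (key, value) for key, value in sorted(envs.items())]
    return [arg.encode("utf-8") for arg in args] if encode else args


envs = {"DMLC_KEY1": "VALUE1", "DMLC_KEY2": "VALUE2"}
print(envs_to_rabit_args(envs, encode=False))
# ['DMLC_KEY1=VALUE1', 'DMLC_KEY2=VALUE2']
```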
I also have two more general questions about how this gets used (I have no experience with XGBoost and only a little with machine learning, so please forgive my ignorance).
Which is the more common use case?
Each worker should work on a different partition of the data (by rows); the workers should not all look at the same input data.
This usually corresponds to a mapPartition operation in frameworks such as spark/flink.
Say you start two workers and your dataset has 8 rows and 4 columns.
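To make the row-wise split concrete: with two workers and 8 rows of 4 columns, each worker would hold 4 complete rows, and no row would appear on both. Here is a small pure-Python sketch. split_rows is a hypothetical helper; in Dask the partitions of a dataframe already play this role.

```python
def split_rows(rows, n_workers):
    """Split a list of rows into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(len(rows), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` workers take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        chunks.append(rows[start:stop])
        start = stop
    return chunks


# 8 rows x 4 columns, as in the example above.
data = [[r * 4 + c for c in range(4)] for r in range(8)]
parts = split_rows(data, 2)
print([len(p) for p in parts])  # [4, 4]
```

Every worker sees all 4 columns but only its own rows, which is the partitioning a mapPartition-style operation provides.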
OK, what is up there now is a bit better. It would be nice to be able to consume results as they are generated on each worker, but I've worked around that for now. The current solution is as follows.
This solution seems manageable but is not ideal. It would be convenient if xgboost-python could accept results as they arrive. However, I think the next thing to do is to try this out in practice.
I'm going to look around the internet for examples. If anyone happens to have an artificial problem that is easy to generate with the numpy or pandas API, that would be welcome. Until then, here is a trivial example on my laptop with random data:
In [1]: import dask.dataframe as dd
In [2]: df = dd.demo.make_timeseries('2000', '2001', {'x': float, 'y': float, 'z': int}, freq='1s', partition_freq=
...: '1D') # some random time series data
In [3]: df.head()
Out[3]:
x y z
2000-01-01 00:00:00 0.778864 0.824796 977
2000-01-01 00:00:01 -0.019888 -0.173454 1023
2000-01-01 00:00:02 0.552826 0.051995 1083
2000-01-01 00:00:03 -0.761811 0.780124 959
2000-01-01 00:00:04 -0.643525 0.679375 980
In [4]: labels = df.z > 1000
In [5]: del df['z']
In [6]: df.head()
Out[6]:
x y
2000-01-01 00:00:00 0.778864 0.824796
2000-01-01 00:00:01 -0.019888 -0.173454
2000-01-01 00:00:02 0.552826 0.051995
2000-01-01 00:00:03 -0.761811 0.780124
2000-01-01 00:00:04 -0.643525 0.679375
In [7]: labels.head()
Out[7]:
2000-01-01 00:00:00 False
2000-01-01 00:00:01 True
2000-01-01 00:00:02 True
2000-01-01 00:00:03 False
2000-01-01 00:00:04 False
Name: z, dtype: bool
In [8]: from dask.distributed import Client
In [9]: c = Client() # creates a local "cluster" on my laptop
In [10]: from dask_xgboost import train
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
In [11]: param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'} # taken from example
In [12]: bst = train(c, param, df, labels)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
In [13]: bst
Out[13]: <xgboost.core.Booster at 0x7fbaacfd17b8>
誰ããèŠãŠã¿ããå Žåã¯ãé¢é£ããã³ãŒããããã«ãããŸãïŒ https ïŒ//github.com/mrocklin/dask-xgboost/blob/master/dask_xgboost/core.py
ç§ãèšã£ãããã«ãç§ã¯XGBoostã«äžæ £ããªã®ã§ãããããç©äºãæ¬ ããŠããŸãã
è©ŠããŠã¿ãå
žåçãªããã¡ãã®äŸã¯https://github.com/dmlc/xgboost/tree/master/demo/dataã«ãããŸã
ããã¯libsvm圢åŒã§ãããnumpyã«ããã«ã¯å°ã解æããå¿
èŠããããŸã
ãã倧ããªãã®ïŒå®éã«ã¯ã©ã¹ã¿ãŒãå¿ èŠã«ãªããã®ïŒã¯ãããŸããïŒ ãŸãã¯ãä»»æã®ãµã€ãºã®ããŒã¿ââã»ãããçæããæšæºçãªæ¹æ³ã¯ãããŸããïŒ
ãŸãã¯ãããããããè¯ã質åã¯ããããªãïŒãŸãã¯ãã®åé¡ãèªãã§ããä»ã®äººïŒã¯ããã§äœãèŠããã§ããïŒãã§ãã
建ç©ã¯ä»äºæž¬ããŸãã ã¢ãã«ãã¯ãŒã«ãŒã«æ»ãïŒãã¯ã«ã¹/ã¢ã³ãã¯ã«ããã»ã¹ãå®è¡ïŒãäžéšã®ããŒã¿ã§bst.predict
ãåŒã³åºããšã次ã®ãšã©ãŒãçºçããŸãã
Doing rabit call after Finalize
My assumption is that at this point the model is self-contained and no longer needs to use rabit. It seems to work fine on the client machine. Is there any reason why I might get this error when calling predict?
Some parts of predict still use rabit, mainly because the predictor still uses the learner, with some initialization routines that are shared with training. Eventually we should fix this, but that is the case for now.
I think that as long as it works fine on a common dataset, it is an interesting starting point.
There are reasons to use a cluster for medium-sized data anyway (ease of scheduling in a cluster environment), and some pyspark users might be interested in trying it if we advertise it a bit.
Testing on datasets that really matter is hard, e.g. try a dataset with 1 billion rows. Kaggle may have some relevant large datasets, on the order of 10 million rows.
This repository shows experiments against the airlines dataset, which I believe has tens of millions of rows and tens of columns (after one-hot encoding?). For the benchmark they appear to take a sample of 100k rows and artificially generate larger datasets from that sample. Presumably we could scale this up further as needed.
Here is an example using this data with pandas and xgboost on a single core. Any recommendations on data preparation, parameters, or how to do this properly would be most welcome.
In [1]: import pandas as pd
In [2]: df = pd.read_csv('train-0.1m.csv')
In [3]: df.head()
Out[3]:
Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance \
0 c-8 c-21 c-7 1934 AA ATL DFW 732
1 c-4 c-20 c-3 1548 US PIT MCO 834
2 c-9 c-2 c-5 1422 XE RDU CLE 416
3 c-11 c-25 c-6 1015 OO DEN MEM 872
4 c-10 c-7 c-6 1828 WN MDW OMA 423
dep_delayed_15min
0 N
1 N
2 N
3 N
4 Y
In [4]: labels = df.dep_delayed_15min == 'Y'
In [5]: del df['dep_delayed_15min']
In [6]: df = pd.get_dummies(df)
In [7]: len(df.columns)
Out[7]: 652
In [8]: import xgboost as xgb
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
In [9]: dtrain = xgb.DMatrix(df, label=labels)
In [10]: param = {} # Are there better choices for parameters? I could use help here
In [11]: bst = xgb.train(param, dtrain) # or other parameters here?
[17:50:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 124 extra nodes, 0 pruned nodes, max_depth=6
[17:50:30] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[17:50:32] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[17:50:33] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 116 extra nodes, 0 pruned nodes, max_depth=6
[17:50:35] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 112 extra nodes, 0 pruned nodes, max_depth=6
[17:50:36] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 114 extra nodes, 0 pruned nodes, max_depth=6
[17:50:38] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 106 extra nodes, 0 pruned nodes, max_depth=6
[17:50:39] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 116 extra nodes, 0 pruned nodes, max_depth=6
[17:50:41] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 104 extra nodes, 0 pruned nodes, max_depth=6
[17:50:43] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 100 extra nodes, 0 pruned nodes, max_depth=6
In [12]: test = pd.read_csv('test.csv')
In [13]: test.head()
Out[13]:
Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance \
0 c-7 c-25 c-3 615 YV MRY PHX 598
1 c-4 c-17 c-2 739 WN LAS HOU 1235
2 c-12 c-2 c-7 651 MQ GSP ORD 577
3 c-3 c-25 c-7 1614 WN BWI MHT 377
4 c-6 c-6 c-3 1505 UA ORD STL 258
dep_delayed_15min
0 N
1 N
2 N
3 N
4 Y
In [14]: test_labels = test.dep_delayed_15min == 'Y'
In [16]: del test['dep_delayed_15min']
In [17]: test = pd.get_dummies(test)
In [18]: len(test.columns) # oops, looks like the columns don't match up
Out[18]: 670
In [19]: dtest = xgb.DMatrix(test)
In [20]: predictions = bst.predict(dtest) # this fails because of mismatched columns
Anyway, that is one option. The airlines dataset seems to be well known and can, in practice, be inconveniently large. Again, machine learning is not my specialty, so I don't know whether this is appropriate.
cc @TomAugspurger, who seems like the kind of person who might have thoughts on this.
Regarding Dask and predict, I can always set up rabit again. That feels a little unclean, though, because it forces evaluation rather than keeping things lazy. It is not a serious blocker to use.
I'm running into some problems with predict. Two questions:
Can I call Booster.predict multiple times?
Can I call rabit.init, Booster.predict, and rabit.finalize in separate threads? Currently I create a new tracker and call rabit.init on the main thread of the worker. This works fine. However, when I call Booster.predict in a worker thread (each dask worker maintains a thread pool for computation), I get errors like Doing rabit call after Finalize. Any recommendations?
Some parts of predict still use rabit, mainly because the predictor still uses the learner, with some initialization routines that are shared with training. Eventually we should fix this, but that is the case for now.
I'm curious about this. After I serialize, transfer, and deserialize the trained model from a worker to my client machine, it seems to work fine on normal data even without a rabit network. So a model trained with Rabit can apparently be used to predict on data without Rabit, which also seems necessary in production. Can you say more about the constraints on using a rabit-trained model here?
Example dataset / problem
Assuming all of the above is correct, is there a standard distributed training example that people use for demonstrations?
It would be nice if we could reproduce the results of this experiment:
https://github.com/Microsoft/LightGBM/wiki/Experiments#parallel-experiment
With the new binning + fast hist option in XGBoost (#1950), it should be possible to get similar results.
A typical toy example to try is at https://github.com/dmlc/xgboost/tree/master/demo/data
It is in libsvm format, though, and needs a bit of parsing to get into numpy.
You may be interested in this PR in sklearn: https://github.com/scikit-learn/scikit-learn/pull/935
@mrocklin There is no constraint on reusing the model, so a model trained in the distributed version can be used in the serial version. The only limitation is that the current predictor (when compiled with rabit) shares some functionality with the training function, so rabit calls still happen.
Now that you say that, I think we may have a solution to the problem. Simply call rabit.init (passing nothing in, so that the predictor thinks it is the only worker) before predict, and that should solve it.
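The workaround described here amounts to wrapping each predict call in its own init/finalize pair. The sketch below shows that shape without importing xgboost: the init and finalize callables stand in for xgboost.rabit.init and xgboost.rabit.finalize as named in this thread, model_predict stands in for Booster.predict, and standalone_rabit and predict_part are hypothetical names. The DMatrix conversion is omitted.

```python
from contextlib import contextmanager


@contextmanager
def standalone_rabit(init, finalize):
    """Run a block between init() and finalize(), even if the block raises."""
    init()  # called with no arguments, so no tracker information is passed
    try:
        yield
    finally:
        finalize()


def predict_part(model_predict, part, init, finalize):
    """Predict on one partition inside its own init/finalize pair."""
    with standalone_rabit(init, finalize):
        return model_predict(part)


# Stand-ins so the sketch runs without xgboost installed.
calls = []
result = predict_part(
    model_predict=lambda part: [0.5 for _ in part],
    part=[[1.0, 2.0], [3.0, 4.0]],
    init=lambda: calls.append("init"),
    finalize=lambda: calls.append("finalize"),
)
print(calls, result)  # ['init', 'finalize'] [0.5, 0.5]
```

Keeping init, predict, and finalize inside one function call also keeps them on the same thread, which matters given the thread-pool question raised earlier.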
Yes, that does indeed solve the problem. dask-xgboost now supports predict: https://github.com/mrocklin/dask-xgboost/commit/827a03d96977cda8d104899c9f42f52dac446165
Thanks for the workaround, @tqchen!
Here is a workflow with dask.dataframe and xgboost on a small sample of the airlines dataset on my local laptop. Does this look OK to everyone? Are there API elements of XGBoost that I'm missing here?
In [1]: import dask.dataframe as dd
In [2]: import dask_xgboost as dxgb
In [3]: df = dd.read_csv('train-0.1m.csv')
In [4]: df.head()
Out[4]:
Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance \
0 c-8 c-21 c-7 1934 AA ATL DFW 732
1 c-4 c-20 c-3 1548 US PIT MCO 834
2 c-9 c-2 c-5 1422 XE RDU CLE 416
3 c-11 c-25 c-6 1015 OO DEN MEM 872
4 c-10 c-7 c-6 1828 WN MDW OMA 423
dep_delayed_15min
0 N
1 N
2 N
3 N
4 Y
In [5]: labels = df.dep_delayed_15min == 'Y'
In [6]: del df['dep_delayed_15min']
In [7]: df = df.categorize()
In [8]: df = dd.get_dummies(df)
In [9]: data_train, data_test = df.random_split([0.9, 0.1], random_state=123)
In [10]: labels_train, labels_test = labels.random_split([0.9, 0.1], random_state=123)
In [11]: from dask.distributed import Client
In [12]: client = Client() # in a large-data situation I probably should have done this before calling categorize above (which requires computation)
In [13]: param = {} # Are there better choices for parameters?
In [14]: bst = dxgb.train(client, {}, data_train, labels_train)
[14:00:46] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:50] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 122 extra nodes, 0 pruned nodes, max_depth=6
[14:00:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:00:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 114 extra nodes, 0 pruned nodes, max_depth=6
[14:00:59] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:01:01] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:01:04] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 94 extra nodes, 0 pruned nodes, max_depth=6
[14:01:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 102 extra nodes, 0 pruned nodes, max_depth=6
In [15]: bst
Out[15]: <xgboost.core.Booster at 0x7f689803af60>
In [16]: predictions = dxgb.predict(client, bst, data_test)
In [17]: predictions
Out[17]:
Dask Series Structure:
npartitions=1
None float32
None ...
Name: predictions, dtype: float32
Dask Name: _predict_part, 9 tasks
My short-term goal is to write a short blogpost about this, so that hopefully someone else with more XGBoost experience and more time can pick up this project and push it forward. (Like everyone else here, I'm working on several other projects like this at the same time.)
I'm partial to the airlines dataset only because I already have it in an S3 bucket. I agree, though, that the Criteo dataset would make a better demonstration at scale.
I'm still not sure which parameters to use or how to judge the result. For parameters I can use the experiment from @szilard here. Is there a good way to judge the predictions? For example, are we looking for predictions > 0.5 to match labels_test?
Probably the most common way to evaluate predictive performance for binary classification (especially in research or competition settings) is the area under the ROC curve (AUC), although in real-world applications one should use metrics that are aligned with the "business" value produced by using the model.
For example, are we looking for predictions over 0.5 to match labels_test?
Yes. If you take the mean of that over the test set, you get the test accuracy. However, the dataset may be imbalanced (far more non-clicks than clicks). In that case the ROC AUC score is a better metric.
from sklearn.metrics import roc_auc_score
print(roc_auc_score(labels_test, predictions))
This assumes that predictions is a 1D array of positive-class probabilities estimated by the model for each row of the test set.
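For readers unfamiliar with the two metrics mentioned here, the following pure-Python sketch computes both on a tiny made-up example. It is meant only to show what the numbers mean, not to replace roc_auc_score. The function names are hypothetical and the pairwise AUC loop is quadratic, so it is unsuitable for real test sets.

```python
def accuracy_at_threshold(labels, scores, threshold=0.5):
    """Fraction of rows where (score > threshold) equals the label."""
    hits = sum((s > threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half. O(n_pos * n_neg); fine for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [False, False, True, True]
scores = [0.1, 0.4, 0.35, 0.8]
print(accuracy_at_threshold(labels, scores))  # 0.75
print(roc_auc(labels, scores))  # 0.75
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores rank positives against negatives, which is why it holds up better on imbalanced data.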
@mrocklin One follow-up question: does dask allow multi-threaded worker jobs? I know this is not very relevant to Python because of the GIL. With xgboost, however, each worker can run multi-threaded training while still coordinating with the others in a distributed way. The nthread argument of xgboost should always be set to the number of cores available to that worker.
The short answer is "yes". Most uses of Dask involve projects like NumPy, Pandas, and SKLearn, which are mostly C and Fortran code wrapped in Python. The GIL does not affect these libraries. Some people use Dask for applications similar to PySpark RDDs (see dask.bag), and they are affected. That group is in the minority, though.
So yes, Dask allows multi-threaded tasks. How do I tell XGBoost to use multiple threads? In my experiments so far I see high CPU utilization without changing any parameters, so perhaps everything works well by default?
XGBoost uses multiple threads by default, and when nthread is not set it uses all the CPU threads available on the machine (not just those of that worker). This can cause CPU contention when multiple workers are assigned to the same machine.
So it is always good to set the nthread parameter to the maximum number of cores the worker is allowed to use. A good practice is usually to use around 4 threads per worker.
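As a rough illustration of that advice, the sketch below derives an nthread value by dividing a machine's cores evenly among the workers placed on it. threads_per_worker is a hypothetical helper, and the even split is an assumption. A real deployment would more likely take the value from however many threads each Dask worker was configured with.

```python
import os


def threads_per_worker(workers_on_machine, total_cores=None):
    """Split a machine's cores evenly between the workers running on it."""
    cores = total_cores if total_cores is not None else (os.cpu_count() or 1)
    # Never return less than one thread, even on oversubscribed machines.
    return max(1, cores // workers_on_machine)


# Example: a 16-core machine hosting 4 workers.
param = {"objective": "binary:logistic"}
param["nthread"] = threads_per_worker(workers_on_machine=4, total_cores=16)
print(param["nthread"])  # 4
```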
確ãã«ãã§éæããå¿
èŠããããŸã
https://github.com/mrocklin/dask-xgboost/commit/c22d066b67cââ78710d5ad99b8620edc55182adc8f
Notebook: https://gist.github.com/19c89d78e34437e061876a9872f4d2df
Short screencast (6 minutes): https://youtu.be/Cc4E-PdDSro
Critical feedback is most welcome. Again, please forgive my ignorance in this field.
@mrocklin Great demo! I think you could greatly improve the runtime performance (and possibly the memory usage) by using 'tree_method': 'hist', 'grow_policy': 'lossguide' in the param dict.
Thanks @ogrisel. With these parameters the training time goes from six minutes to one minute. Memory usage seems to stay about the same, though.
OK, coming back to this. Are there any XGBoost operations other than train and predict that should be implemented?
If either @tqchen or @ogrisel has time to look over the implementation at https://github.com/mrocklin/dask-xgboost/blob/master/dask_xgboost/core.py, I would be grateful. I understand, though, that looking through a foreign codebase is not always high on the priority list.
If everything looks fine, I'll add a bit more to the README and publish to PyPI, and then we can probably close this issue.
I think only train and predict need to be distributed. The other operations do not need to be distributed, since they do not operate on the dataset.
I've pushed dask-xgboost to PyPI and moved it to https://github.com/dask/dask-xgboost
Thanks for your help here, @tqchen and @ogrisel. The collaboration made this relatively easy.
If anyone wants to run benchmarks, I'd be happy to help. Until then, closing.