xgboost - Distributed computing with Dask

@mrocklin I think Dask has integration with sklearn. Have you taken a look at our sklearn wrapper to see whether it would work with that?

terrytangyuan on 2017-02-13

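Integrating meaningfully with a distributed system generally has to be done at the per-algorithm level rather than at the library level. SKLearn and Dask can help each other in some ways, yes, but not particularly deeply.

mrocklin on 2017-02-13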

Dask dataframe would be a good start. In our code base we check for pandas dataframes. That might be a place where dask dataframe fits in as a start.

terrytangyuan on 2017-02-13

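So what happens if someone arrives with a multi-terabyte dask dataframe? Do you just convert it to pandas and proceed? Or is there a way to parallelize XGBoost intelligently across a cluster, pointing at the various pandas dataframes that make up the dask dataframe?

mrocklin on 2017-02-13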

Can users specify a batch size? I guess users could benefit from partial_fit.

cc @tqchen, who is more familiar with the distributed part of the code.

terrytangyuan on 2017-02-13

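The distributed version of xgboost can be hooked into a distributed job launcher; ideally the launcher feeds a data partition into xgboost and then proceeds.

@mrocklin I think the most relevant parts are the xgboost-spark and xgboost-flink modules, which embed xgboost into the mapPartition function of spark/flink. I guess there would be something similar for Dask.

The requirement from the xgboost side is that XGBoost handles inter-process connections through rabit, and a tracker (which connects each job) needs to be started from the client side.

tqchen on 2017-02-13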

See the relevant code at https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala#L112

Rabit is designed to be embedded into other distributed systems, so I think adapting it on the python side should not be too hard.

tqchen on 2017-02-13

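Launching other distributed systems from Dask is usually quite doable. How do you move data from the hosting distributed system (spark/flink/dask) into XGBoost? Or is this for distributed training on small data?

mrocklin on 2017-02-13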

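More concretely, I expect to build a system as follows:

On every dask worker I start up a Rabit server. Dask provides these Rabit servers with enough information to find each other.
I create some local XGBoost state on every worker that represents the model currently being trained
I repeatedly feed this per-worker object pandas dataframes or numpy arrays
I listen for some signal from XGBoost that tells me to stop

Does this match your expectations? Is it easy for you to point me to the relevant Python API?

mrocklin on 2017-02-13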

Yes, see the relevant information at https://github.com/dmlc/xgboost/blob/master/tests/distributed/ for the python API.

What you also need to do is start a rabit tracker on the driver side (likely the place that drives dask). This is done in the dmlc-submit script: https://github.com/dmlc/dmlc-core/tree/master/tracker/dmlc_tracker

tqchen on 2017-02-15

OK, filling in my outline from before:

Before running any XGBoost code we set up a Rabit network

On the driver/scheduler node we start a rabit tracker

envs = {'DMLC_NUM_WORKER' : nworker,
        'DMLC_NUM_SERVER' : nserver}

rabit = RabitTracker(hostIP=ip_address, nslave=num_workers)
envs.update(rabit.slave_envs())
rabit.start(args.num_workers)  # manages connections in background thread

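I would probably also go through a similar process to start a PSTracker. Should this be on the same centralized machine or somewhere else in the network? Should there be several of these? Should this be user-configurable?

Eventually I have my tracker (and pstrackers?) join the rabit network and block.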

rabit.join()  # join network

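On the worker nodes I need to dump these environment variables (which I'll move through the normal dask channels) into the local environment. Then just calling xgboost.rabit.init() should suffice.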

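import os
import xgboost
# os.environ only accepts string values, so convert counts and ports first
os.environ.update({key: str(value) for key, value in envs.items()})
xgboost.rabit.init()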

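Looking at the Rabit code, it seems that environment variables are the only way to provide this information. Can you confirm this? Is there a way to provide the tracker host/port information as direct inputs?

Training

I then convert my numpy arrays / pandas dataframes / scipy sparse arrays into DMatrix objects, which seems relatively straightforward. However, I am likely to have several batches of data per worker. Is there a clean way to call train multiple times with more data? I'm concerned about the comments on these lines: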

# Run training, all the features in training API is available.
# Currently, this script only support calling train once for fault recovery purpose.
bst = xgb.train(param, dtrain, num_round, watchlist, early_stopping_rounds=2)

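Do we need to wait for all of the data to arrive before starting training?

Example dataset / problem

Assuming that I have everything above correct, is there a standard distributed training example that people use for demonstration?

mrocklin on 2017-02-15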

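There is no need to start a pstracker.

The tracker only needs to be started in one place, likely on the scheduler (driver). It does no data-heavy work and is only used to connect the workers.
The env args can be passed as kwargs in rabit.init
Since tree boosting is a batch algorithm, we do need to wait for all of the data to be ingested before training starts.
- Note, however, that each worker only needs to take one shard (a subset of rows) of the data.
- Ideally we should use the data iterator interface to pass data into DMatrix in a mini-batch fashion, so the entire dataset does not have to sit in memory
- This is done via https://github.com/dmlc/xgboost/blob/master/include/xgboost/c_api.h#L117, which does not yet have a python wrapper.
- For a first solution, I would recommend going through arrays directly

tqchen on 2017-02-15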

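I had some time to play with this this morning. Results are here: https://github.com/mrocklin/dask-xgboost

So far it only handles distributed learning of a single in-memory dataset. Some questions came up:

What is the best way to serialize and pass around DMatrix objects?
What is the best way to serialize and return a Booster result?
How do the environment variables listed above map to the arguments of rabit.init? What exactly is the expected form of the input to rabit.init? Passing the result of slave_envs() to rabit.init evidently does not work, because it expects a list. Should we convert each key name to --key, or drop the DMLC prefix and convert to lowercase?
Is there a good way to test correctness? How do we compare two Booster objects? Should we expect distributed training to produce exactly the same result as sequential training?

mrocklin on 2017-02-18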

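You normally do not serialize a DMatrix; it is more of a training-time data holder. I assume the data is passed around and shared by dask (array/dataframe) and then handed to xgboost
- We can explore better ways of passing data than going directly through in-memory arrays, possibly by exposing a data iterator to xgboost
You can pickle a Booster, as long as xgboost is installed on both sides.
Sorry for not elaborating on how things are passed. It should be

rabit.init(['DMLC_KEY1=VALUE1', 'DMLC_KEY2=VALUE2'])

Normally the boosters from distributed and single-machine training are not the same, but here are a few things to check
- The boosters returned from all of the workers should be identical
- Look at the validation error of the predictions; it should be roughly as low as in the single-machine case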
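
As an illustration of the `rabit.init` input format described above (a list of `KEY=VALUE` strings), here is a minimal sketch that converts a dict of tracker environment variables into that list. The helper name `envs_to_rabit_args` and the example values are hypothetical; only the `KEY=VALUE` list shape comes from the thread. Depending on the xgboost and Python versions, the strings may also need to be encoded to bytes before being passed on.

```python
def envs_to_rabit_args(envs):
    """Turn a dict of tracker environment variables into 'KEY=VALUE' strings."""
    # Sort the keys so the output order is deterministic.
    return ["{}={}".format(key, value) for key, value in sorted(envs.items())]


# Hypothetical values; a real tracker would supply these itself.
example_envs = {
    "DMLC_NUM_WORKER": 4,
    "DMLC_TRACKER_URI": "10.0.0.1",
    "DMLC_TRACKER_PORT": 9091,
}

rabit_args = envs_to_rabit_args(example_envs)
print(rabit_args)
```

The resulting list is what would be handed to `rabit.init` on each worker in place of exporting the variables into `os.environ`.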

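tqchen on 2017-02-18

Two more questions about how this gets used (I have no experience with XGBoost and only a little with machine learning, so please forgive my ignorance).

Is it reasonable to use multiple workers on the same input data? (Is XGBoost compute-bound?)
If we operate on a larger dataset, do I have to do anything special to tell each XGBoost worker that its data differs from that of its peers?

Which use case is more common?

mrocklin on 2017-02-18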

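Each worker should work on a different partition of the data (by rows); the workers should NOT look at the same input data.

If the data is not big enough, the multi-threaded version should do
Each worker collects statistics on its own partition separately and synchronizes with the others

This normally corresponds to the mapPartition operation in frameworks such as spark/flink

Say my dataset has 8 rows and 4 columns. If we start two workers

Worker 0 reads rows 0-3
Worker 1 reads rows 4-7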
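
The row split in this example can be sketched in a few lines of plain Python. The helper `row_shards` is hypothetical and only illustrates contiguous, non-overlapping row ranges per worker; it is not how any particular framework partitions data.

```python
def row_shards(n_rows, n_workers):
    """Return one half-open (start, stop) row range per worker."""
    base, extra = divmod(n_rows, n_workers)
    shards = []
    start = 0
    for rank in range(n_workers):
        # The first `extra` workers take one additional row each.
        stop = start + base + (1 if rank < extra else 0)
        shards.append((start, stop))
        start = stop
    return shards


shards = row_shards(8, 2)
for rank, (start, stop) in enumerate(shards):
    print("worker", rank, "reads rows", start, "to", stop - 1)
```

With 8 rows and 2 workers this yields rows 0 to 3 for worker 0 and rows 4 to 7 for worker 1, matching the description above.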

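tqchen on 2017-02-18

OK, the repository linked above is now a bit cleaner. It would be nice if we could consume results as they are generated on each worker, but we have worked around that for now. Here is the current solution:

Persist the dask array or dataframe on the cluster and wait for it to finish
Find out where each chunk/partition ended up
Tell each worker to concatenate exactly those chunks/partitions and train on them

This solution seems manageable, but it is not ideal. It would be convenient if xgboost-python could accept results as they arrive. However, I think the next thing to do is to try it in practice.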
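
The second and third steps of that solution amount to inverting a "which worker holds which partition" mapping. The following is a minimal sketch under assumed inputs: the mapping shape (partition key to a list of worker addresses), the helper name `group_parts_by_worker`, and all addresses are hypothetical and are not taken from the dask-xgboost code.

```python
from collections import defaultdict


def group_parts_by_worker(who_has):
    """Invert {partition_key: [worker, ...]} into {worker: [partition_key, ...]}.

    A replicated partition is assigned to the first worker listed for it,
    so every partition is trained on exactly once.
    """
    by_worker = defaultdict(list)
    for key in sorted(who_has):
        by_worker[who_has[key][0]].append(key)
    return dict(by_worker)


# Hypothetical placement of four partitions on two workers.
who_has = {
    "part-0": ["tcp://10.0.0.1:4000"],
    "part-1": ["tcp://10.0.0.2:4000"],
    "part-2": ["tcp://10.0.0.1:4000", "tcp://10.0.0.2:4000"],
    "part-3": ["tcp://10.0.0.2:4000"],
}

assignment = group_parts_by_worker(who_has)
for worker, parts in sorted(assignment.items()):
    print(worker, "concatenates and trains on", parts)
```

Each worker would then concatenate its own list of partitions into a single array or dataframe before building its DMatrix.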

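I'm going to look around the internet for examples. If anyone happens to have an artificial problem that I can easily generate with the numpy or pandas APIs, that would be welcome. Until then, here is a trivial example on my laptop with random data: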

In [1]: import dask.dataframe as dd

In [2]: df = dd.demo.make_timeseries('2000', '2001', {'x': float, 'y': float, 'z': int}, freq='1s', partition_freq=
   ...: '1D')  # some random time series data

In [3]: df.head()
Out[3]: 
                            x         y     z
2000-01-01 00:00:00  0.778864  0.824796   977
2000-01-01 00:00:01 -0.019888 -0.173454  1023
2000-01-01 00:00:02  0.552826  0.051995  1083
2000-01-01 00:00:03 -0.761811  0.780124   959
2000-01-01 00:00:04 -0.643525  0.679375   980

In [4]: labels = df.z > 1000

In [5]: del df['z']

In [6]: df.head()
Out[6]: 
                            x         y
2000-01-01 00:00:00  0.778864  0.824796
2000-01-01 00:00:01 -0.019888 -0.173454
2000-01-01 00:00:02  0.552826  0.051995
2000-01-01 00:00:03 -0.761811  0.780124
2000-01-01 00:00:04 -0.643525  0.679375

In [7]: labels.head()
Out[7]: 
2000-01-01 00:00:00    False
2000-01-01 00:00:01     True
2000-01-01 00:00:02     True
2000-01-01 00:00:03    False
2000-01-01 00:00:04    False
Name: z, dtype: bool

In [8]: from dask.distributed import Client

In [9]: c = Client()  # creates a local "cluster" on my laptop

In [10]: from dask_xgboost import train
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [11]: param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}  # taken from example

In [12]: bst = train(c, param, df, labels)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'
[14:46:20] Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'

In [13]: bst
Out[13]: <xgboost.core.Booster at 0x7fbaacfd17b8>

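mrocklin on 2017-02-18

If anyone wants to take a look, the relevant code is here: https://github.com/mrocklin/dask-xgboost/blob/master/dask_xgboost/core.py

As I said, I'm new to XGBoost, so I may be missing things.

mrocklin on 2017-02-18

A typical toy example is at https://github.com/dmlc/xgboost/tree/master/demo/data
It is in libsvm format, though, so it needs a bit of parsing to get it into numpy

tqchen on 2017-02-18

Anything larger (for which you would actually need a cluster)? Or is there a standard way to generate a dataset of arbitrary size?

mrocklin on 2017-02-18

Or, perhaps a better question: "What would you (or anyone else reading this issue) like to see here?"

mrocklin on 2017-02-18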

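Now building out predict. If I move the model back to a worker (via the pickle/unpickle process) and then call bst.predict on some data, I get the following error:

Doing rabit call after Finalize

My assumption was that at this point the model is self-contained and no longer needs rabit. It seems to work fine on the client machine. Any thoughts on why I might get this error when calling predict?

mrocklin on 2017-02-18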

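Some parts of predict still use rabit, mainly because the predictor still uses the learner, which shares some initialization routines with training. Eventually this should be fixed, but that is the case for now.

tqchen on 2017-02-18

I think as long as it works on a generic dataset, it is an interesting starting point.

There are reasons to use a cluster for medium-sized data anyway (ease of scheduling in a cluster environment), and some pyspark users might be interested in trying it out if we advertise it a bit

Testing on datasets where it really matters is hard (for example, trying a dataset with 1 billion rows). Kaggle may have some relevant large datasets, at around 10 million rows.

tqchen on 2017-02-18

This repository shows experiments against the airlines dataset, which I think has tens of millions of rows and tens of columns (thousands after one-hot encoding?). For their benchmark it looks like they took a sample of 100k rows and artificially generated larger datasets from that sample. We could presumably scale this up if necessary.

Here is an example using this data with pandas and xgboost on a single core. Any recommendations on data preparation, parameters, or how to do this properly would be welcome.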

In [1]: import pandas as pd

In [2]: df = pd.read_csv('train-0.1m.csv')

In [3]: df.head()
Out[3]: 
  Month DayofMonth DayOfWeek  DepTime UniqueCarrier Origin Dest  Distance  \
0   c-8       c-21       c-7     1934            AA    ATL  DFW       732   
1   c-4       c-20       c-3     1548            US    PIT  MCO       834   
2   c-9        c-2       c-5     1422            XE    RDU  CLE       416   
3  c-11       c-25       c-6     1015            OO    DEN  MEM       872   
4  c-10        c-7       c-6     1828            WN    MDW  OMA       423   

  dep_delayed_15min  
0                 N  
1                 N  
2                 N  
3                 N  
4                 Y  

In [4]: labels = df.dep_delayed_15min == 'Y'

In [5]: del df['dep_delayed_15min']

In [6]: df = pd.get_dummies(df)

In [7]: len(df.columns)
Out[7]: 652

In [8]: import xgboost as xgb
/home/mrocklin/Software/anaconda/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [9]: dtrain = xgb.DMatrix(df, label=labels)

In [10]: param = {}  # Are there better choices for parameters?  I could use help here

In [11]: bst = xgb.train(param, dtrain)  # or other parameters here?
[17:50:28] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 124 extra nodes, 0 pruned nodes, max_depth=6
[17:50:30] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[17:50:32] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[17:50:33] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 116 extra nodes, 0 pruned nodes, max_depth=6
[17:50:35] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 112 extra nodes, 0 pruned nodes, max_depth=6
[17:50:36] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 114 extra nodes, 0 pruned nodes, max_depth=6
[17:50:38] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 106 extra nodes, 0 pruned nodes, max_depth=6
[17:50:39] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 116 extra nodes, 0 pruned nodes, max_depth=6
[17:50:41] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 104 extra nodes, 0 pruned nodes, max_depth=6
[17:50:43] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 100 extra nodes, 0 pruned nodes, max_depth=6

In [12]: test = pd.read_csv('test.csv')

In [13]: test.head()
Out[13]: 
  Month DayofMonth DayOfWeek  DepTime UniqueCarrier Origin Dest  Distance  \
0   c-7       c-25       c-3      615            YV    MRY  PHX       598   
1   c-4       c-17       c-2      739            WN    LAS  HOU      1235   
2  c-12        c-2       c-7      651            MQ    GSP  ORD       577   
3   c-3       c-25       c-7     1614            WN    BWI  MHT       377   
4   c-6        c-6       c-3     1505            UA    ORD  STL       258   

  dep_delayed_15min  
0                 N  
1                 N  
2                 N  
3                 N  
4                 Y  

In [14]: test_labels = test.dep_delayed_15min == 'Y'

In [16]: del test['dep_delayed_15min']

In [17]: test = pd.get_dummies(test)

In [18]: len(test.columns)  # oops, looks like the columns don't match up
Out[18]: 670

In [19]: dtest = xgb.DMatrix(test)

In [20]: predictions = bst.predict(dtest)  # this fails because of mismatched columns
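
One common way to avoid the column mismatch seen in the session above is to reindex the one-hot encoded test frame onto the training columns. This is a minimal sketch on made-up toy frames, not something done in the thread; categories absent from the test data are filled with 0 and categories never seen in training are dropped.

```python
import pandas as pd

# Toy frames with different sets of categories (made-up values).
train = pd.DataFrame({"Origin": ["ATL", "PIT", "RDU"], "Distance": [732, 834, 416]})
test = pd.DataFrame({"Origin": ["ATL", "MRY"], "Distance": [598, 1235]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# Match the training layout: same columns, same order.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(list(test_aligned.columns))
```

Dropping unseen categories loses information about those rows, so whether this is acceptable depends on the problem.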

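Anyway, that's one option. The airlines dataset seems to be well known and can be inconveniently large in practice. Again, machine learning is not my specialty, so I don't know whether this is appropriate.

cc @TomAugspurger, who seems like the kind of person who might have thoughts on this.

mrocklin on 2017-02-18

Regarding Dask and predict, I can always set up rabit again. This feels a little unclean because it forces evaluation rather than keeping things lazy. But it is not a serious blocker.

mrocklin on 2017-02-18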

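Running into some issues with predict. Two questions:

Can I call Booster.predict multiple times within the same rabit session?
Can I call rabit.init, Booster.predict, and rabit.finalize on different threads?

Currently I create a new tracker and call rabit.init on the worker's main thread. This works fine. However, when I call Booster.predict in a worker thread (each dask worker maintains a thread pool for computation), I get errors like Doing rabit call after Finalize. Any recommendations?

mrocklin on 2017-02-19

"Some parts of predict still use rabit, mainly because the predictor still uses the learner, which shares some initialization routines with training. Eventually this should be fixed, but that is the case for now."

I'm curious about this. After we serialize, transfer, and deserialize the trained model from a worker to my client machine, it seems to work fine on normal data even though there is no rabit network. It seems that a model trained with rabit can be used to predict on data without rabit. This also seems necessary in production. Can you say more about the constraints on using a rabit-trained model?

mrocklin on 2017-02-20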

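"Example dataset / problem
Assuming that I have everything above correct, is there a standard distributed training example that people use for demonstration?"

It would be great to reproduce the results of this experiment:

https://github.com/Microsoft/LightGBM/wiki/Experiments#parallel-experiment

With the new binning + fast hist option of XGBoost (#1950), it should be possible to get similar results.

ogrisel on 2017-02-20

"A typical toy example is at https://github.com/dmlc/xgboost/tree/master/demo/data
It is in libsvm format, though, so it needs a bit of parsing to get it into numpy"

You might be interested in this PR in sklearn: https://github.com/scikit-learn/scikit-learn/pull/935

ogrisel on 2017-02-20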

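@mrocklin There is no restriction on reusing the model, so a model trained in the distributed version can be used in the serial version. It is just that the predictor (when compiled with rabit) currently has functionality mixed in with the training function, so the rabit call happens.

Now that you say so, I think we may have a solution to the problem. Simply call rabit.init before predict (passing in nothing, so that the predictor thinks it is the only worker); that should solve the problem.

tqchen on 2017-02-20

Yes, that does indeed solve the problem. dask-xgboost now supports predict: https://github.com/mrocklin/dask-xgboost/commit/827a03d96977cda8d104899c9f42f52dac446165

Thanks for the workaround, @tqchen!

mrocklin on 2017-02-20

Here is a workflow using dask.dataframe and xgboost on a small sample of the airlines dataset on my local laptop. Does this look OK to everyone? Are there API elements of XGBoost that I'm missing here?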

In [1]: import dask.dataframe as dd

In [2]: import dask_xgboost as dxgb

In [3]: df = dd.read_csv('train-0.1m.csv')

In [4]: df.head()
Out[4]: 
  Month DayofMonth DayOfWeek  DepTime UniqueCarrier Origin Dest  Distance  \
0   c-8       c-21       c-7     1934            AA    ATL  DFW       732   
1   c-4       c-20       c-3     1548            US    PIT  MCO       834   
2   c-9        c-2       c-5     1422            XE    RDU  CLE       416   
3  c-11       c-25       c-6     1015            OO    DEN  MEM       872   
4  c-10        c-7       c-6     1828            WN    MDW  OMA       423   

  dep_delayed_15min  
0                 N  
1                 N  
2                 N  
3                 N  
4                 Y  

In [5]: labels = df.dep_delayed_15min == 'Y'

In [6]: del df['dep_delayed_15min']

In [7]: df = df.categorize()

In [8]: df = dd.get_dummies(df)

In [9]: data_train, data_test = df.random_split([0.9, 0.1], random_state=123)

In [10]: labels_train, labels_test = labels.random_split([0.9, 0.1], random_state=123)

In [11]: from dask.distributed import Client

In [12]: client = Client()  # in a large-data situation I probably should have done this before calling categorize above (which requires computation)

In [13]: param = {}  # Are there better choices for parameters?

In [14]: bst = dxgb.train(client, {}, data_train, labels_train)
[14:00:46] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:48] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:50] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 122 extra nodes, 0 pruned nodes, max_depth=6
[14:00:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:00:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 120 extra nodes, 0 pruned nodes, max_depth=6
[14:00:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 114 extra nodes, 0 pruned nodes, max_depth=6
[14:00:59] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:01:01] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 118 extra nodes, 0 pruned nodes, max_depth=6
[14:01:04] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 94 extra nodes, 0 pruned nodes, max_depth=6
[14:01:06] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 102 extra nodes, 0 pruned nodes, max_depth=6

In [15]: bst
Out[15]: <xgboost.core.Booster at 0x7f689803af60>

In [16]: predictions = dxgb.predict(client, bst, data_test)

In [17]: predictions
Out[17]: 
Dask Series Structure:
npartitions=1
None    float32
None        ...
Name: predictions, dtype: float32
Dask Name: _predict_part, 9 tasks

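mrocklin on 2017-02-20

My short-term goal is to write a short blog post about this, in the hope that someone else with more XGBoost experience and more time will adopt this project and push it forward. (Like everyone else here, I am working on a few other similar projects at the same time.)

I'm partial to the airlines dataset only because I already have it in an S3 bucket. I agree, though, that the Criteo dataset would make for a better large-scale demonstration.

I'm still not sure which parameters to use or how to judge the results. For parameters I can use @szilard's experiments here. Is there a good way to judge the predictions? For example, are we looking for predictions > 0.5 to match labels_test?

mrocklin on 2017-02-20

Probably the most common way to evaluate predictive performance for binary classification (especially in research or competition settings) is the area under the ROC curve (AUC), although in real-world applications one should use metrics that are aligned with the "business" value produced by using the model.

szilard on 2017-02-20

"For example, are we looking for predictions > 0.5 to match labels_test?"

Yes. If you take the mean of that over the test set, you get the test accuracy. However, the dataset is likely to be imbalanced (far more non-clicks than clicks). In that case the ROC AUC score is a better metric.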

from sklearn.metrics import roc_auc_score
print(roc_auc_score(labels_test, predictions))

This assumes that predictions is a 1D array of the positive-class probabilities estimated by the model for each row in the test set.
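
To make the difference between the two metrics concrete, here is a small pure-Python sketch on made-up, imbalanced data. The helpers `accuracy` and `roc_auc` are illustrative re-implementations written for this example (the AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counted as half); in practice the `roc_auc_score` call shown above would be used.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows where (score > threshold) agrees with the label."""
    hits = sum((s > threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 2 positives out of 10 rows.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
constant = [0.1] * 10  # an uninformative model
ranked = [0.1, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.4, 0.45]  # ranks positives on top

print(accuracy(labels, constant), roc_auc(labels, constant))
print(accuracy(labels, ranked), roc_auc(labels, ranked))
```

Both score vectors have the same accuracy at the 0.5 threshold, because neither ever predicts the positive class, but only the second one ranks the positives above the negatives, which is what AUC measures.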

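ogrisel on 2017-02-20

@mrocklin One follow-up question: does dask allow multi-threaded worker jobs? I know this is not very relevant for python because of the GIL. But xgboost can do multi-threaded training per worker while still coordinating with the other workers in a distributed fashion. We should always set xgboost's nthread argument to the number of cores available to that worker.

tqchen on 2017-02-20

The short answer is "yes". Most uses of Dask are with projects like NumPy, Pandas, and SKLearn, which are mostly just C and Fortran code wrapped in Python. The GIL does not affect these libraries. Some people do use Dask for applications similar to PySpark RDDs (see dask.bag) and are affected. That group is in the minority, though.

So yes, Dask allows multi-threaded tasks. How do we tell XGBoost to use multiple threads? In my experiments so far I have seen high CPU utilization without changing any parameters, so perhaps everything works well by default?

mrocklin on 2017-02-20

XGBoost uses multiple threads by default and, if nthread is not set, will use all available CPU threads on the machine (rather than only those of that worker). This can cause contention for cores when multiple workers are assigned to the same machine.

So it is always good to set the nthread parameter to the maximum number of cores the worker is allowed to use. Usually a good practice is to use around 4 threads per worker.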
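
A minimal sketch of that advice: derive `nthread` from the number of workers sharing a machine instead of leaving it unset. The helper `threads_per_worker` and the 16-core, 4-worker machine are hypothetical; `nthread` and `objective` are the xgboost parameters already mentioned in this thread.

```python
import os


def threads_per_worker(workers_on_machine, total_cores=None):
    """Split a machine's cores evenly among its workers (at least 1 each)."""
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    return max(1, total_cores // workers_on_machine)


# Hypothetical machine: 16 cores shared by 4 workers.
param = {
    "objective": "binary:logistic",
    "nthread": threads_per_worker(4, total_cores=16),
}
print(param["nthread"])
```

Passing `total_cores` explicitly matters when a worker is restricted to fewer cores than the machine reports, for example inside a container.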

tqchen on 2017-02-21

Sure, this should be in
https://github.com/mrocklin/dask-xgboost/commit/c22d066b67c78710d5ad99b8620edc55182adc8f

mrocklin on 2017-02-21

Notebook: https://gist.github.com/19c89d78e34437e061876a9872f4d2df
Short screencast (six minutes): https://youtu.be/Cc4E-PdDSro

Critical feedback is very welcome. Again, please forgive my ignorance in this field.

mrocklin on 2017-02-21

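@mrocklin Great demo! I think the runtime performance (and possibly the memory usage) could be greatly improved by using 'tree_method': 'hist', 'grow_policy': 'lossguide' in the param dict.

ogrisel on 2017-02-21

Thanks @ogrisel. With those parameters the training time goes from six minutes to one minute. Memory usage seems to stay about the same.

mrocklin on 2017-02-21

OK, coming back to this. Are there any XGBoost operations other than train and predict that we should implement?

@tqchen or @ogrisel, if either of you has time to look over the implementation at https://github.com/mrocklin/dask-xgboost/blob/master/dask_xgboost/core.py, I would appreciate it. I understand, though, that reviewing a foreign codebase is not always high on the priority list.

If all is well, I'll add a bit more to the README, publish to PyPI, and then we can close this issue.

mrocklin on 2017-02-27

I think only train and predict need to be distributed. Other things do not have to be distributed, since they do not rely on the dataset.

tqchen on 2017-02-27

I've pushed dask-xgboost to PyPI and moved it to https://github.com/dask/dask-xgboost

Thanks @tqchen and @ogrisel for your help here. Collaboration made this relatively easy.

I'm happy to help people if they want to run benchmarks. Until then, closing.

mrocklin on 2017-02-27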
