# Xgboost: One step update of python's xgboost.Booster fails with a segfault

Created on 22 Mar 2017 · 3 Comments · Source: dmlc/xgboost

I was experimenting with one-step incremental XGBoost ensemble construction. When the Booster is created by the `xgboost.train` function (Python library), everything seems to work fine.
However, when I create the booster and then update it with `Booster.update` like this:
```python
booster_ = xgboost.Booster({'objective': 'reg:linear'})
booster_.update(dtrain, 1)
```

the Python process fails with a segmentation fault.

## Environment info
Operating System:
* **python 3.6** Mac OS X 10.10.5 (Darwin 14.5.0), Ubuntu 14.04.5 LTS (GNU/Linux 3.19.0-25-generic x86_64);
* **python 2.7** Mac OS X 10.10.6 (Darwin 15.6.0);

Compiler:
* **python 3.6** used `pip install xgboost`;
* **python 2.7** gcc (6.3.0 --without-multilib);

`xgboost` version used:
* **python 3.6** version 0.6 from pip;
* **python 2.7.13** git HEAD 4a63f4ab43480adaaf13bde2485d5bfedd952520;

## Steps to reproduce
```python
import xgboost

dtrain = xgboost.DMatrix(data=[[-1.0], [0.0], [1.0]], label=[0.0, -1.0, 1.0])

booster_ = xgboost.Booster({'objective': 'reg:linear', 'max_depth': 1})
booster_.update(dtrain, 1)

booster_.update(dtrain, 1)
```

The last line causes a segmentation fault. I attach the crash report for Python 2.7.13.


## All 3 comments

I would like to clarify what I expect from a one-step incremental update of the empty Booster object returned by

```python
booster_ = xgboost.Booster({'objective': 'reg:linear', 'max_depth': 1})
```

According to the general gradient boosting algorithm, I expected that after updating on the sample (X, y) in the DMatrix dtrain with

```python
booster_.update(dtrain, 1)
```

the empty booster would become either f_0(x), the constant prediction, or f_1(x), the prediction after one step of gradient boosting, as shown below (from Hastie, Tibshirani, Friedman; 2013, 10th ed., page 361).

*[screenshot of the generic gradient tree boosting algorithm from the book]*
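For reference, these are the steps I had in mind, in standard gradient tree boosting notation (L is the loss; R_{j1} and γ_{j1} are the terminal regions and leaf values of the first fitted tree):

```latex
% Initialization: the best constant prediction
f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)

% One boosting step (m = 1): fit a tree to the pseudo-residuals ...
r_{i1} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_0}

% ... and add its leaf predictions to the model
f_1(x) = f_0(x) + \sum_{j=1}^{J_1} \gamma_{j1}\,\mathbf{1}(x \in R_{j1})
```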

I managed to trace the problem down to an empty RegTree::FVec.data vector in the provided reproducing example. The (manual) trace of the second call to booster_.update fails here, at include/xgboost/tree_model.h#L528:

```cpp
..., feat.fvalue(split_index), feat.is_missing(split_index), ...
```

It seems that feat.data.size() == 0. I don't know why this vector is still empty after the first call to .update(), yet non-empty after the alternative call to .train().
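For comparison, here is a sketch of the working path mentioned at the top of the issue: building the same one-tree model via `xgboost.train` instead of updating a bare Booster, on the same toy data.

```python
import xgboost

dtrain = xgboost.DMatrix(data=[[-1.0], [0.0], [1.0]], label=[0.0, -1.0, 1.0])

# xgboost.train constructs the Booster with dtrain in its cache, so the
# internal booster is fully initialized from the training data.
booster_ = xgboost.train({'objective': 'reg:linear', 'max_depth': 1},
                         dtrain, num_boost_round=1)

booster_.update(dtrain, 1)  # further one-step updates succeed on this path
```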

I found out what the root of the problem is. It turns out that creating an empty booster by calling

```python
booster_ = xgboost.Booster({'objective': 'reg:linear'})
```

only partially initializes the GBTree booster. In particular, the crucial parameter `num_feature` defaults to 0 and is not updated to the proper number of features by a subsequent .update() call.

However, passing an explicit value for `num_feature` resolves the segmentation fault:

```python
booster_ = xgboost.Booster({'objective': 'reg:linear', 'num_feature': dtrain.num_col()})
```

I think that xgboost.Booster() should issue a warning if either cache=() is empty or 'num_feature' is not explicitly set in the params argument.
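Putting it together, a minimal end-to-end sketch of the workaround on the toy data from the reproduction above (`num_col()` is the DMatrix method returning the number of columns):

```python
import xgboost

dtrain = xgboost.DMatrix(data=[[-1.0], [0.0], [1.0]], label=[0.0, -1.0, 1.0])

# Passing num_feature explicitly fully initializes the GBTree booster,
# so one-step incremental updates no longer segfault.
booster_ = xgboost.Booster({'objective': 'reg:linear',
                            'max_depth': 1,
                            'num_feature': dtrain.num_col()})

booster_.update(dtrain, 1)
booster_.update(dtrain, 1)  # previously the crashing call

print(booster_.predict(dtrain))
```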
