Xgboost: Approach (documentation) ambiguity

Created on 25 Nov 2015  ·  3 Comments  ·  Source: dmlc/xgboost

Hi, I'm trying to understand xgboost's approach to model training. By that I mean the following: I can either use a Booster and invoke xgb.train, or I can use the sklearn API's Classifier/Regressor. The problem is that in the former approach (according to the xgboost documentation) I need to specify the number of boosting iterations and do not specify any number of trees, while in the latter I do specify the number of boosted trees to fit but have no option to specify the number of boosting iterations. Is the number of boosting iterations the same as the number of estimators? Can someone clarify this?

If I read the code correctly, the sklearn API uses an internal booster, so I can assign the result of xgb.train to clf._Booster, where n_estimators is equal to the number of boosting rounds. And I don't need to call clf.fit if such an assignment has been made. Am I right? If so, can the documentation be adjusted to say explicitly that n_estimators is what num_rounds is for xgb.train?
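For concreteness, here is a minimal sketch of the two interfaces side by side. The data and parameter values are made up for illustration, and I'm assuming the mapping I describe above (n_estimators maps to the number of boosting rounds) holds:

```python
import numpy as np
import xgboost as xgb

# Toy data, just to make the snippet runnable.
X = np.random.rand(100, 5)
y = np.random.randint(2, size=100)

# Low-level interface: the number of boosting iterations is explicit.
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
bst = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=50)

# sklearn interface: n_estimators appears to play the same role.
clf = xgb.XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
clf.fit(X, y)
```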

It would be nice to have documentation that explains exactly the relationship between the low-level Booster approach and the sklearn API. I understand that the latter calls the former, but it would be nice to understand how this is done and to clarify it in the documentation, to avoid confusion for end users.

All 3 comments

Hi Vladimir,

Boosting, in general, is a meta-algorithm that iteratively (or in "rounds") trains a sequence of simple/weak learners (or estimators) in such a way that the whole combination performs better. There can be various kinds of weak learners, not only trees; e.g., xgboost offers both tree and (generalized) linear model options. Curiously, these weak learners in xgboost are referred to as "boosters", and the whole combined model is referred to as a "learner". That perplexed me a bit when I was initially reading xgboost's docs and code, as it was somewhat the reverse of my own mental imprint of the terminology.
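To make the terminology concrete, a quick illustration using the Python package (made-up data; default objective): the kind of weak learner is selected via the booster parameter.

```python
import numpy as np
import xgboost as xgb

# Made-up regression data, for illustration only.
X = np.random.rand(100, 5)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)

# "booster" chooses the weak learner combined across rounds:
tree_model = xgb.train({"booster": "gbtree", "max_depth": 2},
                       dtrain, num_boost_round=10)
linear_model = xgb.train({"booster": "gblinear"},
                         dtrain, num_boost_round=10)
```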

While it would be nice to have consistent nomenclature between different projects covering the same subject matter, and to have consistent, high-quality, cross-linked documentation, that's not always feasible or easily maintainable, especially in open-source projects. Novice users tend to stick to a single interface. Users who are well familiar with the subject matter can easily map the commonalities. And whoever cares about low-level details usually cares enough to read the code.

As for your specific question, you are reading the code right: the n_estimators parameter in the sklearn wrapper code [1] maps to num_boost_round within train [2]. However, why would you as an end-user want to hack a sklearn wrapper object that way? To do it in a controllable manner, you would have to get to know the wrapper code rather well.

[1] https://github.com/dmlc/xgboost/blob/2859c190cd0d168df25e2a7ea2b1fd5211ce94f0/python-package/xgboost/sklearn.py#L185
[2] https://github.com/dmlc/xgboost/blob/83e61bf99ec7d01607867b4e281da283230883b1/python-package/xgboost/training.py#L12
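If you nevertheless want to graft a low-level Booster onto a wrapper object, a rough sketch of what you described might look like the following. Note that _Booster is a private attribute, so this relies on wrapper internals; anything else that fit() normally sets up (e.g., the label encoding that predict relies on) would be missing, which is exactly why I would not recommend it:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)
y = np.random.randint(2, size=100)

# Train with the low-level interface...
bst = xgb.train({"objective": "binary:logistic"},
                xgb.DMatrix(X, label=y), num_boost_round=50)

# ...then bypass fit() by assigning the Booster directly.
# Fragile: it skips the rest of the state fit() would initialize.
clf = xgb.XGBClassifier(n_estimators=50)  # n_estimators mirrors num_boost_round
clf._Booster = bst
```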

Are there the same problems for R?

@khotilov, thanks for the explanation. It is sufficient for my needs, and I'll close the ticket.
