Xgboost: Is it necessary to get rid of useless features in a classification task in xgboost?

Created on 14 Mar 2017  ·  3 Comments  ·  Source: dmlc/xgboost

I have 650+ features and I use xgboost to rank their importance. I use get_score(), which returns how many times each feature is used when building the trees.
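
A minimal sketch of that workflow, with synthetic data standing in for the real 650+ features (all names and parameters below are placeholders, not from the original setup):

```python
# A minimal sketch: train a booster and rank features by split count.
# Synthetic data stands in for the real 650+ feature dataset.
import pandas as pd
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=650,
                           n_informative=40, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=200)

# importance_type="weight" counts how many times each feature is used for a
# split; features that are never used are simply absent from the dict.
scores = booster.get_score(importance_type="weight")
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:10])   # most-used features
print(ranked[-10:])  # least-used features among those used at all
```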

When I check the result, there are nearly 150 features whose score is below 20, and some of them are zero. I got rid of these features to see whether the model would perform better, but the result is nearly the same as before. So is it necessary to get rid of these unimportant features?
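
The prune-and-retrain experiment could look roughly like this, reusing X, y and the scores dict from the sketch above; the threshold of 20 comes from the post, everything else is illustrative:

```python
# Sketch of the experiment described above: drop features scored below 20
# (or never used), retrain, and compare validation AUC.
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
keep = [c for c in X.columns if scores.get(c, 0) >= 20]

def auc_with(cols):
    dtr = xgb.DMatrix(X_tr[cols], label=y_tr)
    dva = xgb.DMatrix(X_va[cols], label=y_va)
    bst = xgb.train({"objective": "binary:logistic"}, dtr, num_boost_round=200)
    return roc_auc_score(y_va, bst.predict(dva))

print("all features :", auc_with(list(X.columns)))
print("score >= 20  :", auc_with(keep))
```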

Also, in the new experiment the order of feature importance is not the same as before: some features that were not very important became important. In other words, the feature importance ranking is not stable. Why is it not stable?

All 3 comments

In most machine learning algorithms it is important to do feature selection to avoid over-fitting.

In boosting, and especially in xgboost, training is designed to generalize as much as possible (weak learners, Taylor approximation, stochastic learning, ...), which makes it robust to over-fitting (that said, you can over-fit if you really want to :) -> many iterations, high learning rate, no stochastic learning...). To answer your first question, getting rid of irrelevant features can still matter at some point. Say you have outliers: the model might start using those irrelevant features to classify the outliers during the learning process, and that is not good.

For your second question, there isn't only one way of combining the weak learners to achieve good results. However, to obtain different combinations (and different feature importances) you must be using some randomness in your training process, such as subsample, max_features... Another possibility is that you have redundant features; you should check the correlation between them.
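
A minimal sketch of that correlation check, reusing the X DataFrame from the earlier snippets (the 0.95 cut-off is only an illustration):

```python
# Flag feature pairs with very high absolute correlation as likely redundant.
import numpy as np

corr = X.corr().abs()
# keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()                          # drops the masked entries
redundant = pairs[pairs > 0.95].sort_values(ascending=False)
print(redundant.head(20))                      # likely-redundant feature pairs
```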

In summary, if you know a feature is useless, remove it. If you don't know, you can leave it in, but it is always good to remove irrelevant features when you can, as they will slow down the training of your model.

Thanks. I removed the nearly 150 features that are not important, but the AUC in the new experiment did not change, so I have not found any benefit from removing them.

The benefit, if not in the performance measure, will be in training time. Decision trees use a greedy approach to find the best split, so more features = more split candidates to try.
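
A rough way to see that effect, reusing X, y and the keep list from the snippets above:

```python
# Rough sketch of the training-time effect: more columns means more split
# candidates to evaluate at each node.
import time
import xgboost as xgb

def train_seconds(cols):
    dtrain = xgb.DMatrix(X[cols], label=y)
    start = time.perf_counter()
    xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=200)
    return time.perf_counter() - start

print("all 650 features:", round(train_seconds(list(X.columns)), 2), "s")
print("pruned set      :", round(train_seconds(keep), 2), "s")
```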
