Xgboost: feature_names mismatch while using sparse matrices in Python

Created on 31 May 2016 · 51 Comments · Source: dmlc/xgboost

I'm getting ValueError: feature_names mismatch when using xgboost with sparse matrices in Python.
The xgboost version is the latest from git. Older versions don't give this error. The error is raised at prediction time.

code

from scipy import sparse
import xgboost as xgb
from random import randint
# Build a list of n random binary labels (0 or 1).
randBinList = lambda n: [randint(0, 1) for _ in range(n)]

train = sparse.rand(100,500)
test = sparse.rand(10, 500)
y = randBinList(100)
clf = xgb.XGBClassifier()
clf.fit(train,y)
preds = clf.predict_proba(test)

Full traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-15-e03f10289bf1> in <module>()
----> 1 preds = clf.predict_proba(test)

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/sklearn.pyc in predict_proba(self, data, output_margin, ntree_limit)
    471         class_probs = self.booster().predict(test_dmatrix,
    472                                              output_margin=output_margin,
--> 473                                              ntree_limit=ntree_limit)
    474         if self.objective == "multi:softprob":
    475             return class_probs

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59', 'f60', 'f61', 'f62', 'f63', 'f64', 'f65', 'f66', 'f67', 'f68', 'f69', 'f70', 'f71', 'f72', 'f73', 'f74', 'f75', 'f76', 'f77', 'f78', 'f79', 'f80', 'f81', 'f82', 'f83', 'f84', 'f85', 'f86', 'f87', 'f88', 'f89', 'f90', 'f91', 'f92', 'f93', 'f94', 'f95', 'f96', 'f97', 'f98', 'f99', 'f100', 'f101', 'f102', 'f103', 'f104', 'f105', 'f106', 'f107', 'f108', 'f109', 'f110', 'f111', 'f112', 'f113', 'f114', 'f115', 'f116', 'f117', 'f118', 'f119', 'f120', 'f121', 'f122', 'f123', 'f124', 'f125', 'f126', 'f127', 'f128', 'f129', 'f130', 'f131', 'f132', 'f133', 'f134', 'f135', 'f136', 'f137', 'f138', 'f139', 'f140', 'f141', 'f142', 'f143', 'f144', 'f145', 'f146', 'f147', 'f148', 'f149', 'f150', 'f151', 'f152', 'f153', 'f154', 'f155', 'f156', 'f157', 'f158', 'f159', 'f160', 'f161', 'f162', 'f163', 'f164', 'f165', 'f166', 'f167', 'f168', 'f169', 'f170', 'f171', 'f172', 'f173', 'f174', 'f175', 'f176', 'f177', 'f178', 'f179', 'f180', 'f181', 'f182', 'f183', 'f184', 'f185', 'f186', 'f187', 'f188', 'f189', 'f190', 'f191', 'f192', 'f193', 'f194', 'f195', 'f196', 'f197', 'f198', 'f199', 'f200', 'f201', 'f202', 'f203', 'f204', 'f205', 'f206', 'f207', 'f208', 'f209', 'f210', 'f211', 'f212', 'f213', 'f214', 'f215', 'f216', 'f217', 'f218', 'f219', 'f220', 'f221', 'f222', 'f223', 'f224', 'f225', 'f226', 'f227', 'f228', 'f229', 'f230', 'f231', 'f232', 'f233', 'f234', 'f235', 'f236', 'f237', 'f238', 'f239', 'f240', 'f241', 'f242', 'f243', 'f244', 'f245', 'f246', 'f247', 'f248', 'f249', 'f250', 'f251', 'f252', 'f253', 'f254', 'f255', 'f256', 'f257', 'f258', 
'f259', 'f260', 'f261', 'f262', 'f263', 'f264', 'f265', 'f266', 'f267', 'f268', 'f269', 'f270', 'f271', 'f272', 'f273', 'f274', 'f275', 'f276', 'f277', 'f278', 'f279', 'f280', 'f281', 'f282', 'f283', 'f284', 'f285', 'f286', 'f287', 'f288', 'f289', 'f290', 'f291', 'f292', 'f293', 'f294', 'f295', 'f296', 'f297', 'f298', 'f299', 'f300', 'f301', 'f302', 'f303', 'f304', 'f305', 'f306', 'f307', 'f308', 'f309', 'f310', 'f311', 'f312', 'f313', 'f314', 'f315', 'f316', 'f317', 'f318', 'f319', 'f320', 'f321', 'f322', 'f323', 'f324', 'f325', 'f326', 'f327', 'f328', 'f329', 'f330', 'f331', 'f332', 'f333', 'f334', 'f335', 'f336', 'f337', 'f338', 'f339', 'f340', 'f341', 'f342', 'f343', 'f344', 'f345', 'f346', 'f347', 'f348', 'f349', 'f350', 'f351', 'f352', 'f353', 'f354', 'f355', 'f356', 'f357', 'f358', 'f359', 'f360', 'f361', 'f362', 'f363', 'f364', 'f365', 'f366', 'f367', 'f368', 'f369', 'f370', 'f371', 'f372', 'f373', 'f374', 'f375', 'f376', 'f377', 'f378', 'f379', 'f380', 'f381', 'f382', 'f383', 'f384', 'f385', 'f386', 'f387', 'f388', 'f389', 'f390', 'f391', 'f392', 'f393', 'f394', 'f395', 'f396', 'f397', 'f398', 'f399', 'f400', 'f401', 'f402', 'f403', 'f404', 'f405', 'f406', 'f407', 'f408', 'f409', 'f410', 'f411', 'f412', 'f413', 'f414', 'f415', 'f416', 'f417', 'f418', 'f419', 'f420', 'f421', 'f422', 'f423', 'f424', 'f425', 'f426', 'f427', 'f428', 'f429', 'f430', 'f431', 'f432', 'f433', 'f434', 'f435', 'f436', 'f437', 'f438', 'f439', 'f440', 'f441', 'f442', 'f443', 'f444', 'f445', 'f446', 'f447', 'f448', 'f449', 'f450', 'f451', 'f452', 'f453', 'f454', 'f455', 'f456', 'f457', 'f458', 'f459', 'f460', 'f461', 'f462', 'f463', 'f464', 'f465', 'f466', 'f467', 'f468', 'f469', 'f470', 'f471', 'f472', 'f473', 'f474', 'f475', 'f476', 'f477', 'f478', 'f479', 'f480', 'f481', 'f482', 'f483', 'f484', 'f485', 'f486', 'f487', 'f488', 'f489', 'f490', 'f491', 'f492', 'f493', 'f494', 'f495', 'f496', 'f497', 'f498'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 
'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59', 'f60', 'f61', 'f62', 'f63', 'f64', 'f65', 'f66', 'f67', 'f68', 'f69', 'f70', 'f71', 'f72', 'f73', 'f74', 'f75', 'f76', 'f77', 'f78', 'f79', 'f80', 'f81', 'f82', 'f83', 'f84', 'f85', 'f86', 'f87', 'f88', 'f89', 'f90', 'f91', 'f92', 'f93', 'f94', 'f95', 'f96', 'f97', 'f98', 'f99', 'f100', 'f101', 'f102', 'f103', 'f104', 'f105', 'f106', 'f107', 'f108', 'f109', 'f110', 'f111', 'f112', 'f113', 'f114', 'f115', 'f116', 'f117', 'f118', 'f119', 'f120', 'f121', 'f122', 'f123', 'f124', 'f125', 'f126', 'f127', 'f128', 'f129', 'f130', 'f131', 'f132', 'f133', 'f134', 'f135', 'f136', 'f137', 'f138', 'f139', 'f140', 'f141', 'f142', 'f143', 'f144', 'f145', 'f146', 'f147', 'f148', 'f149', 'f150', 'f151', 'f152', 'f153', 'f154', 'f155', 'f156', 'f157', 'f158', 'f159', 'f160', 'f161', 'f162', 'f163', 'f164', 'f165', 'f166', 'f167', 'f168', 'f169', 'f170', 'f171', 'f172', 'f173', 'f174', 'f175', 'f176', 'f177', 'f178', 'f179', 'f180', 'f181', 'f182', 'f183', 'f184', 'f185', 'f186', 'f187', 'f188', 'f189', 'f190', 'f191', 'f192', 'f193', 'f194', 'f195', 'f196', 'f197', 'f198', 'f199', 'f200', 'f201', 'f202', 'f203', 'f204', 'f205', 'f206', 'f207', 'f208', 'f209', 'f210', 'f211', 'f212', 'f213', 'f214', 'f215', 'f216', 'f217', 'f218', 'f219', 'f220', 'f221', 'f222', 'f223', 'f224', 'f225', 'f226', 'f227', 'f228', 'f229', 'f230', 'f231', 'f232', 'f233', 'f234', 'f235', 'f236', 'f237', 'f238', 'f239', 'f240', 'f241', 'f242', 'f243', 'f244', 'f245', 'f246', 'f247', 'f248', 'f249', 'f250', 'f251', 'f252', 'f253', 'f254', 'f255', 'f256', 'f257', 'f258', 'f259', 'f260', 'f261', 'f262', 'f263', 'f264', 'f265', 'f266', 'f267', 'f268', 'f269', 'f270', 'f271', 'f272', 
'f273', 'f274', 'f275', 'f276', 'f277', 'f278', 'f279', 'f280', 'f281', 'f282', 'f283', 'f284', 'f285', 'f286', 'f287', 'f288', 'f289', 'f290', 'f291', 'f292', 'f293', 'f294', 'f295', 'f296', 'f297', 'f298', 'f299', 'f300', 'f301', 'f302', 'f303', 'f304', 'f305', 'f306', 'f307', 'f308', 'f309', 'f310', 'f311', 'f312', 'f313', 'f314', 'f315', 'f316', 'f317', 'f318', 'f319', 'f320', 'f321', 'f322', 'f323', 'f324', 'f325', 'f326', 'f327', 'f328', 'f329', 'f330', 'f331', 'f332', 'f333', 'f334', 'f335', 'f336', 'f337', 'f338', 'f339', 'f340', 'f341', 'f342', 'f343', 'f344', 'f345', 'f346', 'f347', 'f348', 'f349', 'f350', 'f351', 'f352', 'f353', 'f354', 'f355', 'f356', 'f357', 'f358', 'f359', 'f360', 'f361', 'f362', 'f363', 'f364', 'f365', 'f366', 'f367', 'f368', 'f369', 'f370', 'f371', 'f372', 'f373', 'f374', 'f375', 'f376', 'f377', 'f378', 'f379', 'f380', 'f381', 'f382', 'f383', 'f384', 'f385', 'f386', 'f387', 'f388', 'f389', 'f390', 'f391', 'f392', 'f393', 'f394', 'f395', 'f396', 'f397', 'f398', 'f399', 'f400', 'f401', 'f402', 'f403', 'f404', 'f405', 'f406', 'f407', 'f408', 'f409', 'f410', 'f411', 'f412', 'f413', 'f414', 'f415', 'f416', 'f417', 'f418', 'f419', 'f420', 'f421', 'f422', 'f423', 'f424', 'f425', 'f426', 'f427', 'f428', 'f429', 'f430', 'f431', 'f432', 'f433', 'f434', 'f435', 'f436', 'f437', 'f438', 'f439', 'f440', 'f441', 'f442', 'f443', 'f444', 'f445', 'f446', 'f447', 'f448', 'f449', 'f450', 'f451', 'f452', 'f453', 'f454', 'f455', 'f456', 'f457', 'f458', 'f459', 'f460', 'f461', 'f462', 'f463', 'f464', 'f465', 'f466', 'f467', 'f468', 'f469', 'f470', 'f471', 'f472', 'f473', 'f474', 'f475', 'f476', 'f477', 'f478', 'f479', 'f480', 'f481', 'f482', 'f483', 'f484', 'f485', 'f486', 'f487', 'f488', 'f489', 'f490', 'f491', 'f492', 'f493', 'f494', 'f495', 'f496', 'f497', 'f498', 'f499']
training data did not have the following fields: f499

Source

abhishekkrthakur

👍13

Most helpful comment

The problem occurs because DMatrix.num_col() only returns the number of non-zero columns in a sparse matrix. Hence, if both train and test data have the same number of non-zero columns, everything works fine.
Otherwise, you end up with different feature name lists, because the validation function calls:

    @property
    def feature_names(self):
        """Get feature names (column labels).

        Returns
        -------
        feature_names : list or None
        """
        if self._feature_names is None:
            return ['f{0}'.format(i) for i in range(self.num_col())]
        else:
            return self._feature_names

self._feature_names is None for sparse matrices, and because self.num_col() returns only the number of non-zero columns, the validation fails as soon as the number of non-zero columns in the "to-be-predicted" data differs from the number of non-zero columns in the training data.
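
To make the mechanism concrete, here is a minimal plain-Python sketch, not xgboost's actual code. It assumes the column count is inferred as the highest stored column index plus one, which is consistent with the traceback above (f0 to f498 for the training data versus f0 to f499 for the test data). The helper names and the index lists are hypothetical.

```python
def inferred_num_col(col_indices):
    """Column count guessed from stored entries only: highest index + 1 (assumption)."""
    return max(col_indices) + 1 if col_indices else 0


def default_feature_names(num_col):
    """Same naming scheme as the feature_names property quoted above."""
    return ['f{0}'.format(i) for i in range(num_col)]


# Both matrices are meant to be 500 columns wide, but the training data
# happens to have no stored entry in the last column (index 499).
train_cols = [0, 3, 498]
test_cols = [1, 7, 499]

train_names = default_feature_names(inferred_num_col(train_cols))
test_names = default_feature_names(inferred_num_col(test_cols))

missing = sorted(set(test_names) - set(train_names))
print(missing)  # ['f499']
```

Under that assumption, the two name lists differ by exactly the trailing column, which is what the error message reports.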

Dunno yet, where the best spot is to fix that though.

Far0n on 2 Sep 2016

👍24

All 51 comments

It seems that this works only if the sparse matrix is CSC. It doesn't work for CSR or COO matrices like earlier versions.

abhishekkrthakur on 31 May 2016

👍6

Isn't this a random issue that occurs when the right-most column is all 0 or 1? Maybe the same as #1091 and #1221.

sinhrks on 18 Jun 2016

@sinhrks: For me, that's not "random". I frequently train XGBoost on highly sparse data (and it's awesome! It normally beats all other models, and by a pretty wide margin).

Then, once I've got the trained model running in production, I'll, of course, want to make predictions on a new piece of incoming data. That data, of course, is highly likely to be sparse and not have a value for whatever column happens to be the last column. So XGBoost now frequently breaks for me, and I've found myself switching to other (less accurate) models, simply because they've got better support for sparse data.

ClimbsRocks on 23 Aug 2016

👍12

Does anyone know exactly why this error now arises and how to address it? This is a pain point to me as my existing scripts are failing.

bryan-woods on 24 Aug 2016

I am giving xgboost a try as part of an sklearn pipeline and ran into the same issue. Is there a workaround until it's fixed?

EntilZha on 27 Aug 2016

👍2

Yes, when you call predict, use the toarray() function of the sparse array. It is terribly inefficient with memory, but is workable with small slices.
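
A rough back-of-the-envelope sketch of why the dense conversion is so costly. The matrix size and density are hypothetical, and it assumes 4-byte float values and 4-byte integer indices.

```python
# Hypothetical workload: 100,000 rows x 50,000 columns, 0.1% of entries stored.
rows, cols = 100000, 50000
nnz = rows * cols // 1000          # number of stored (non-zero) entries

# Assumption: 4-byte float values and 4-byte integer indices.
dense_bytes = rows * cols * 4
csr_bytes = nnz * (4 + 4) + (rows + 1) * 4   # data + column indices + row pointers

print(dense_bytes / 1e9)   # size of the dense array in GB
print(csr_bytes / 1e6)     # size of the CSR arrays in MB
```

With these made-up numbers the dense array is several hundred times larger than the CSR arrays, which is why toarray() is only workable on small slices.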


bryan-woods on 27 Aug 2016

For some reason the error is not occurring if I save and load trained model:

    bst = xgb.train(param, dtrain, num_round)

    # predict is not working without this code
    bst.save_model(model_file_name)
    bst = xgb.Booster(param)
    bst.load_model(model_file_name)

    preds = bst.predict(dtest)

warpuv on 31 Aug 2016

👍14 ❤3

@bryan-woods I was able to find a better work around with tocsc. There is probably some performance penalty but not nearly as bad as making it a dense matrix.

Including this in my sklearn pipeline right before xgboost worked

class CSCTransformer(TransformerMixin):
    def transform(self, X, y=None, **fit_params):
        return X.tocsc()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

EntilZha on 31 Aug 2016

👍8

Neither CSC format nor adding non-zero entries into the last column fixes the issue in the most recent version of xgboost. Reverting back to version 0.4a30 is the only thing that makes it work for me. Consider the following tweak (with a reproducible seed) on the original example:

>>> import xgboost as xgb
>>> import numpy as np
>>> from scipy import sparse
>>> 
>>> np.random.seed(10)
>>> X = sparse.rand(100,10).tocsr()
>>> test = sparse.rand(10, 500).tocsr()
>>> y = np.random.randint(2,size=100)
>>> 
>>> clf = xgb.XGBClassifier()
>>> clf.fit(X,y)
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
>>> 
>>> try:
...     pred = clf.predict_proba(test)
...     print "Works when csr with version %s" %xgb.__version__
... except ValueError:
...     "Broken when csr with version %s" %xgb.__version__
... 
'Broken when csr with version 0.6'
>>> try:
...     pred = clf.predict_proba(test.tocsc())
...     print "Works when csc with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when csc with version %s" %xgb.__version__
... 
'Still broken when csc with version 0.6'
>>> try:
...     test[0,(test.shape[1]-1)] = 1.0
...     pred = clf.predict_proba(test)
...     print "Works when adding non-zero entries to last column with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when adding non-zero entries to last column with version %s" %xgb.__version__
... 
/home/david.mcgarry/.conda/envs/ml/lib/python2.7/site-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
'Still broken when adding non-zero entries to last column with version 0.6'

>>> import xgboost as xgb
>>> import numpy as np
>>> from scipy import sparse
>>> 
>>> np.random.seed(10)
>>> X = sparse.rand(100,10).tocsr()
>>> test = sparse.rand(10, 500).tocsr()
>>> y = np.random.randint(2,size=100)
>>> 
>>> clf = xgb.XGBClassifier()
>>> clf.fit(X,y)
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
>>> 
>>> try:
...     pred = clf.predict_proba(test)
...     print "Works when csr with version %s" %xgb.__version__
... except ValueError:
...     "Broken when csr with version %s" %xgb.__version__
... 
Works when csr with version 0.4
>>> try:
...     pred = clf.predict_proba(test.tocsc())
...     print "Works when csc with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when csc with version %s" %xgb.__version__
... 
Works when csc with version 0.4
>>> try:
...     test[0,(test.shape[1]-1)] = 1.0
...     pred = clf.predict_proba(test)
...     print "Works when adding non-zero entries to last column with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when adding non-zero entries to last column with version %s" %xgb.__version__
... 
/Users/david.mcgarry/anaconda/envs/ml/lib/python2.7/site-packages/scipy/sparse/compressed.py:739: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
Works when adding non-zero entries to last column with version 0.4

dmcgarry on 1 Sep 2016

👍7 🎉2

Same issue here; something definitely got broken in the last release. I did not have this issue before with the same dataset and processing. I might be wrong, but it looks like there are currently no unit tests with sparse CSR arrays in Python using the sklearn API. Would it be possible to add the @dmcgarry example above to tests/python/tests_with_sklearn.py?

rth on 1 Sep 2016

👍4

I tried working around it using .toarray() with CSR sparse arrays, but something is seriously broken. If I load a saved model and try using it to make predictions with .toarray(), I don't get an error message but the results are incorrect. I rolled back to 0.4a30 and it works fine. I haven't had the time to chase down the root cause, but it's not good.

bryan-woods on 1 Sep 2016

👍1

I'm also afraid that there is a fundamental problem with sparse matrix handling, because of what @bryan-woods reported: let's say we have x zero-columns both in train and in test, but with different indices => there will be no error, because "feature_names(self)" returns the same feature list for both sets, but the predictions might be wrong, due to non-matching non-zero column indices between train and test.
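
This concern can be illustrated with a toy model in plain Python. It assumes, as the comment does, that only non-zero columns are counted when default names are generated; that is the commenter's hypothesis, not confirmed xgboost behavior, and the helper names are made up.

```python
def nonzero_columns(rows, width):
    """Indices of columns that contain at least one non-zero value."""
    return [j for j in range(width) if any(row[j] != 0 for row in rows)]


def default_names(count):
    return ['f{0}'.format(i) for i in range(count)]


train = [[1, 0, 2], [3, 0, 4]]   # column 1 is all zeros
test = [[0, 5, 6], [0, 7, 8]]    # column 0 is all zeros

train_nz = nonzero_columns(train, 3)   # [0, 2]
test_nz = nonzero_columns(test, 3)     # [1, 2]

names_match = default_names(len(train_nz)) == default_names(len(test_nz))
columns_match = train_nz == test_nz
print(names_match, columns_match)  # True False
```

In this model the name check passes even though 'f0' refers to a different original column in each set, so a name-based validation alone would not catch the misalignment.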

Far0n on 2 Sep 2016

👍2

Has anyone worked on this issue? Does anyone at least have a unit test developed that we could use to develop against?

bryan-woods on 16 Sep 2016

👍1

I have not worked on it, but @dmcgarry's example above could be used as the beginning of a unit test, I think:

import xgboost as xgb
import numpy as np
import scipy.sparse


def test_xgbclassifier_sklearn_sparse():
    np.random.seed(10)
    X = scipy.sparse.rand(100,10).tocsr()
    test = scipy.sparse.rand(10, 500).tocsr()
    y = np.random.randint(2,size=100)

    clf = xgb.XGBClassifier()
    clf.fit(X,y)
    pred = clf.predict_proba(test)

rth on 16 Sep 2016

I created a couple new sparse array tests in my fork of the repo. For those who are interested:
https://github.com/bryan-woods/xgboost/blob/sparse_test/tests/python/test_scipy_sparse.py

To run the tests from the root directory of the checkout:
python -m nose tests/python/test_scipy_sparse.py

You'll notice that both tests fail. This at least will provide a test to develop against.

bryan-woods on 16 Sep 2016

👍3

I'm also experiencing this issue, but I am not able to figure out the best way to work around it until it is finally fixed in the library.

vallettea on 17 Sep 2016

move to https://github.com/dmlc/xgboost/issues/1583

tqchen on 17 Sep 2016

you could add a feature to your feature list with a max feature index, such as maxid:0
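
A minimal sketch of that suggestion for LIBSVM-style text rows: pad a row with an explicit zero at the highest feature index so the full width appears in the file. The helper `to_libsvm_line` is hypothetical, and whether a given parser keeps an explicit zero as a stored entry is an assumption to verify against your xgboost version.

```python
def to_libsvm_line(label, features, max_index=None):
    """Format one row as 'label idx:val ...'; optionally pad with 'max_index:0'."""
    items = dict(features)
    if max_index is not None and max_index not in items:
        items[max_index] = 0   # explicit zero marks the last column as present
    body = ' '.join('{0}:{1}'.format(i, items[i]) for i in sorted(items))
    return '{0} {1}'.format(label, body)


line = to_libsvm_line(1, {0: 0.5, 7: 1.2}, max_index=499)
print(line)  # 1 0:0.5 7:1.2 499:0
```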

bihujrj on 8 Nov 2016

👍1

passing a dataframe solved the issue for me

nazirmubbashir on 23 Nov 2016

how can I revert to version 0.4?

dfernandez22 on 24 Nov 2016

pip install --upgrade xgboost==0.4a30

dmcgarry on 24 Nov 2016

👍7

All types of sparse matrices didn't work for me (I'm working with tf-idf data). I had to revert to previous version. Thanks for the tip!

ad-owens on 27 Nov 2016

👍2

All of you who are still having issues: does the code that you are using include the fixes in #1606 ?

khotilov on 29 Nov 2016

Yes, I've installed the latest version of xgboost and I'm still having this issue.

ivihrov on 30 Nov 2016

This still exists and is easily reproducible. If you use a large enough data set this is less likely to occur, but if you are wrapping this in a grid search object, it occurs almost certainly within a c.v. split where the features available in the train/c.v. test sets differ.

Honestly, I don't really understand why DMatrix ignores the shape hint that is provided by scipy sparse matrices. It should set the size based on that information instead of calculating it.

l3link on 1 Dec 2016

I am using the Xgboost Python native API (0.6) and I've got the same error when loading a DMatrix from a LIBSVM [sparse] format file if none of the rows has the last column defined. My workaround was to define a dummy column in the first row :(

train_fv_file = 'train_fv_eval.svm'
dtrain = xgb.DMatrix(train_fv_file, feature_names=feature_vector_labels, feature_types=feature_vector_types)

gabrielspmoreira on 2 Dec 2016

If it is so easy to reproduce, does anyone care to provide a reproducible example? Preferably, without the sklearn layer (to isolate a possible cause).

@gabrielspmoreira : I see your point about loading from a LIBSVM file that has completely sparse last few columns... That DMatrix construction method would benefit from having a num_col hint as well.

khotilov on 6 Dec 2016

In [42]: matrix = xgboost.DMatrix(scipy.sparse.csr_matrix([[0, 2, 3, 0], [0, 2, 2, 0], [1, 0, 5, 0], [0, 1, 0, 0]], shape=(4,4)))
In [43]: matrix.num_col()
Out[43]: 3L

Anytime a new DMatrix is created on a subsample of rows/columns there is a possibility that this happens (the number of columns gets reduced even though we explicitly told DMatrix how many columns there are). This happens often for smaller data sets or very very sparse columns because it is more likely a subset will be all zeroes.

Once this happens between a train/test set, the model cannot evaluate the test set because it expects a different number of features and spits out a ValueError.
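
A small plain-Python sketch of how a row subset can lose a column. It assumes the width is guessed from stored entries only (highest column index plus one); the rows and the helper are hypothetical and do not call xgboost.

```python
# Each row maps column index -> stored value; the full matrix is declared 4 wide.
rows = [{0: 1.0, 3: 2.0}, {1: 1.0}, {2: 4.0}, {1: 3.0, 2: 1.0}]
declared_width = 4


def inferred_width(subset):
    """Width guessed from stored entries only: highest column index + 1."""
    return max(col for row in subset for col in row) + 1


test_fold = rows[:1]     # contains the only entry in column 3
train_fold = rows[1:]    # column 3 is entirely empty here

train_width = inferred_width(train_fold)
test_width = inferred_width(test_fold)
print(train_width, test_width, declared_width)  # 3 4 4
```

The training fold ends up one column narrower than the test fold even though both came from the same 4-column matrix, which matches the train/test mismatch described above.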

I'm trying to find a test where this works/doesn't work within xgboost core and the sklearn-wrapper as I'm confident what is happening, but I don't know where it is happening.

l3link on 7 Dec 2016

@l3link : your code seems to be outdated. Here's what I get:

In [2]: import scipy
   ...: import xgboost
   ...: matrix = xgboost.DMatrix(scipy.sparse.csr_matrix([[0, 2, 3, 0], [0, 2, 2, 0], [1, 0, 5, 0], [0, 1, 0, 0]], shape=(4,4)))
   ...: matrix.num_col()
   ...:
Out[2]: 4L

In [3]: matrix._init_from_csr??
Signature: matrix._init_from_csr(csr)
Source:
    def _init_from_csr(self, csr):
        """
        Initialize data from a CSR matrix.
        """
        if len(csr.indices) != len(csr.data):
            raise ValueError('length mismatch: {} vs {}'.format(len(csr.indices), len(csr.data)))
        self.handle = ctypes.c_void_p()
        _check_call(_LIB.XGDMatrixCreateFromCSREx(c_array(ctypes.c_size_t, csr.indptr),
                                                  c_array(ctypes.c_uint, csr.indices),
                                                  c_array(ctypes.c_float, csr.data),
                                                  len(csr.indptr), len(csr.data),
                                                  csr.shape[1],
                                                  ctypes.byref(self.handle)))
File:      c:\anaconda2\lib\site-packages\xgboost-0.6-py2.7.egg\xgboost\core.py
Type:      instancemethod

khotilov on 7 Dec 2016

Huh,

In [64]: xgboost.__version__
Out[64]: '0.6'

Signature: matrix._init_from_csr(csr)
Source:
    def _init_from_csr(self, csr):
        """
        Initialize data from a CSR matrix.
        """
        if len(csr.indices) != len(csr.data):
            raise ValueError('length mismatch: {} vs {}'.format(len(csr.indices), len(csr.data)))
        self.handle = ctypes.c_void_p()
        _check_call(_LIB.XGDMatrixCreateFromCSR(c_array(ctypes.c_ulong, csr.indptr),
                                                c_array(ctypes.c_uint, csr.indices),
                                                c_array(ctypes.c_float, csr.data),
                                                len(csr.indptr), len(csr.data),
                                                ctypes.byref(self.handle)))
File:      ~/anaconda/lib/python2.7/site-packages/xgboost/core.py
Type:      instancemethod

Seems bizarre that my 0.6 version calls XGDMatrixCreateFromCSR instead of XGDMatrixCreateFromCSREx; the former doesn't take the shape.
Is it possible the osx distribution is different?
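
The difference between the two calls can be modeled in plain Python. This is a toy sketch, not the library's C++ implementation: `to_csr` and `num_col_from_csr` are hypothetical helpers, and the inference rule (highest column index plus one) is an assumption. The matrix is the one from the session above.

```python
def to_csr(dense):
    """Build CSR arrays (indptr, indices, data) from a list of dense rows."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                indices.append(j)
                data.append(value)
        indptr.append(len(indices))
    return indptr, indices, data


def num_col_from_csr(indices, num_col_hint=None):
    """Without a hint, guess the width from stored indices; with one, trust it."""
    inferred = max(indices) + 1 if indices else 0
    if num_col_hint is None:
        return inferred
    if num_col_hint < inferred:
        raise ValueError('hint is smaller than the highest stored column index')
    return num_col_hint


dense = [[0, 2, 3, 0], [0, 2, 2, 0], [1, 0, 5, 0], [0, 1, 0, 0]]
indptr, indices, data = to_csr(dense)

print(num_col_from_csr(indices))                  # 3
print(num_col_from_csr(indices, len(dense[0])))   # 4
```

The CSR arrays alone carry no record of the all-zero last column, so only the variant that receives the shape can report 4 columns.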

l3link on 8 Dec 2016

I'm also afraid that there is a fundamental problem with sparse matrix handling, because of what @bryan-woods reported: let's say we have x zero-columns both in train and in test, but with different indices => there will be no error, because "feature_names(self)" returns the same feature list for both sets, but the predictions might be wrong, due to non-matching non-zero column indices between train and test.

Can someone please answer this question? I reverted to version 0.4 and now it seems to work, but I'm not sure it's working properly, because I'm still using really sparse matrices.

ghost on 12 Dec 2016

@l3link nothing bizarre about it: version numbers (or pypi packages) are sometimes not updated for long time. E.g., the https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/VERSION file as of today was last changed on Jul 29th, and the last pypi package https://pypi.python.org/pypi/xgboost/ is dated Aug 9th. While the fix was submitted on Sep 23rd #1606. Please check out the latest code from github.

khotilov on 2 Jan 2017

I had this problem when I used pandas DataFrame (non-sparse representation).
I converted it to numpy ndarray via df.as_matrix(), and I got rid of the error.

oleksandrasaskia on 28 Jan 2017

👍5 🎉2 ❤1

I too got rid of this error after converting dataframe to array.

pnandhini11 on 23 Mar 2017

👍1

Reordering the columns in the test-set in the same order as the train set fixed this for me.
I used Pandas dataframes. Without this, using .as_matrix() was throwing the same issue.

I did:

test = test[train.columns]
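
For readers without pandas at hand, the same realignment can be sketched with plain lists. The column names and values below are made up for illustration.

```python
train_columns = ['age', 'income', 'score']
test_columns = ['score', 'age', 'income']      # same names, different order
test_rows = [[0.9, 31, 52000], [0.4, 45, 61000]]

# For each training column, find where it sits in the test data.
order = [test_columns.index(name) for name in train_columns]

# Rebuild every test row in training-column order.
aligned = [[row[i] for i in order] for row in test_rows]
print(aligned)  # [[31, 52000, 0.9], [45, 61000, 0.4]]
```

This is the list-based equivalent of indexing the test dataframe with the training columns: values are matched by name, then laid out in the order the model saw during training.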

fx86 on 25 Jun 2017

👍6 🎉2

I tried @warpuv solution and it worked. My data is large, I cannot load them to memory to reorder columns.

nguyentp on 20 Oct 2017

Converting train/test csr matrices to csc worked for me

Xtrain = scipy.sparse.csc_matrix(Xtrain)

bdod6 on 2 Nov 2017

👍5

Converting to csc_matrix works, tested on 0.6a2:

    X_train = scipy.sparse.csc_matrix(X_train)
    X_test = scipy.sparse.csc_matrix(X_test)

    xgb_train = xgb.DMatrix(X_train, label=y_train)
    xgb_test = xgb.DMatrix(X_test, label=y_test)

type(X_train) <class 'scipy.sparse.csr.csr_matrix'>
type(X_test) <class 'scipy.sparse.csr.csr_matrix'>
type(X_train) <class 'scipy.sparse.csc.csc_matrix'>
type(X_test) <class 'scipy.sparse.csc.csc_matrix'>
type(xgb_train) <class 'xgboost.core.DMatrix'>
type(xgb_test) <class 'xgboost.core.DMatrix'>

My original sparse matrices are the output of the sklearn tf-idf vectorizer and were in csr_matrix format.

mrgloom on 15 Dec 2017

👍1

Is there any fix yet?

pallavbakshi on 21 Jan 2018

Just built the latest version (0.7.post3) in python3 and I can confirm that this issue still exists. After adapting @dmcgarry example above I am still seeing issues with both csr_matrix and csc_matrix.

import xgboost as xgb
import numpy as np
from scipy import sparse

np.random.seed(10)

X_csr = sparse.rand(100, 10).tocsr()
test_csr = sparse.rand(10, 500).tocsr()

X_csc = sparse.rand(100, 10).tocsc()
test_csc = sparse.rand(10, 500).tocsc()

y = np.random.randint(2, size=100)

clf_csr = xgb.XGBClassifier()
clf_csr.fit(X_csr, y)

clf_csc = xgb.XGBClassifier()
clf_csc.fit(X_csc, y)

# Try with csr
try:
    pred = clf_csr.predict_proba(test_csr)
    print("Works when csr with version %s" %xgb.__version__)
except ValueError:
    print("Broken when csr with version %s" %xgb.__version__)

try:
    test_csr[0,(test_csr.shape[1]-1)] = 1.0
    pred = clf_csr.predict_proba(test_csr)
    print("Works when adding non-zero entries to last column with version %s" %xgb.__version__)
except:
    print("Still broken when adding non-zero entries to last column with version %s" %xgb.__version__)

# Try with csc
try:
    pred = clf_csc.predict_proba(test_csc)
    print("Works when csc with version %s" %xgb.__version__)
except ValueError:
    print("Broken when csc with version %s" %xgb.__version__)

try:
    test_csc[0,(test_csc.shape[1]-1)] = 1.0
    pred = clf_csc.predict_proba(test_csc)
    print("Works when adding non-zero entries to last column with version %s" %xgb.__version__)
except:
    print("Still broken when adding non-zero entries to last column with version %s" %xgb.__version__)

The above code resulted in the following output:

Broken when csr with version 0.7
Still broken when adding non-zero entries to last column with version 0.7
Broken when csc with version 0.7
Still broken when adding non-zero entries to last column with version 0.7

ewellinger on 24 Jan 2018

pls help

hhristov94 on 27 Jan 2018

Why has this issue been closed?

ewellinger on 28 Feb 2018

I came across this problem twice recently. In one case, I simply changed the input dataframe into an array and it worked. In the second, I had to realign the column names of the test dataframe using test_df = test_df[train_df.columns]. In both cases, train_df and test_df had exactly the same column names.

CathyQian on 1 Mar 2018

I guess I don't understand your comment @CathyQian, are those train_df/test_df sparse? Also, which version of xgboost were you running when you ran into these issues?

ewellinger on 1 Mar 2018

@CathyQian xgboost relies on the _order_ of columns, and that is not related to this issue.

@ewellinger WRT your example: a model trained on data with 10 features should not accept data with 500 features for prediction, hence the error is thrown. Also, creating DMatrices from all of your matrices and inspecting their num_col and num_row produces expected results.
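
The expected rejection can be sketched as a simple name-list comparison. This is a hypothetical helper loosely modeled on the error message in the original traceback, not the actual `_validate_features` source.

```python
def validate_features(expected, actual):
    """Raise ValueError if the two feature-name lists differ (hypothetical helper)."""
    if expected == actual:
        return
    known = set(expected)
    unseen = [name for name in actual if name not in known]
    raise ValueError(
        'feature_names mismatch: expected {0} features, got {1}; '
        '{2} not seen in training'.format(len(expected), len(actual), len(unseen)))


train_names = ['f{0}'.format(i) for i in range(10)]    # model trained on 10 features
test_names = ['f{0}'.format(i) for i in range(500)]    # prediction data has 500

try:
    validate_features(train_names, test_names)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

With 10 training features and 500 prediction features the check fails by design, independently of any sparse-matrix handling.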

The current state of "sparsity issues" is:

DMatrix creation from CSR and its use in a model should work correctly. The issue was closed since that was the subject of this issue.
DMatrix creation from CSC produces an object with correct dimensions, but it might give incorrect results during training or prediction when last rows are fully sparse #2630. I didn't yet have time to properly fix that part.
A parameter to specify a predefined number of columns when loading libsvm data into DMatrix was not implemented yet. Volunteers to contribute are welcome.

khotilov on 2 Mar 2018

@warpuv it works for me, thanks a lot.

rainness on 20 Mar 2018

Had the same error with dense matrices (xgboost v0.6 from the latest anaconda).
The error occurred when I ran multiple regressions on different feature subsets of the training sample.
Creating a new model instance each time before fitting the next regression fixed the problem.

ag95v2 on 1 Jun 2018

A parameter to specify a predefined number of columns when loading libsvm data into DMatrix was not implemented yet. Volunteers to contribute are welcome.

As of 0.8, this still doesn't exist right?

JulianKlug on 4 Oct 2018

DMatrix creation from CSC produces an object with correct dimensions, but it might give incorrect results during training or prediction when last rows are fully sparse #2630. I didn't yet have time to properly fix that part.

@khotilov #3553 fixed this issue.

A parameter to specify a predefined number of columns when loading libsvm data into DMatrix was not implemented yet. Volunteers to contribute are welcome.

@MonsieurWave For this feature, a small pull request to dmlc-core should do the trick. Let me look at it.