Xgboost: feature_names mismatch while using sparse matrices in Python

Created on 31 May 2016 · 51 comments · Source: dmlc/xgboost

I get ValueError: feature_names mismatch while training xgboost with sparse matrices in Python.
The xgboost version is the latest from git. Older versions do not give this error. The error is raised at prediction time.

Code:

from scipy import sparse
import xgboost as xgb
from random import *
randBinList = lambda n: [randint(0,1) for b in range(1,n+1)]

train = sparse.rand(100,500)
test = sparse.rand(10, 500)
y = randBinList(100)
clf = xgb.XGBClassifier()
clf.fit(train,y)
preds = clf.predict_proba(test)

전체 역좔적:

ValueError                                Traceback (most recent call last)
<ipython-input-15-e03f10289bf1> in <module>()
----> 1 preds = clf.predict_proba(test)

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/sklearn.pyc in predict_proba(self, data, output_margin, ntree_limit)
    471         class_probs = self.booster().predict(test_dmatrix,
    472                                              output_margin=output_margin,
--> 473                                              ntree_limit=ntree_limit)
    474         if self.objective == "multi:softprob":
    475             return class_probs

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/usr/local/lib/python2.7/dist-packages/xgboost-0.4-py2.7.egg/xgboost/core.pyc in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59', 'f60', 'f61', 'f62', 'f63', 'f64', 'f65', 'f66', 'f67', 'f68', 'f69', 'f70', 'f71', 'f72', 'f73', 'f74', 'f75', 'f76', 'f77', 'f78', 'f79', 'f80', 'f81', 'f82', 'f83', 'f84', 'f85', 'f86', 'f87', 'f88', 'f89', 'f90', 'f91', 'f92', 'f93', 'f94', 'f95', 'f96', 'f97', 'f98', 'f99', 'f100', 'f101', 'f102', 'f103', 'f104', 'f105', 'f106', 'f107', 'f108', 'f109', 'f110', 'f111', 'f112', 'f113', 'f114', 'f115', 'f116', 'f117', 'f118', 'f119', 'f120', 'f121', 'f122', 'f123', 'f124', 'f125', 'f126', 'f127', 'f128', 'f129', 'f130', 'f131', 'f132', 'f133', 'f134', 'f135', 'f136', 'f137', 'f138', 'f139', 'f140', 'f141', 'f142', 'f143', 'f144', 'f145', 'f146', 'f147', 'f148', 'f149', 'f150', 'f151', 'f152', 'f153', 'f154', 'f155', 'f156', 'f157', 'f158', 'f159', 'f160', 'f161', 'f162', 'f163', 'f164', 'f165', 'f166', 'f167', 'f168', 'f169', 'f170', 'f171', 'f172', 'f173', 'f174', 'f175', 'f176', 'f177', 'f178', 'f179', 'f180', 'f181', 'f182', 'f183', 'f184', 'f185', 'f186', 'f187', 'f188', 'f189', 'f190', 'f191', 'f192', 'f193', 'f194', 'f195', 'f196', 'f197', 'f198', 'f199', 'f200', 'f201', 'f202', 'f203', 'f204', 'f205', 'f206', 'f207', 'f208', 'f209', 'f210', 'f211', 'f212', 'f213', 'f214', 'f215', 'f216', 'f217', 'f218', 'f219', 'f220', 'f221', 'f222', 'f223', 'f224', 'f225', 'f226', 'f227', 'f228', 'f229', 'f230', 'f231', 'f232', 'f233', 'f234', 'f235', 'f236', 'f237', 'f238', 'f239', 'f240', 'f241', 'f242', 'f243', 'f244', 'f245', 'f246', 'f247', 'f248', 'f249', 'f250', 'f251', 'f252', 'f253', 'f254', 'f255', 'f256', 'f257', 'f258', 
'f259', 'f260', 'f261', 'f262', 'f263', 'f264', 'f265', 'f266', 'f267', 'f268', 'f269', 'f270', 'f271', 'f272', 'f273', 'f274', 'f275', 'f276', 'f277', 'f278', 'f279', 'f280', 'f281', 'f282', 'f283', 'f284', 'f285', 'f286', 'f287', 'f288', 'f289', 'f290', 'f291', 'f292', 'f293', 'f294', 'f295', 'f296', 'f297', 'f298', 'f299', 'f300', 'f301', 'f302', 'f303', 'f304', 'f305', 'f306', 'f307', 'f308', 'f309', 'f310', 'f311', 'f312', 'f313', 'f314', 'f315', 'f316', 'f317', 'f318', 'f319', 'f320', 'f321', 'f322', 'f323', 'f324', 'f325', 'f326', 'f327', 'f328', 'f329', 'f330', 'f331', 'f332', 'f333', 'f334', 'f335', 'f336', 'f337', 'f338', 'f339', 'f340', 'f341', 'f342', 'f343', 'f344', 'f345', 'f346', 'f347', 'f348', 'f349', 'f350', 'f351', 'f352', 'f353', 'f354', 'f355', 'f356', 'f357', 'f358', 'f359', 'f360', 'f361', 'f362', 'f363', 'f364', 'f365', 'f366', 'f367', 'f368', 'f369', 'f370', 'f371', 'f372', 'f373', 'f374', 'f375', 'f376', 'f377', 'f378', 'f379', 'f380', 'f381', 'f382', 'f383', 'f384', 'f385', 'f386', 'f387', 'f388', 'f389', 'f390', 'f391', 'f392', 'f393', 'f394', 'f395', 'f396', 'f397', 'f398', 'f399', 'f400', 'f401', 'f402', 'f403', 'f404', 'f405', 'f406', 'f407', 'f408', 'f409', 'f410', 'f411', 'f412', 'f413', 'f414', 'f415', 'f416', 'f417', 'f418', 'f419', 'f420', 'f421', 'f422', 'f423', 'f424', 'f425', 'f426', 'f427', 'f428', 'f429', 'f430', 'f431', 'f432', 'f433', 'f434', 'f435', 'f436', 'f437', 'f438', 'f439', 'f440', 'f441', 'f442', 'f443', 'f444', 'f445', 'f446', 'f447', 'f448', 'f449', 'f450', 'f451', 'f452', 'f453', 'f454', 'f455', 'f456', 'f457', 'f458', 'f459', 'f460', 'f461', 'f462', 'f463', 'f464', 'f465', 'f466', 'f467', 'f468', 'f469', 'f470', 'f471', 'f472', 'f473', 'f474', 'f475', 'f476', 'f477', 'f478', 'f479', 'f480', 'f481', 'f482', 'f483', 'f484', 'f485', 'f486', 'f487', 'f488', 'f489', 'f490', 'f491', 'f492', 'f493', 'f494', 'f495', 'f496', 'f497', 'f498'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 
'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59', 'f60', 'f61', 'f62', 'f63', 'f64', 'f65', 'f66', 'f67', 'f68', 'f69', 'f70', 'f71', 'f72', 'f73', 'f74', 'f75', 'f76', 'f77', 'f78', 'f79', 'f80', 'f81', 'f82', 'f83', 'f84', 'f85', 'f86', 'f87', 'f88', 'f89', 'f90', 'f91', 'f92', 'f93', 'f94', 'f95', 'f96', 'f97', 'f98', 'f99', 'f100', 'f101', 'f102', 'f103', 'f104', 'f105', 'f106', 'f107', 'f108', 'f109', 'f110', 'f111', 'f112', 'f113', 'f114', 'f115', 'f116', 'f117', 'f118', 'f119', 'f120', 'f121', 'f122', 'f123', 'f124', 'f125', 'f126', 'f127', 'f128', 'f129', 'f130', 'f131', 'f132', 'f133', 'f134', 'f135', 'f136', 'f137', 'f138', 'f139', 'f140', 'f141', 'f142', 'f143', 'f144', 'f145', 'f146', 'f147', 'f148', 'f149', 'f150', 'f151', 'f152', 'f153', 'f154', 'f155', 'f156', 'f157', 'f158', 'f159', 'f160', 'f161', 'f162', 'f163', 'f164', 'f165', 'f166', 'f167', 'f168', 'f169', 'f170', 'f171', 'f172', 'f173', 'f174', 'f175', 'f176', 'f177', 'f178', 'f179', 'f180', 'f181', 'f182', 'f183', 'f184', 'f185', 'f186', 'f187', 'f188', 'f189', 'f190', 'f191', 'f192', 'f193', 'f194', 'f195', 'f196', 'f197', 'f198', 'f199', 'f200', 'f201', 'f202', 'f203', 'f204', 'f205', 'f206', 'f207', 'f208', 'f209', 'f210', 'f211', 'f212', 'f213', 'f214', 'f215', 'f216', 'f217', 'f218', 'f219', 'f220', 'f221', 'f222', 'f223', 'f224', 'f225', 'f226', 'f227', 'f228', 'f229', 'f230', 'f231', 'f232', 'f233', 'f234', 'f235', 'f236', 'f237', 'f238', 'f239', 'f240', 'f241', 'f242', 'f243', 'f244', 'f245', 'f246', 'f247', 'f248', 'f249', 'f250', 'f251', 'f252', 'f253', 'f254', 'f255', 'f256', 'f257', 'f258', 'f259', 'f260', 'f261', 'f262', 'f263', 'f264', 'f265', 'f266', 'f267', 'f268', 'f269', 'f270', 'f271', 'f272', 
'f273', 'f274', 'f275', 'f276', 'f277', 'f278', 'f279', 'f280', 'f281', 'f282', 'f283', 'f284', 'f285', 'f286', 'f287', 'f288', 'f289', 'f290', 'f291', 'f292', 'f293', 'f294', 'f295', 'f296', 'f297', 'f298', 'f299', 'f300', 'f301', 'f302', 'f303', 'f304', 'f305', 'f306', 'f307', 'f308', 'f309', 'f310', 'f311', 'f312', 'f313', 'f314', 'f315', 'f316', 'f317', 'f318', 'f319', 'f320', 'f321', 'f322', 'f323', 'f324', 'f325', 'f326', 'f327', 'f328', 'f329', 'f330', 'f331', 'f332', 'f333', 'f334', 'f335', 'f336', 'f337', 'f338', 'f339', 'f340', 'f341', 'f342', 'f343', 'f344', 'f345', 'f346', 'f347', 'f348', 'f349', 'f350', 'f351', 'f352', 'f353', 'f354', 'f355', 'f356', 'f357', 'f358', 'f359', 'f360', 'f361', 'f362', 'f363', 'f364', 'f365', 'f366', 'f367', 'f368', 'f369', 'f370', 'f371', 'f372', 'f373', 'f374', 'f375', 'f376', 'f377', 'f378', 'f379', 'f380', 'f381', 'f382', 'f383', 'f384', 'f385', 'f386', 'f387', 'f388', 'f389', 'f390', 'f391', 'f392', 'f393', 'f394', 'f395', 'f396', 'f397', 'f398', 'f399', 'f400', 'f401', 'f402', 'f403', 'f404', 'f405', 'f406', 'f407', 'f408', 'f409', 'f410', 'f411', 'f412', 'f413', 'f414', 'f415', 'f416', 'f417', 'f418', 'f419', 'f420', 'f421', 'f422', 'f423', 'f424', 'f425', 'f426', 'f427', 'f428', 'f429', 'f430', 'f431', 'f432', 'f433', 'f434', 'f435', 'f436', 'f437', 'f438', 'f439', 'f440', 'f441', 'f442', 'f443', 'f444', 'f445', 'f446', 'f447', 'f448', 'f449', 'f450', 'f451', 'f452', 'f453', 'f454', 'f455', 'f456', 'f457', 'f458', 'f459', 'f460', 'f461', 'f462', 'f463', 'f464', 'f465', 'f466', 'f467', 'f468', 'f469', 'f470', 'f471', 'f472', 'f473', 'f474', 'f475', 'f476', 'f477', 'f478', 'f479', 'f480', 'f481', 'f482', 'f483', 'f484', 'f485', 'f486', 'f487', 'f488', 'f489', 'f490', 'f491', 'f492', 'f493', 'f494', 'f495', 'f496', 'f497', 'f498', 'f499']
training data did not have the following fields: f499

All 51 comments

This seems to work only when the sparse matrix is CSC. It does not work for CSR or COO matrices, as it did in earlier versions.

Doesn't the problem occur randomly, when the rightmost columns are all 0 or 1? It may be the same as #1091 and #1221.

@sinhrks: For me it is not "random". I frequently train XGBoost on very sparse data.

Then, once I run the trained model in a production environment, I naturally want to make predictions on new pieces of incoming data. That data is, of course, very likely to be sparse and to have no value for whichever column happens to be last. So XGBoost now breaks frequently for me, and I have found myself switching to other (less accurate) models simply because they support sparse data better.

Does anyone know exactly why this error occurs and how to address it? It is a pain point for me because my existing scripts are failing.

I tried xgboost as part of an sklearn pipeline and ran into the same issue. Is there a workaround until it is fixed?

Yes, when you call predict, use the toarray() function of the sparse array. It is terribly inefficient with memory, but workable for small slices.

On Aug 26, 2016, at 10:44 PM, Pedro Rodriguez wrote:

> I tried xgboost as part of an sklearn pipeline and ran into the same issue. Is there a workaround until it is fixed?

For some reason, the error does not occur if I save and load the trained model:

    bst = xgb.train(param, dtrain, num_round)

    # predict is not working without this code
    bst.save_model(model_file_name)
    bst = xgb.Booster(param)
    bst.load_model(model_file_name)

    preds = bst.predict(dtest)

@bryan-woods I was able to find a better workaround with tocsc. There may be some performance penalty, but it is not nearly as bad as making it a dense matrix.

Including this in my sklearn pipeline right before xgboost worked:

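from sklearn.base import TransformerMixin


class CSCTransformer(TransformerMixin):
    """Pipeline step that converts a scipy sparse matrix to CSC format."""

    def transform(self, X, y=None, **fit_params):
        return X.tocsc()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        # Stateless: there is nothing to learn from the data.
        return self

    def get_params(self, deep=True):
        return {}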

Neither the CSC format nor adding non-zero entries to the last column fixes the issue in the latest version of xgboost. Reverting to version 0.4a30 is the only thing that makes it work for me. Consider the following tweaks (with a reproducible seed) to the original example:

>>> import xgboost as xgb
>>> import numpy as np
>>> from scipy import sparse
>>> 
>>> np.random.seed(10)
>>> X = sparse.rand(100,10).tocsr()
>>> test = sparse.rand(10, 500).tocsr()
>>> y = np.random.randint(2,size=100)
>>> 
>>> clf = xgb.XGBClassifier()
>>> clf.fit(X,y)
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
>>> 
>>> try:
...     pred = clf.predict_proba(test)
...     print "Works when csr with version %s" %xgb.__version__
... except ValueError:
...     "Broken when csr with version %s" %xgb.__version__
... 
'Broken when csr with version 0.6'
>>> try:
...     pred = clf.predict_proba(test.tocsc())
...     print "Works when csc with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when csc with version %s" %xgb.__version__
... 
'Still broken when csc with version 0.6'
>>> try:
...     test[0,(test.shape[1]-1)] = 1.0
...     pred = clf.predict_proba(test)
...     print "Works when adding non-zero entries to last column with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when adding non-zero entries to last column with version %s" %xgb.__version__
... 
/home/david.mcgarry/.conda/envs/ml/lib/python2.7/site-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
'Still broken when adding non-zero entries to last column with version 0.6'
>>> import xgboost as xgb
>>> import numpy as np
>>> from scipy import sparse
>>> 
>>> np.random.seed(10)
>>> X = sparse.rand(100,10).tocsr()
>>> test = sparse.rand(10, 500).tocsr()
>>> y = np.random.randint(2,size=100)
>>> 
>>> clf = xgb.XGBClassifier()
>>> clf.fit(X,y)
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
>>> 
>>> try:
...     pred = clf.predict_proba(test)
...     print "Works when csr with version %s" %xgb.__version__
... except ValueError:
...     "Broken when csr with version %s" %xgb.__version__
... 
Works when csr with version 0.4
>>> try:
...     pred = clf.predict_proba(test.tocsc())
...     print "Works when csc with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when csc with version %s" %xgb.__version__
... 
Works when csc with version 0.4
>>> try:
...     test[0,(test.shape[1]-1)] = 1.0
...     pred = clf.predict_proba(test)
...     print "Works when adding non-zero entries to last column with version %s" %xgb.__version__
... except ValueError:
...     "Still broken when adding non-zero entries to last column with version %s" %xgb.__version__
... 
/Users/david.mcgarry/anaconda/envs/ml/lib/python2.7/site-packages/scipy/sparse/compressed.py:739: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
Works when adding non-zero entries to last column with version 0.4

Same problem here. Something definitely broke in the last release. I did not have this problem before with the same dataset and processing. I might be wrong, but it looks like there are currently no unit tests with sparse csr arrays in Python using the sklearn API. Would it be possible to add @dmcgarry's example above to tests/python/tests_with_sklearn.py?

I tried to work around this issue by using .toarray() with CSR sparse arrays, but something is seriously broken. If I load a saved model and try to use it to make predictions with .toarray(), I get no error message but the results are incorrect. I rolled back to 0.4a30 and it works fine. I have not had time to chase down the root cause, but it is not good.

The problem occurs because DMatrix.num_col() returns only the number of non-zero columns in a sparse matrix. So if the training data and the test data have the same number of non-zero columns, everything works fine.
Otherwise you end up with different feature name lists, because the validation function calls:

    @property
    def feature_names(self):
        """Get feature names (column labels).

        Returns
        -------
        feature_names : list or None
        """
        if self._feature_names is None:
            return ['f{0}'.format(i) for i in range(self.num_col())]
        else:
            return self._feature_names

self._feature_names is None for sparse matrices, and because self.num_col() returns only the number of non-zero columns, validation fails as soon as the number of non-zero columns in the "to-be-predicted" data differs from the number of non-zero columns in the training data.

I have not yet found the best place to fix it.
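The mechanism described above can be imitated without xgboost. This is a minimal sketch under a stated assumption: the column count is taken to be the highest column index that holds a stored value, plus one, which is one reading of "number of non-zero columns" that matches the f498/f499 difference in the traceback. The helpers inferred_num_col and default_feature_names are hypothetical and are not xgboost functions.

```python
def inferred_num_col(rows):
    """Column count inferred from the data alone: highest stored index + 1.

    `rows` is a list of {column_index: value} dicts, a toy stand-in for a
    sparse matrix. The declared shape is deliberately ignored (assumption).
    """
    max_index = -1
    for row in rows:
        for col in row:
            max_index = max(max_index, col)
    return max_index + 1


def default_feature_names(num_col):
    # Same naming scheme as the feature_names property quoted above.
    return ['f{0}'.format(i) for i in range(num_col)]


# Both matrices are meant to be 4 columns wide, but the last column of
# `test` holds no stored value.
train = [{0: 1.0, 3: 2.0}, {1: 0.5}]
test = [{0: 1.0, 2: 2.0}, {1: 0.5}]

train_names = default_feature_names(inferred_num_col(train))
test_names = default_feature_names(inferred_num_col(test))
print(train_names)
print(test_names)
print(train_names == test_names)
```

The two name lists differ only because the last column of the toy test data is empty, which is the situation that triggers the ValueError in the report.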

I am also afraid there is a more fundamental problem with sparse matrix handling, because of what @bryan-woods reported. When train and test happen to have the same number of non-zero columns, no error is raised, because "feature_names(self)" returns the same feature list for both sets. However, the predictions are wrong because the non-zero column indices do not match between train and test.

Has anyone worked on this issue? Has anyone at least developed a unit test that could be used for development?

I have not worked on it, but I think @dmcgarry's example above could be used as the start of a unit test.

import xgboost as xgb
import numpy as np
import scipy.sparse


def test_xgbclassifier_sklearn_sparse():
    np.random.seed(10)
    X = scipy.sparse.rand(100,10).tocsr()
    test = scipy.sparse.rand(10, 500).tocsr()
    y = np.random.randint(2,size=100)

    clf = xgb.XGBClassifier()
    clf.fit(X,y)
    pred = clf.predict_proba(test)

I created a few new sparse array tests in my fork of the repository. For those who are interested:
https://github.com/bryan-woods/xgboost/blob/sparse_test/tests/python/test_scipy_sparse.py

To run the tests from the root directory of the checkout:
python -m nose tests/python/test_scipy_sparse.py

You will notice that both tests fail. This at least provides a test to develop against.

저도 이 문제λ₯Ό κ²ͺκ³  μžˆμ§€λ§Œ libμ—μ„œ μ΅œμ’…μ μœΌλ‘œ 해결될 λ•ŒκΉŒμ§€ ν•΄κ²°ν•˜λŠ” κ°€μž₯ 쒋은 방법을 μ•Œ 수 μ—†μŠ΅λ‹ˆλ‹€.

maxid:0 κ³Ό 같은 μ΅œλŒ€ κΈ°λŠ₯ 인덱슀λ₯Ό μ‚¬μš©ν•˜μ—¬ κΈ°λŠ₯ λͺ©λ‘μ— κΈ°λŠ₯을 μΆ”κ°€ν•  수 μžˆμŠ΅λ‹ˆλ‹€

데이터 ν”„λ ˆμž„μ„ μ „λ‹¬ν•˜λ©΄ λ¬Έμ œκ°€ ν•΄κ²°λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

μ–΄λ–»κ²Œ 버전 0.4둜 되돌릴 수 μžˆμŠ΅λ‹ˆκΉŒ?

핍 μ„€μΉ˜ --μ—…κ·Έλ ˆμ΄λ“œ xgboost==0.4a30

λͺ¨λ“  μœ ν˜•μ˜ ν¬μ†Œ 행렬이 μž‘λ™ν•˜μ§€ μ•Šμ•˜μŠ΅λ‹ˆλ‹€(μ €λŠ” tf-idf λ°μ΄ν„°λ‘œ μž‘μ—… μ€‘μž…λ‹ˆλ‹€). 이전 λ²„μ „μœΌλ‘œ λ˜λŒλ €μ•Ό ν–ˆμŠ΅λ‹ˆλ‹€. 팁 κ³ λ§ˆμ›Œ!

μ—¬μ „νžˆ λ¬Έμ œκ°€ μžˆλŠ” λͺ¨λ“  μ‚¬μš©μž: μ‚¬μš© 쀑인 μ½”λ“œμ— #1606의 μˆ˜μ • 사항이 ν¬ν•¨λ˜μ–΄ μžˆμŠ΅λ‹ˆκΉŒ?

예, λ§ˆμ§€λ§‰ λ²„μ „μ˜ xgboostλ₯Ό μ„€μΉ˜ν–ˆλŠ”λ° μ—¬μ „νžˆ 이 λ¬Έμ œκ°€ μžˆμŠ΅λ‹ˆλ‹€.

이것은 μ—¬μ „νžˆ β€‹β€‹μ‘΄μž¬ν•˜λ©° μ‰½κ²Œ μž¬ν˜„ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μΆ©λΆ„νžˆ 큰 데이터 μ„ΈνŠΈλ₯Ό μ‚¬μš©ν•˜λŠ” 경우 이것은 λ°œμƒν•  κ°€λŠ₯성이 μ μ§€λ§Œ, 이λ₯Ό κ·Έλ¦¬λ“œ 검색 객체둜 λž˜ν•‘ν•˜λŠ” 경우 train/cv ν…ŒμŠ€νŠΈ μ„ΈνŠΈμ—μ„œ μ‚¬μš© κ°€λŠ₯ν•œ κΈ°λŠ₯이 λ‹€λ₯Έ cv λΆ„ν•  λ‚΄μ—μ„œ 거의 ν™•μ‹€ν•˜κ²Œ λ°œμƒν•©λ‹ˆλ‹€.

μ†”μ§νžˆ λ§ν•΄μ„œ DMatrixκ°€ scipy ν¬μ†Œ ν–‰λ ¬μ—μ„œ μ œκ³΅ν•˜λŠ” λͺ¨μ–‘ 힌트λ₯Ό λ¬΄μ‹œν•˜λŠ” 이유λ₯Ό 잘 λͺ¨λ₯΄κ² μŠ΅λ‹ˆλ‹€. 크기λ₯Ό κ³„μ‚°ν•˜λŠ” λŒ€μ‹  정보λ₯Ό 기반으둜 μ„€μ •ν•΄μ•Ό ν•©λ‹ˆλ‹€.

Xgboost Python κΈ°λ³Έ API(0.6)λ₯Ό μ‚¬μš© 쀑이며 LIBSVM [sparse] ν˜•μ‹ νŒŒμΌμ—μ„œ DMatrixλ₯Ό λ‘œλ“œν•  λ•Œ λ™μΌν•œ 였λ₯˜κ°€ λ°œμƒν•©λ‹ˆλ‹€. ν¬ν•¨λœ 행에 λ§ˆμ§€λ§‰ 열이 μ •μ˜λ˜μ–΄ μžˆλŠ” κ²½μš°μž…λ‹ˆλ‹€. λ‚΄ ν•΄κ²° 방법은 첫 번째 행에 더미 열을 μ •μ˜ν•˜λŠ” κ²ƒμž…λ‹ˆλ‹€.

train_fv_file = 'train_fv_eval.svm'
dtrain = xgb.DMatrix(train_fv_file, feature_names=feature_vector_labels, feature_types=feature_vector_types)

If it is so easy to reproduce, would anyone be interested in providing a reproducible example? Preferably without the sklearn layer (to isolate the possible cause).

@gabrielspmoreira: I see your point about loading from a LIBSVM file where the last few columns are completely sparse... that DMatrix construction method would benefit from a num_col hint as well.

In [42]: matrix = xgboost.DMatrix(scipy.sparse.csr_matrix([[0, 2, 3, 0], [0, 2, 2, 0], [1, 0, 5, 0], [0, 1, 0, 0]], shape=(4,4)))
In [43]: matrix.num_col()
Out[43]: 3L

This is likely to happen whenever a new DMatrix is created on a subsample of rows/columns (the number of columns shrinks even though we explicitly told the DMatrix how many columns there are). It happens often for smaller datasets or very sparse columns, because a subset is more likely to be all zeros.

Once this happens between the train and test sets, the model cannot evaluate the test set, because it expects a different number of features and throws a ValueError.

I am trying to find tests within the xgboost core and the sklearn wrapper where this does or does not work, so that I can be confident about what is going on, but I am not sure where it happens.

@l3link: Your code seems to be outdated. Here is what I get:

In [2]: import scipy
   ...: import xgboost
   ...: matrix = xgboost.DMatrix(scipy.sparse.csr_matrix([[0, 2, 3, 0], [0, 2, 2, 0], [1, 0, 5, 0], [0, 1, 0, 0]], shape=(4,4)))
   ...: matrix.num_col()
   ...:
Out[2]: 4L

In [3]: matrix._init_from_csr??
Signature: matrix._init_from_csr(csr)
Source:
    def _init_from_csr(self, csr):
        """
        Initialize data from a CSR matrix.
        """
        if len(csr.indices) != len(csr.data):
            raise ValueError('length mismatch: {} vs {}'.format(len(csr.indices), len(csr.data)))
        self.handle = ctypes.c_void_p()
        _check_call(_LIB.XGDMatrixCreateFromCSREx(c_array(ctypes.c_size_t, csr.indptr),
                                                  c_array(ctypes.c_uint, csr.indices),
                                                  c_array(ctypes.c_float, csr.data),
                                                  len(csr.indptr), len(csr.data),
                                                  csr.shape[1],
                                                  ctypes.byref(self.handle)))
File:      c:\anaconda2\lib\site-packages\xgboost-0.6-py2.7.egg\xgboost\core.py
Type:      instancemethod

뭐,

In [64]: xgboost.__version__ Out[64]: '0.6'

Signature: matrix._init_from_csr(csr) Source: def _init_from_csr(self, csr): """ Initialize data from a CSR matrix. """ if len(csr.indices) != len(csr.data): raise ValueError('length mismatch: {} vs {}'.format(len(csr.indices), len(csr.data))) self.handle = ctypes.c_void_p() _check_call(_LIB.XGDMatrixCreateFromCSR(c_array(ctypes.c_ulong, csr.indptr), c_array(ctypes.c_uint, csr.indices), c_array(ctypes.c_float, csr.data), len(csr.indptr), len(csr.data), ctypes.byref(self.handle))) File: ~/anaconda/lib/python2.7/site-packages/xgboost/core.py Type: instancemethod

It seems strange that my .6 version has XGDMatrixCreateFromCSR, which does not take the shape, instead of XGDMatrixCreateFromCSREx.
Could the osx distribution be different?

> I am also afraid there is a more fundamental problem with sparse matrix handling, because of what @bryan-woods reported. No error is raised, because "feature_names(self)" returns the same feature list for both sets. However, the predictions are wrong because the non-zero column indices do not match between train and test.

Could someone please answer this question? I reverted to version 0.4 and it now seems to work, but I am worried about whether it really works correctly, since I am still using sparse matrices.

@l3link It is not strange: the version number (or the pypi package) sometimes goes a long time without being updated. For example, as of today the https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/VERSION file was last changed on Jul 29, and the last pypi package at https://pypi.python.org/pypi/xgboost/ is dated Aug 9, while the fix was submitted on Sep 23 in #1606. Please check out the latest code from github.

I ran into this issue when using a pandas DataFrame (non-sparse representation).
I converted it to a numpy ndarray via df.as_matrix() and got rid of the error.

I also got rid of this error after converting the dataframe to an array.

Reordering the columns of the test set into the same order as the train set fixed this for me.
I used Pandas dataframes. Without the reordering, using .as_matrix() gave the same problem.

I did:

test = test[train.columns]

I tried @warpuv's solution.

Converting the train/test csr matrices to csc worked for me.

Xtrain = scipy.sparse.csc_matrix(Xtrain)

Converting to csc_matrix works; tested on 0.6a2.

    X_train = scipy.sparse.csc_matrix(X_train)
    X_test = scipy.sparse.csc_matrix(X_test)

    xgb_train = xgb.DMatrix(X_train, label=y_train)
    xgb_test = xgb.DMatrix(X_test, label=y_test)
type(X_train) <class 'scipy.sparse.csr.csr_matrix'>
type(X_test) <class 'scipy.sparse.csr.csr_matrix'>
type(X_train) <class 'scipy.sparse.csc.csc_matrix'>
type(X_test) <class 'scipy.sparse.csc.csc_matrix'>
type(xgb_train) <class 'xgboost.core.DMatrix'>
type(xgb_test) <class 'xgboost.core.DMatrix'>

My original sparse matrix is the output of sklearn's tf-idf vectorizer, in csr_matrix format.

Is there a fix yet?

I just built the latest version (0.7.post3) on python3 and can confirm that this problem still exists. After adapting @dmcgarry's example above, I still see the problem with both csr_matrix and csc_matrix.

import xgboost as xgb
import numpy as np
from scipy import sparse

np.random.seed(10)

X_csr = sparse.rand(100, 10).tocsr()
test_csr = sparse.rand(10, 500).tocsr()

X_csc = sparse.rand(100, 10).tocsc()
test_csc = sparse.rand(10, 500).tocsc()

y = np.random.randint(2, size=100)

clf_csr = xgb.XGBClassifier()
clf_csr.fit(X_csr, y)

clf_csc = xgb.XGBClassifier()
clf_csc.fit(X_csc, y)

# Try with csr
try:
    pred = clf_csr.predict_proba(test_csr)
    print("Works when csr with version %s" %xgb.__version__)
except ValueError:
    print("Broken when csr with version %s" %xgb.__version__)

try:
    test_csr[0,(test_csr.shape[1]-1)] = 1.0
    pred = clf_csr.predict_proba(test_csr)
    print("Works when adding non-zero entries to last column with version %s" %xgb.__version__)
except:
    print("Still broken when adding non-zero entries to last column with version %s" %xgb.__version__)

# Try with csc
try:
    pred = clf_csc.predict_proba(test_csc)
    print("Works when csc with version %s" %xgb.__version__)
except ValueError:
    print("Broken when csc with version %s" %xgb.__version__)

try:
    test_csc[0,(test_csc.shape[1]-1)] = 1.0
    pred = clf_csc.predict_proba(test_csc)
    print("Works when adding non-zero entries to last column with version %s" %xgb.__version__)
except:
    print("Still broken when adding non-zero entries to last column with version %s" %xgb.__version__)

The code above produces the following output:

Broken when csr with version 0.7
Still broken when adding non-zero entries to last column with version 0.7
Broken when csc with version 0.7
Still broken when adding non-zero entries to last column with version 0.7

Please help

Why was this issue closed?

I have run into this problem twice recently. In one case, simply changing the input dataframe to an array worked. In the second case, I had to reorder the columns of the test dataframe using test_df = test_df[train_df.columns]. In both cases, train_df and test_df have exactly the same column names.

@CathyQian I am not sure I understand your comment. Are train_df / test_df sparse? Also, which version of xgboost were you running when you hit these problems?

@CathyQian xgboost relies on the _order_ of columns, and that is not related to this issue.
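Because the model depends on column order, dense inputs need their columns in the training order before prediction. With pandas the thread's one-liner test = test[train.columns] does this; the sketch below shows the same idea on plain lists. The helper reorder_columns is hypothetical and exists only for illustration.

```python
def reorder_columns(rows, current_order, target_order):
    """Return `rows` with columns rearranged from current_order to target_order.

    Hypothetical helper; both orders must contain exactly the same names.
    """
    if sorted(current_order) != sorted(target_order):
        raise ValueError('column names differ between the two sets')
    # Position in the current layout of each column the target layout wants.
    positions = [current_order.index(name) for name in target_order]
    return [[row[p] for p in positions] for row in rows]


train_columns = ['age', 'height', 'weight']
test_columns = ['height', 'age', 'weight']
test_rows = [[180, 30, 75], [165, 41, 60]]

aligned = reorder_columns(test_rows, test_columns, train_columns)
print(aligned)
```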

@ewellinger WRT your example: a model trained on data with 10 features should not accept data with 500 features for prediction, so an error is thrown. Also, creating DMatrices from all of your matrices and inspecting their num_col and num_row produces the expected results.

The current state of the "sparsity issue" is:

  • DMatrix creation from CSR and its use in models should work correctly. Since that was the subject of this issue, the issue was closed.
  • DMatrix creation from CSC produces an object of the correct dimensions, but it may give wrong results during training or prediction when the last rows are completely sparse #2630. I have not yet had time to fix that part properly.
  • Loading libsvm data into a DMatrix does not yet have a parameter for specifying a predefined number of columns. Volunteers to contribute are welcome.

@warpuv That works for me. Thank you.

I got the same error with dense matrices (xgboost v.0.6 from the latest anaconda).
The error occurred when I ran multiple regressions on different feature subsets of the training sample.
Creating a new model instance each time before fitting the next regression solved the problem.

>   • Loading libsvm data into a DMatrix does not yet have a parameter for specifying a predefined number of columns. Volunteers to contribute are welcome.

As of 0.8, this still does not exist, right?

> DMatrix creation from CSC produces an object of the correct dimensions, but it may give wrong results during training or prediction when the last rows are completely sparse #2630. I have not yet had time to fix that part properly.

@khotilov #3553 fixed this issue.

> Loading libsvm data into a DMatrix does not yet have a parameter for specifying a predefined number of columns. Volunteers to contribute are welcome.

@MonsieurWave For this feature, a small pull request to dmlc-core should do the trick. Let me look into it.

@hcho3 Thank you.

For now, I work around this problem by making the first line of the libsvm file not too sparse, that is, by also storing the columns whose value is 0.
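A minimal sketch of that workaround (and of the earlier maxid:0 suggestion), assuming zero-based column indices and assuming the loader keeps an explicitly stored zero when it infers the column count. The function to_libsvm_lines is a hypothetical helper, not part of xgboost.

```python
def to_libsvm_lines(labels, rows, num_col):
    """Format rows as LIBSVM text, pinning the column count on the first line.

    `rows` is a list of {zero_based_column_index: value} dicts. The first
    line always mentions column num_col - 1, with an explicit 0 if the
    row has no value there (assumption: the reader keeps that entry).
    """
    lines = []
    for i, (label, row) in enumerate(zip(labels, rows)):
        entries = {col: val for col, val in row.items() if val != 0}
        if i == 0:
            # Explicit zero in the last column so it is never "missing".
            entries.setdefault(num_col - 1, 0)
        body = ' '.join('{0}:{1}'.format(col, entries[col])
                        for col in sorted(entries))
        lines.append('{0} {1}'.format(label, body).rstrip())
    return lines


lines = to_libsvm_lines([1, 0], [{0: 1.5}, {2: 2.0}], num_col=4)
print('\n'.join(lines))
```

If the files use one-based indices, the pinned index would be num_col rather than num_col - 1.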

이 νŽ˜μ΄μ§€κ°€ 도움이 λ˜μ—ˆλ‚˜μš”?
0 / 5 - 0 λ“±κΈ‰