Scikit-learn: How can I know which features have been selected in FeatureUnion?

Created on 6 Jan 2016  ·  4 comments  ·  Source: scikit-learn/scikit-learn

I ran the code from
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
which contains the following:

# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

I want to know which features were selected when I feed the data into FeatureUnion. The FeatureUnion documentation lists a get_feature_names() method that gets the feature names from all transformers. However, when I call it I get the following error:

AttributeError: Transformer pca does not provide get_feature_names.

I do know that pca has no such method. But then why does FeatureUnion provide it!?
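For the univariate selection half of the union, the fitted selector itself can report which input columns it kept. The following is only a sketch, assuming the iris data and the PCA(n_components=2) / SelectKBest(k=1) settings of the linked example, neither of which is shown in the snippet above. The PCA half outputs components, not original columns, so it has no "selected" features to report.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Assumed setup: iris data, 2 PCA components, 1 univariately selected feature.
iris = load_iris()
X, y = iris.data, iris.target

pca = PCA(n_components=2)
selection = SelectKBest(k=1)
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
X_features = combined_features.fit(X, y).transform(X)

# Look the fitted selector up by the name it was given in the union.
fitted_selection = dict(combined_features.transformer_list)["univ_select"]

# get_support(indices=True) returns the indices of the input columns that were kept.
selected_idx = fitted_selection.get_support(indices=True)
selected_names = [iris.feature_names[i] for i in selected_idx]

print(X_features.shape)  # PCA components plus the selected column(s)
print(selected_names)
```

This sidesteps FeatureUnion.get_feature_names() entirely and asks only the transformer that actually performs selection.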

Most helpful comment

This may not directly solve your specific problem with PCA, but if I read your question correctly, you are wondering more generally how to reach attributes inside a custom pipeline.

Late to the party, but you can access any element within a pipeline, no matter how complex, by walking through the pipeline structure, finding the appropriate step (even inside a FeatureUnion), and then using the appropriate attribute. Here is an example I just ran:

pipeline = Pipeline([
    ('union', FeatureUnion([
        ('categoric', Pipeline([
            ('f_cat', feature_type_split(type='categoric')),  # returns categoric in array for vect
            ('vect', vect),
        ])),
        ('numeric', Pipeline([
            ('f_num', feature_type_split(type='numeric')),
        ])),
    ])),
    ('select', ff),
    ('tree_clf', clf),
])

Displaying the pipeline object itself with print(pipeline) gives you a reference point:

Pipeline(steps=[('union', FeatureUnion(n_jobs=1, transformer_list=[('categoric', Pipeline(steps=[('f_cat', feature_type_split(type='categoric')), ('vect', DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True))])), ('numeric', Pipeline(steps=[('f_num', feature_type...it=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'))])

So I go to the union step via:

pipeline.named_steps['union']

Then I go one level down into transformer_list, whose first entry is the ('categoric', Pipeline) tuple, via:

pipeline.named_steps['union'].transformer_list[0]

Then I go one more level down, to the categoric pipeline object that holds the steps, via:

pipeline.named_steps['union'].transformer_list[0][1]

The result above is an ordinary Pipeline, so named_steps can be used on it again:
print(pipeline.named_steps['union'].transformer_list[0][1].named_steps['vect'])

So the attribute I need is reached via the following (on scikit-learn 1.0 and later, call get_feature_names_out() instead):
print(pipeline.named_steps['union'].transformer_list[0][1].named_steps['vect'].get_feature_names())

TLDR;
With a custom pipeline, walk through the pipeline structure one level at a time, then access the attribute as you normally would on that transformer or estimator.
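The walk described in this comment can be reproduced end to end with a small self-contained sketch. The original feature_type_split, vect, ff, and clf objects are not shown in the thread, so the FunctionTransformer helpers, the toy rows, and the classifier below are stand-in assumptions, and the select step is left out. The name lookup tries get_feature_names_out() first and falls back to get_feature_names() so that it runs on both newer and older scikit-learn releases.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier


def pick_categoric(rows):
    # Stand-in for feature_type_split(type='categoric'): keep the string field.
    return [{"color": r["color"]} for r in rows]


def pick_numeric(rows):
    # Stand-in for feature_type_split(type='numeric'): keep the numeric field.
    return np.array([[r["size"]] for r in rows], dtype=float)


pipeline = Pipeline([
    ("union", FeatureUnion([
        ("categoric", Pipeline([
            ("f_cat", FunctionTransformer(pick_categoric, validate=False)),
            ("vect", DictVectorizer()),
        ])),
        ("numeric", Pipeline([
            ("f_num", FunctionTransformer(pick_numeric, validate=False)),
        ])),
    ])),
    ("tree_clf", DecisionTreeClassifier(random_state=0)),
])

# Toy data, invented for illustration only.
rows = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.0},
    {"color": "red", "size": 3.0},
    {"color": "green", "size": 4.0},
]
y = [0, 1, 0, 1]
pipeline.fit(rows, y)

# Walk down one level at a time, as in the comment above.
union = pipeline.named_steps["union"]
name, categoric = union.transformer_list[0]  # a (name, Pipeline) tuple
vect = categoric.named_steps["vect"]

# Use whichever name method this scikit-learn version provides.
if hasattr(vect, "get_feature_names_out"):
    vect_names = list(vect.get_feature_names_out())
else:
    vect_names = list(vect.get_feature_names())

print(name)
print(vect_names)
```

Unpacking the tuple into name and categoric makes it explicit that each transformer_list entry pairs a label with an estimator, which is why the original comment needs the extra [1] index.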

All 4 comments

I agree there should be a way to identify which features belong to which component, and I proposed this long ago, but I don't think it is currently possible.

On 6 January 2016 at 18:23, genliu777 wrote:

I ran the code from
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
which contains the following:

# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

I want to know which features were selected when I feed the data into FeatureUnion. The FeatureUnion documentation lists a get_feature_names() method that gets the feature names from all transformers. However, when I call it I get the following error:
AttributeError: Transformer pca does not provide get_feature_names.
I do know that pca has no such method, but then why does FeatureUnion provide it!?

Reply to this email directly or view it on GitHub:
https://github.com/scikit-learn/scikit-learn/issues/6122

Why is it currently not possible!? FeatureUnion provides get_feature_names(), so it should work!

Perhaps every model in sklearn that has fit and transform, and that can be put into a FeatureUnion and work there, should provide this method, because the source of get_feature_names() checks if not hasattr(trans, 'get_feature_names'): on each transformer. Otherwise there is no point in FeatureUnion providing get_feature_names() at all!!
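The hasattr check mentioned here is what turns one nameless transformer into an error for the whole union. As a rough sketch of a more tolerant alternative, the helper below (names_or_placeholders, a hypothetical name that is not part of scikit-learn) applies the same kind of check but substitutes placeholder names instead of raising. FirstColumn is likewise a made-up transformer that deliberately exposes no name method, and the iris data is an assumption.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion


class FirstColumn(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that provides no feature-name method."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[:, :1]


def names_or_placeholders(union, X):
    """Collect names per transformer, using placeholders where none exist."""
    names = []
    for name, trans in union.transformer_list:
        if hasattr(trans, "get_feature_names_out"):  # scikit-learn >= 1.0
            found = list(trans.get_feature_names_out())
        elif hasattr(trans, "get_feature_names"):  # older releases
            found = list(trans.get_feature_names())
        else:
            # No name method: count the output columns and number them.
            width = trans.transform(X[:1]).shape[1]
            found = ["col%d" % i for i in range(width)]
        names.extend("%s__%s" % (name, f) for f in found)
    return names


X, y = load_iris(return_X_y=True)
union = FeatureUnion([
    ("first", FirstColumn()),
    ("univ_select", SelectKBest(k=1)),
])
X_features = union.fit_transform(X, y)

names = names_or_placeholders(union, X)
print(names)
```

Prefixing each name with the transformer's label keeps the output aligned, column for column, with what the union's transform returns.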


Try eli5's transform_feature_names, which may work in cases where scikit-learn's get_feature_names does not.
