Scikit-learn: how to know which feature is selected by FeatureUnion?

Created on 6 Jan 2016  ·  4 Comments  ·  Source: scikit-learn/scikit-learn

I ran the code from
http://scikit-learn.org/stable/auto_examples/feature_stacker.html#example-feature-stacker-py
which includes the following:

# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])

# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

With the data passed through FeatureUnion, I want to know which features were selected. The FeatureUnion documentation lists a function get_feature_names() that collects the feature names from all of the transformers, so I called it and got this error:

AttributeError: Transformer pca does not provide get_feature_names.

I know that PCA does not have such a function, but then why does FeatureUnion provide it?
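
One possible workaround for the question above, offered as a minimal sketch rather than an official answer: instead of asking the union for names, look up the fitted selector inside it and call get_support(). The iris data, PCA(n_components=2) and SelectKBest(k=1) are assumptions chosen to mirror the linked example; the variable names fitted and selected_columns are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

# Assumed setup, mirroring the feature-stacker example referenced above.
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
selection = SelectKBest(k=1)

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
X_features = combined_features.fit(X, y).transform(X)

# After fitting, transformer_list holds (name, fitted transformer) pairs,
# so the selector can be looked up by the name it was given in the union.
fitted = dict(combined_features.transformer_list)
selected_columns = fitted["univ_select"].get_support(indices=True)
print("input columns kept by univ_select:", selected_columns)

# PCA builds new components from all input columns, so nothing is "selected";
# components_ shows how each input column is weighted in each component.
print("PCA loadings shape:", fitted["pca"].components_.shape)
```

The columns of X_features follow the order of the transformer list, so here the first two columns would come from PCA and the last one from the selector.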

Most helpful comment

This may not address your particular issue with PCA directly, but if I read into your question correctly, you are wondering in general how to percolate attributes through the custom pipeline.

Late to the party, but you can access elements within the pipeline, regardless of how complicated it is, by walking through the pipeline structure, finding the appropriate step (even within a FeatureUnion), and then using the appropriate attribute. Here is an example I just ran:

pipeline = Pipeline([
    ('union', FeatureUnion([
        ('categoric', Pipeline([
            ('f_cat', feature_type_split(type='categoric')),  # returns categoric in array for vect
            ('vect', vect),
        ])),
        ('numeric', Pipeline([
            ('f_num', feature_type_split(type='numeric')),
        ])),
    ])),
    ('select', ff),
    ('tree_clf', clf),
])

Showing the pipeline object itself via print(pipeline) gives me a point of reference:

Pipeline(steps=[('union', FeatureUnion(n_jobs=1, transformer_list=[('categoric', Pipeline(steps=[('f_cat', feature_type_split(type='categoric')), ('vect', DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True))])), ('numeric', Pipeline(steps=[('f_num', feature_type...it=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best'))])

So I walk through to the union step via:

pipeline.named_steps['union']

Then walk to the next level, which is the first entry of transformer_list (the ('categoric', pipeline) tuple), via:

pipeline.named_steps['union'].transformer_list[0]

Then walk to the next level, which is the categoric pipeline object itself (the second element of that tuple), via:

pipeline.named_steps['union'].transformer_list[0][1]

The above outputs a typical pipeline structure, where we can now utilize named_steps:
print(pipeline.named_steps['union'].transformer_list[0][1].named_steps['vect'])

And therefore access the attribute we need via:
print(pipeline.named_steps['union'].transformer_list[0][1].named_steps['vect'].get_feature_names())

TL;DR:
Walk through the pipeline structure piece by piece with your custom pipeline, and then access the attribute as you would normally for that transform/estimator piece.
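
The walk described in this comment can be reproduced end to end with a small self-contained sketch. Everything data-related here is an assumption: KeySelector is a hypothetical stand-in for the commenter's feature_type_split, the toy rows and labels are made up, and the 'select' step is left out for brevity. Newer scikit-learn releases replaced get_feature_names() with get_feature_names_out(), so the sketch tries the newer name first and falls back to the older one.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.tree import DecisionTreeClassifier


class KeySelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for feature_type_split: keep only the given dict keys."""

    def __init__(self, keys):
        self.keys = keys

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{k: row[k] for k in self.keys} for row in X]


# Made-up toy data: one categorical key and one numeric key per row.
rows = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.0},
    {"color": "red", "size": 3.0},
    {"color": "blue", "size": 4.0},
]
labels = [0, 1, 0, 1]

pipeline = Pipeline([
    ("union", FeatureUnion([
        ("categoric", Pipeline([
            ("f_cat", KeySelector(["color"])),
            ("vect", DictVectorizer()),
        ])),
        ("numeric", Pipeline([
            ("f_num", KeySelector(["size"])),
            ("vect", DictVectorizer()),
        ])),
    ])),
    ("tree_clf", DecisionTreeClassifier(random_state=0)),
]).fit(rows, labels)

# Walk: outer pipeline -> union step -> first (name, transformer) pair -> inner step.
name, categoric = pipeline.named_steps["union"].transformer_list[0]
vect = categoric.named_steps["vect"]

# Prefer the newer accessor and fall back to the older one if it is missing.
getter = getattr(vect, "get_feature_names_out", None) or vect.get_feature_names
names = [str(n) for n in getter()]
print(name, names)
```

Unpacking the (name, transformer) pair makes it explicit that transformer_list[0] is a tuple, and that the inner pipeline is its second element.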

All 4 comments

I agree that there should be a way to see which features belong to which
components, and I proposed this long ago, but I don't think it's
currently possible.

Why is it currently impossible? FeatureUnion provides the function get_feature_names(), so it should work!

Just as nearly all models in sklearn provide fit and transform, every model that can be placed in a FeatureUnion should also provide the attribute that the source code of get_feature_names() checks for with if not hasattr(trans, 'get_feature_names'): . Otherwise, FeatureUnion need not provide get_feature_names() at all!
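
To illustrate the hasattr check mentioned in this comment, here is a hedged sketch of a helper that does not fail when one transformer lacks feature names: it reports names for the transformers that expose them and None for the rest. This is not scikit-learn's own behaviour; names_per_transformer and RowLength are hypothetical names introduced only for this example, and the two input rows are made up.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion


class RowLength(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that exposes no feature names."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[len(row)] for row in X]


def names_per_transformer(union):
    """Map each transformer name to its feature names, or None if it has none."""
    report = {}
    for name, trans in union.transformer_list:
        # Try the newer accessor first, then the older one.
        getter = getattr(trans, "get_feature_names_out", None) or getattr(
            trans, "get_feature_names", None
        )
        report[name] = [str(n) for n in getter()] if getter is not None else None
    return report


rows = [{"a": 1, "b": 2}, {"a": 3}]
union = FeatureUnion([("vect", DictVectorizer()), ("length", RowLength())])
union.fit(rows)

report = names_per_transformer(union)
print(report)
```

Returning None instead of raising keeps the partial information available, at the cost of leaving the caller to decide how to label the unnamed columns.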

Please try eli5's transform_feature_names which can work in cases where scikit-learn's get_feature_names doesn't.
