Scikit-learn: Cannot get feature names after ColumnTransformer

Created on 6 Nov 2018 · 13 Comments · Source: scikit-learn/scikit-learn

When I use ColumnTransformer with pipelines to preprocess different kinds of columns (numeric, categorical, text), I cannot get the feature names of the final transformed data, which makes debugging hard.

Here is the code:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')

data = pd.read_csv(titanic_url)

target = data.pop('survived')

numeric_columns = ['age', 'sibsp', 'parch']
category_columns = ['pclass', 'sex', 'embarked']
text_columns = ['name', 'home.dest']

numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
category_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
text_transformer = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

preprocesser = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_columns),
    ('category', category_transformer, category_columns),
    # only the first text column is passed, as a single string (see point 2 below)
    ('text', text_transformer, text_columns[0])
])

preprocesser.fit_transform(data)

1. preprocesser.get_feature_names() raises an error:
AttributeError: Transformer numeric (type Pipeline) does not provide get_feature_names.
2. In ColumnTransformer, text_transformer can only process a single column given as a string (e.g. 'Sex'), not a list of strings such as text_columns.

pjgao

👍9 👀1

Most helpful comment

This is not an issue about ColumnTransformer.

Point 1 is about Pipeline. Note that eli5 implements a feature names function that can support Pipeline.

Re 2. perhaps you're right that it's unfriendly that we don't have a clean way to apply a text vectorizer to each column. I'm not sure how that can be cleanly achieved, unless we simply start supporting multiple columns of input in CountVectorizer etc.

Thanks for your kind reply!
As far as I know, when a column is preprocessed with a transformer that turns one column into several, such as OneHotEncoder or CountVectorizer, I can get the new column names from the last step of the pipeline by calling get_feature_names. For transformers that do not create new columns, I can just reuse the raw column names.

import numpy as np
from sklearn.pipeline import Pipeline

# Note: get_feature_names is the API of the scikit-learn releases current at the
# time of this thread; later releases replaced it with get_feature_names_out.
def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    for name, transformer, raw_col_name in column_transformer.transformers_:
        if name == 'remainder':  # skip ColumnTransformer's 'remainder' entry, if present
            continue
        if isinstance(transformer, Pipeline):
            # for a pipeline, ask its last step for the names
            transformer = transformer.steps[-1][1]
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # no 'get_feature_names' method: fall back to the raw column names
            names = raw_col_name
        if isinstance(names, np.ndarray):  # some transformers return an ndarray
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):  # a single column given as a string
            col_name.append(names)
    return col_name

Using the code above, I can get my preprocesser's column names.
Does this code solve the problem?
As for eli5, I could not find that function. Can you give me a link to an explicit example or to the API docs for eli5?

pjgao on 6 Nov 2018

👍19 😄2

All 13 comments

This is not an issue about ColumnTransformer.

Point 1 is about Pipeline. Note that eli5 implements a feature names function that can support Pipeline.

Re 2. perhaps you're right that it's unfriendly that we don't have a clean way to apply a text vectorizer to each column. I'm not sure how that can be cleanly achieved, unless we simply start supporting multiple columns of input in CountVectorizer etc.

jnothman on 6 Nov 2018

👍1

With respect to eli5, see transform_feature_names (used by explain_weights)

jnothman on 6 Nov 2018

1 is a duplicate of #6425, right? I want to write a SLEP on that.
I think supporting multiple text columns is pretty easy with ColumnTransformer. It's not the prettiest code, but you could just add a CountVectorizer for each text column.
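
A minimal sketch of that suggestion, for illustration only. It assumes a tiny made-up DataFrame in place of the Titanic data, and the transformer names (text_name, text_home.dest) are hypothetical. Each text column is passed as a single string so that CountVectorizer receives a 1-D sequence of documents. In the real data, missing values in a text column would have to be filled first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for the two text columns of the Titanic data (assumption).
df = pd.DataFrame({
    "name": ["Allen Miss", "Braund Mr"],
    "home.dest": ["St Louis", "New York"],
})
text_columns = ["name", "home.dest"]

# One CountVectorizer per text column; the column is given as a string,
# not a list, so the vectorizer sees a 1-D series of documents.
preprocessor = ColumnTransformer(
    transformers=[(f"text_{col}", CountVectorizer(), col) for col in text_columns]
)

X = preprocessor.fit_transform(df)
print(X.shape)  # one output column per distinct token in each text column
```

Building the transformer list with a comprehension keeps this short even when there are many text columns.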

And your snippet doesn't really solve the issue, because the absence of get_feature_names doesn't mean you can just use the column names.
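
To make that objection concrete, here is a small sketch using PCA as the example (my choice of transformer, not one named in the thread) on made-up numbers. PCA has no get_feature_names method, so the snippet's fallback would return the raw column names, yet the output has a different number of columns.

```python
import numpy as np
from sklearn.decomposition import PCA

raw_col_name = ["age", "sibsp", "parch"]  # names of the input columns
X = np.array([
    [22.0, 1.0, 0.0],
    [38.0, 1.0, 0.0],
    [26.0, 0.0, 0.0],
    [35.0, 1.0, 2.0],
])  # made-up values, for illustration only

pca = PCA(n_components=2).fit(X)
n_output_columns = pca.transform(X).shape[1]

# No get_feature_names here, so the fallback branch would be taken...
has_old_api = hasattr(pca, "get_feature_names")

# ...but the raw names do not describe the output: 3 names for 2 columns.
print(has_old_api, len(raw_col_name), n_output_columns)
```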

amueller on 7 Nov 2018

Yes. After a pandas DataFrame is fed into a preprocessing pipeline, it would be better to be able to get the feature names, so that we can tell exactly what happened just from the generated data.

pjgao on 7 Nov 2018

👍1

ok, closing as duplicate.

amueller on 7 Nov 2018

I made a tiny enhancement to get names like rawname_value back for one-hot encoded columns:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    for name, transformer, raw_col_name in column_transformer.transformers_:
        if name == 'remainder':  # skip ColumnTransformer's 'remainder' entry, if present
            continue
        if isinstance(transformer, Pipeline):
            transformer = transformer.steps[-1][1]
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # no 'get_feature_names' method: use the raw column names
            names = raw_col_name
        if isinstance(names, np.ndarray):
            names = names.tolist()
        elif isinstance(names, str):
            names = [names]
        if isinstance(transformer, OneHotEncoder):
            # default one-hot names look like 'x0_female': replace the positional
            # prefix 'x<i>' with the i-th raw column name to get 'sex_female'
            renamed = []
            for feature in names:
                prefix, value = feature.split('_', 1)  # split once: values may contain '_'
                renamed.append(raw_col_name[int(prefix[1:])] + '_' + value)
            names = renamed
        col_name += names
    return col_name

miemiekurisu on 21 May 2020

What if you apply SimpleImputer with add_indicator=True in a pipeline? This approach won't work.
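
A short sketch of why, on made-up numbers: with add_indicator=True the imputer appends one indicator column for each input column that had missing values during fit, so a pipeline can return more columns than it received and the raw column names no longer line up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

raw_col_name = ["age", "fare"]
X = np.array([
    [1.0, np.nan],
    [2.0, 5.0],
    [3.0, 6.0],
])  # made-up values; only the second column has a missing entry

pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("scale", StandardScaler()),
])
Xt = pipe.fit_transform(X)

# Indices of the input columns that received a missing-value indicator.
flagged = list(pipe.named_steps["impute"].indicator_.features_)

print(len(raw_col_name), Xt.shape[1], flagged)
```

The last step here (StandardScaler) has no get_feature_names in the releases discussed in this thread, so the helper above would fall back to two raw names for three output columns.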

nickcorona on 31 May 2020

It would be nice to have a get_feature_names method for this configuration.

kylegilde on 1 Jun 2020

Here is my contribution to the short-term solution. It coerces all the different array types to lists, and it handles the case of SimpleImputer(add_indicator=True). It's also a little more verbose.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []

    for name, transformer, raw_col_name in column_transformer.transformers_:
        if name == 'remainder':  # skip ColumnTransformer's 'remainder' entry, if present
            continue
        print('\n\ntransformer: ', name)

        # a single column may be given as a bare string: wrap it rather than
        # letting list() split it into characters
        raw_col_name = [raw_col_name] if isinstance(raw_col_name, str) else list(raw_col_name)

        if isinstance(transformer, Pipeline):
            # if pipeline, get the last transformer
            transformer = transformer.steps[-1][1]

        try:
            if isinstance(transformer, OneHotEncoder):
                names = list(transformer.get_feature_names(raw_col_name))

            elif isinstance(transformer, SimpleImputer) and transformer.add_indicator:
                # indices of the input columns that got a missing-value indicator
                missing_indicator_indices = transformer.indicator_.features_
                missing_indicators = [raw_col_name[idx] + '_missing_flag' for idx in missing_indicator_indices]

                names = raw_col_name + missing_indicators

            else:
                names = list(transformer.get_feature_names())

        except AttributeError:  # no 'get_feature_names' method: use the raw column names
            names = raw_col_name

        print(names)

        col_name.extend(names)

    return col_name

kylegilde on 8 Jun 2020

👍4

FYI, I wrote some code and a blog post about how to extract the feature names from complex Pipelines & ColumnTransformers. The code is an improvement over my previous post. https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4

kylegilde on 10 Sep 2020

👍4

@kylegilde Great article and thanks for the code. Works like a charm. For global explanations I had been wrestling with KernelSHAP and alibi for some hours but couldn't get my one-hot transformer working without handle_unknown='ignore'.

jobvisser03 on 21 Sep 2020

👍1

Here is another version of @pjgao's snippet that also includes the columns from remainder:

import numpy as np
from sklearn.pipeline import Pipeline

def get_columns_from_transformer(column_transformer, input_columns):
    col_name = []

    for name, transformer, raw_col_name in column_transformer.transformers_:
        if name == 'remainder':  # handled separately below
            continue
        if isinstance(transformer, Pipeline):
            transformer = transformer.steps[-1][1]
        try:
            names = transformer.get_feature_names(raw_col_name)
        except TypeError:  # e.g. CountVectorizer.get_feature_names() takes no input names
            names = transformer.get_feature_names()
        except AttributeError:  # no 'get_feature_names' method: use the raw column names
            names = raw_col_name
        if isinstance(names, np.ndarray):
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)

    # add passed-through remainder columns; this assumes the remainder entry
    # stores integer column indices into input_columns
    last_name, last_transformer, remainder_columns = column_transformer.transformers_[-1]
    if last_name == 'remainder' and isinstance(last_transformer, str) and last_transformer == 'passthrough':
        for col_idx in remainder_columns:
            col_name.append(input_columns[col_idx])

    return col_name

What do you think about adding a similar function to the core codebase?