Scikit-learn: Cannot get feature names after ColumnTransformer

Created on 6 Nov 2018  ·  13 Comments  ·  Source: scikit-learn/scikit-learn

When I use ColumnTransformer with pipelines to preprocess different kinds of columns (numeric, categorical, text), I cannot get the feature names of the final transformed data, which makes debugging hard.

Here is the code:

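import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')

data = pd.read_csv(titanic_url)

target = data.pop('survived')

numeric_columns = ['age', 'sibsp', 'parch']
category_columns = ['pclass', 'sex', 'embarked']
text_columns = ['name', 'home.dest']

# numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# categorical columns: fill missing values with a constant, then one-hot encode
category_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
text_transformer = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

preprocesser = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_columns),
    ('category', category_transformer, category_columns),
    ('text', text_transformer, text_columns[0])  # only a single text column works here
])

preprocesser.fit_transform(data)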
  1. preprocesser.get_feature_names() raises an error:
    AttributeError: Transformer numeric (type Pipeline) does not provide get_feature_names.
  2. In ColumnTransformer, text_transformer can only process a single column given as a string (e.g. 'Sex'), not a list of strings such as text_columns.

Most helpful comment

This is not an issue about ColumnTransformer.

Re 1. this is about Pipeline, which does not provide get_feature_names. Note that eli5 implements a feature names function that can support Pipeline.

Re 2. perhaps you're right that it's unfriendly that we don't have a clean way to apply a text vectorizer to each column. I'm not sure how that can be cleanly achieved, unless we simply start supporting multiple columns of input in CountVectorizer etc.

Thanks for your kind reply!
As far as I know, when I preprocess a column with a transformer that turns one column into several, such as OneHotEncoder or CountVectorizer, I can get the new column names from the last step of the pipeline with get_feature_names. For transformers that do not create new columns, I can just use the raw column names.

import numpy as np
from sklearn.pipeline import Pipeline

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    for name, transformer, raw_col_name in column_transformer.transformers_:
        # skip ColumnTransformer's 'remainder' entry; it is only present
        # when some input columns are not used by any transformer
        if name == 'remainder':
            continue
        if isinstance(transformer, Pipeline):
            transformer = transformer.steps[-1][1]  # last step of the pipeline
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # e.g. the array returned by OneHotEncoder
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):  # a single column selected by a string
            col_name.append(names)
    return col_name

Using the code above, I can get my preprocesser's column names.
Does this code solve the problem?
As for eli5, I could not find that function. Can you give me a link to an explicit example or to the eli5 API docs?

All 13 comments

With respect to eli5, see transform_feature_names (used by explain_weights)

1 is a duplicate of #6425, right? I want to write a SLEP on that.
I think supporting multiple text columns is pretty easy with ColumnTransformer. It's not the prettiest code, but you could just add a CountVectorizer for each text column.

And your snippet doesn't really solve the issue, because a transformer having no get_feature_names doesn't mean you can just use the column names.
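
For illustration, here is a minimal sketch of the "one CountVectorizer per text column" idea. The toy DataFrame and the 'text_' name prefix are assumptions made for this example, not part of the original issue. Missing values in a text column would need to be filled first, since CountVectorizer rejects NaN documents.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy data standing in for the two text columns of the issue (values are made up).
data = pd.DataFrame({
    'name': ['Allen, Miss Elisabeth', 'Allison, Master Hudson'],
    'home.dest': ['St Louis, MO', 'Montreal, PQ'],
})
text_columns = ['name', 'home.dest']

# One CountVectorizer per text column. Each column is selected with a plain
# string, so the vectorizer receives a 1-D sequence of documents.
text_preprocessor = ColumnTransformer(transformers=[
    ('text_' + column, CountVectorizer(), column) for column in text_columns
])

X = text_preprocessor.fit_transform(data)

# Each fitted vectorizer keeps its own vocabulary; the output columns are the
# vocabularies placed side by side, in the order of the transformers.
n_name_terms = len(text_preprocessor.named_transformers_['text_name'].vocabulary_)
n_dest_terms = len(text_preprocessor.named_transformers_['text_home.dest'].vocabulary_)
print(X.shape, n_name_terms, n_dest_terms)
```

The list comprehension keeps the code short when there are many text columns, at the cost of one vocabulary per column rather than a shared one.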

Yes, after a pandas DataFrame is fed into a preprocessing pipeline, it would be better to be able to get the feature names, so that one can tell exactly what happened from the generated data alone.

ok, closing as duplicate.

I made a tiny enhancement to get names of the form rawname_value back for one-hot encoded columns:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    for name, transformer, raw_col_name in column_transformer.transformers_:
        # skip ColumnTransformer's 'remainder' entry, if present
        if name == 'remainder':
            continue
        if isinstance(transformer, Pipeline):
            transformer = transformer.steps[-1][1]  # last step of the pipeline
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        else:
            if isinstance(transformer, OneHotEncoder):
                # default one-hot names look like 'x0_male': replace the 'x<i>'
                # prefix with the i-th raw column name, giving e.g. 'sex_male'
                exchange_name = [_.split("_", 1) for _ in names]
                names = [str(raw_col_name[int(pre_name[1:])]) + "_" + value
                         for pre_name, value in exchange_name]
        if isinstance(names, np.ndarray):  # e.g. an array of feature names
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):  # a single column selected by a string
            col_name.append(names)
    return col_name

What if you apply SimpleImputer with add_indicator=True in a pipeline? This approach won't work.
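
A small sketch of why the raw-column fallback breaks in this case. The array below is made-up toy data. With add_indicator=True the imputer appends one indicator column for every input column that had missing values during fit, so the output has more columns than there are raw column names.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy input with two columns; only the second one contains a missing value.
X = np.array([[1.0, np.nan],
              [2.0, 5.0],
              [3.0, 7.0]])

imputer = SimpleImputer(strategy='median', add_indicator=True)
Xt = imputer.fit_transform(X)

# Indices of the input columns that received a missing-value indicator.
flagged_columns = list(imputer.indicator_.features_)

print(X.shape, Xt.shape, flagged_columns)
```

Reusing the two raw column names for Xt would leave the extra indicator column unnamed, which is the mismatch pointed out above.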

It would be nice to have a get_feature_names method for this configuration.

Here is my contribution to the short-term solution. It coerces all the different array types to lists, and it handles the case of SimpleImputer(add_indicator=True). It's also a little more verbose.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []

    for name, transformer, raw_col_name in column_transformer.transformers_:
        # skip ColumnTransformer's 'remainder' entry, if present
        if name == 'remainder':
            continue
        print('\n\ntransformer: ', name)

        # a single column given as a string must not be split into characters
        raw_col_name = [raw_col_name] if isinstance(raw_col_name, str) else list(raw_col_name)

        if isinstance(transformer, Pipeline):
            # if pipeline, get the last transformer
            transformer = transformer.steps[-1][1]

        try:
            if isinstance(transformer, OneHotEncoder):
                # pass the raw names so the output reads 'rawname_value'
                names = list(transformer.get_feature_names(raw_col_name))

            elif isinstance(transformer, SimpleImputer) and transformer.add_indicator:
                # one extra flag column per input column that had missing values
                missing_indicator_indices = transformer.indicator_.features_
                missing_indicators = [raw_col_name[idx] + '_missing_flag' for idx in missing_indicator_indices]

                names = raw_col_name + missing_indicators

            else:
                names = list(transformer.get_feature_names())

        except AttributeError:  # no get_feature_names: fall back to the raw column names
            names = raw_col_name

        print(names)

        col_name.extend(names)

    return col_name

FYI, I wrote some code and a blog about how to extract the feature names from complex Pipelines & ColumnTransformers. The code is an improvement over my previous post. https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4

@kylegilde Great article, and thanks for the code. Works like a charm. For global explanations I had been wrestling with KernelSHAP and alibi for some hours, but couldn't get my one-hot transformer working without handle_unknown='ignore'.

Here is another version of @pjgao's snippet that includes the columns from the remainder:

import numpy as np
from sklearn.pipeline import Pipeline

def get_columns_from_transformer(column_transformer, input_columns):
    col_name = []

    for name, transformer, raw_col_name in column_transformer.transformers_:
        # the 'remainder' entry is handled separately below
        if name == 'remainder':
            continue
        if isinstance(transformer, Pipeline):
            transformer = transformer.steps[-1][1]  # last step of the pipeline
        try:
            names = transformer.get_feature_names(raw_col_name)
        except TypeError:  # e.g. CountVectorizer.get_feature_names takes no arguments
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # e.g. the array returned by OneHotEncoder
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):  # a single column selected by a string
            col_name.append(names)

    # append the remainder columns only when they are passed through, not dropped;
    # this assumes the remainder columns are stored as integer indices
    name, transformer, remainder_columns = column_transformer.transformers_[-1]
    if name == 'remainder' and isinstance(transformer, str) and transformer == 'passthrough':
        for col_idx in remainder_columns:
            col_name.append(input_columns[col_idx])

    return col_name

What do you think about adding a similar function to the core codebase?
