When I use ColumnTransformer with pipelines to preprocess different kinds of columns (numeric, categorical, text), I cannot get the feature names of the final transformed data, which makes debugging hard.
Here is the code:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)
target = data.pop('survived')

numeric_columns = ['age', 'sibsp', 'parch']
category_columns = ['pclass', 'sex', 'embarked']
text_columns = ['name', 'home.dest']

numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
category_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
text_transformer = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])

preprocesser = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_columns),
    ('category', category_transformer, category_columns),
    # only a single column name (a string) works here, see point 2 below
    ('text', text_transformer, text_columns[0])
])
preprocesser.fit_transform(data)
preprocesser.get_feature_names()  # raises the AttributeError quoted below
1. The last call raises an error: AttributeError: Transformer numeric (type Pipeline) does not provide get_feature_names.
2. In ColumnTransformer, text_transformer can only process a single column given as a string (e.g. 'Sex'), not a list of strings such as text_columns.
Re 1: this is not an issue about ColumnTransformer; it is about Pipeline. Note that eli5 implements a feature names function that can support Pipeline.
Re 2: perhaps you're right that it's unfriendly that we don't have a clean way to apply a text vectorizer to each column. I'm not sure how that can be cleanly achieved, unless we simply start supporting multiple columns of input in CountVectorizer etc.
Thanks for your kind reply!
As far as I know, when I preprocess a column with a transformer that turns one column into several, such as OneHotEncoder or CountVectorizer, I can get the new column names from the last step of the pipeline by calling get_feature_names. For transformers that do not create new columns, I can just use the raw column names.
import numpy as np
from sklearn.pipeline import Pipeline

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            # for a pipeline, look at its last step
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # names returned as a NumPy array
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name
Using the above code, I can get my preprocesser's column names.
Does this code solve the problem?
As for eli5, I cannot find that function. Can you give me a link to an explicit example or to the eli5 API?
With respect to eli5, see transform_feature_names (used by explain_weights)
1 is a duplicate of #6425, right? I want to write a SLEP on that.
I think supporting multiple text columns is pretty easy with ColumnTransformer. It's not the prettiest code, but you could just add a CountVectorizer for each text column.
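The "one CountVectorizer per text column" idea can be sketched roughly as follows. This is only an illustration under assumptions: the two-row DataFrame is made up, the transformer names ("text_name", "text_home.dest") are arbitrary, and the text columns are assumed to contain no missing values (CountVectorizer does not accept NaN, so real data would need filling first).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy data standing in for the Titanic text columns (made up for illustration).
df = pd.DataFrame({
    "name": ["Allen, Miss Elisabeth", "Allison, Master Hudson"],
    "home.dest": ["St Louis, MO", "Montreal, PQ"],
})
text_columns = ["name", "home.dest"]

# One CountVectorizer per text column. Each column is selected by a plain
# string, so the vectorizer receives a 1-D sequence of documents.
text_preprocessor = ColumnTransformer(transformers=[
    ("text_" + col, CountVectorizer(), col) for col in text_columns
])

X = text_preprocessor.fit_transform(df)

# Each fitted vectorizer keeps its own vocabulary.
name_vocab = text_preprocessor.named_transformers_["text_name"].vocabulary_
dest_vocab = text_preprocessor.named_transformers_["text_home.dest"].vocabulary_
print(X.shape, len(name_vocab), len(dest_vocab))
```

The output width is the sum of the two vocabulary sizes, since ColumnTransformer stacks the per-column results side by side.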
And your snippet doesn't really solve the issue, because a transformer having no get_feature_names doesn't mean you can just use the column names.
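A small illustration of that last point, as a sketch only: PCA is used here as an assumed stand-in for "a transformer without get_feature_names" (it is not part of the original example), and the input is random data rather than the Titanic columns.

```python
import numpy as np
from sklearn.decomposition import PCA

raw_col_name = ["age", "sibsp", "parch"]

# Random numbers standing in for the three numeric columns (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(20, len(raw_col_name)))

# PCA mixes the input columns and keeps only two components, so the raw
# column names no longer describe the output columns.
X_t = PCA(n_components=2).fit_transform(X)

print(X_t.shape[1], "output columns vs", len(raw_col_name), "raw names")
```

Falling back to the raw names here would label two mixed components with three unrelated column names.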
Yes. After a pandas DataFrame is fed into a preprocessing pipeline, it would be better to be able to get the feature names, so that you can tell exactly what happened just from the generated data.
ok, closing as duplicate.
I made a tiny enhancement to get back names like rawname_value for one-hot encoded columns:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        else:
            if isinstance(transformer, OneHotEncoder):
                # OneHotEncoder names look like 'x0_male': replace each
                # positional prefix with the matching raw column name
                raw_col_name_reverse = list(raw_col_name)[::-1]
                renamed = []
                last_pre_name = ""
                last_raw_name = ""
                for pre_name, value in (name.split("_", 1) for name in names):
                    if pre_name != last_pre_name:
                        last_pre_name = pre_name
                        last_raw_name = raw_col_name_reverse.pop()
                    renamed.append(last_raw_name + "_" + value)
                names = renamed
        if isinstance(names, np.ndarray):  # names returned as a NumPy array
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name
What if you apply SimpleImputer with add_indicator in a pipeline? This approach won't work.
It would be nice to have a get_feature_names method for this configuration.
Here is my contribution to the short-term solution. It coerces all the different array types to lists, and it handles the case of SimpleImputer(add_indicator=True). It's also a little more verbose.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        print('\n\ntransformer: ', transformer_in_columns[0])
        raw_col_name = transformer_in_columns[2]
        # a single column may be given as a bare string; wrap it so that
        # list() does not split it into characters
        raw_col_name = [raw_col_name] if isinstance(raw_col_name, str) else list(raw_col_name)
        if isinstance(transformer_in_columns[1], Pipeline):
            # if pipeline, get the last transformer
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            if isinstance(transformer, OneHotEncoder):
                names = list(transformer.get_feature_names(raw_col_name))
            elif isinstance(transformer, SimpleImputer) and transformer.add_indicator:
                # indices of the input features that received a missing-value flag
                missing_indicator_indices = transformer.indicator_.features_
                missing_indicators = [raw_col_name[idx] + '_missing_flag' for idx in missing_indicator_indices]
                names = raw_col_name + missing_indicators
            else:
                names = list(transformer.get_feature_names())
        except AttributeError:  # no 'get_feature_names': fall back to raw column names
            names = raw_col_name
        print(names)
        col_name.extend(names)
    return col_name
FYI, I wrote some code and a blog about how to extract the feature names from complex Pipelines & ColumnTransformers. The code is an improvement over my previous post. https://towardsdatascience.com/extracting-plotting-feature-names-importance-from-scikit-learn-pipelines-eb5bfa6a31f4
@kylegilde Great article and thanks for the code. Works like a charm. For global explanations I had been wrestling with KernelSHAP and alibi for some hours, but couldn't get my one-hot transformer working without handle_unknown='ignore'.
Here is another version of @pjgao's snippet that includes the columns from the remainder:
import numpy as np
from sklearn.pipeline import Pipeline

def get_columns_from_transformer(column_transformer, input_columns):
    col_name = []
    # the last transformer is ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names(raw_col_name)
        except TypeError:  # e.g. CountVectorizer.get_feature_names takes no arguments
            names = transformer.get_feature_names()
        except AttributeError:  # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):  # names returned as a NumPy array
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    _, remainder_transformer, remainder_columns = column_transformer.transformers_[-1]
    # the remaining columns only appear in the output when they are not dropped
    if remainder_transformer != 'drop':
        for col_idx in remainder_columns:
            col_name.append(input_columns[col_idx])
    return col_name
What do you think about adding a similar function to the core codebase?