Scikit-learn: LabelBinarizer and LabelEncoder fit and transform signatures not compatible with Pipeline

Created on 26 Apr 2014  ·  6 Comments  ·  Source: scikit-learn/scikit-learn

I get this error when I try to use LabelBinarizer and LabelEncoder in a Pipeline:

sklearn/pipeline.pyc in fit_transform(self, X, y, **fit_params)
    141         Xt, fit_params = self._pre_transform(X, y, **fit_params)
    142         if hasattr(self.steps[-1][-1], 'fit_transform'):
--> 143             return self.steps[-1][-1].fit_transform(Xt, y, **fit_params)
    144         else:
    145             return self.steps[-1][-1].fit(Xt, y, **fit_params).transform(Xt)

TypeError: fit_transform() takes exactly 2 arguments (3 given)

It seems like this is because these classes' fit and transform signatures differ from most other estimators' and accept only a single argument.

I think this is a pretty easy fix (just change the signature to fit(self, X, y=None)) that I'd be happy to send a pull request for, but I wanted to check whether there are any other reasons, which I haven't thought of, for the signatures being the way they are.
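For context, a minimal sketch that reproduces the error (the step name and toy data are illustrative): Pipeline passes both X and y to the last step's fit_transform, while LabelBinarizer.fit_transform accepts only a single array.

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import LabelBinarizer

    pipe = Pipeline([('binarize', LabelBinarizer())])
    # Pipeline calls fit_transform(Xt, y, **fit_params) on the last step,
    # but LabelBinarizer.fit_transform(y) takes a single array -> TypeError.
    pipe.fit_transform(['a', 'b', 'a'])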

Labels: API


All 6 comments

I think you're right to fix that.


In #3113 we have decided this is not to be fixed because label encoding doesn't really belong in a Pipeline.

@jnothman, just to know: what should I be doing instead if I happen to need to vectorize a categorical feature in a pipeline?

You might be best off writing your own Pipeline-like code (perhaps inheriting from the existing class) to handle your specific case.

Instead of using LabelBinarizer in a pipeline I just implemented my own transformer:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import LabelBinarizer

    class CustomBinarizer(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None, **fit_params):
            # Fit the binarizer once here, so that train and test data
            # are encoded against the same set of classes.
            self.binarizer_ = LabelBinarizer().fit(X)
            return self
        def transform(self, X):
            return self.binarizer_.transform(X)

Seems to do the trick!
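As a hedged usage sketch (the step name and toy data are made up): because fit now accepts the extra y argument, the transformer drops into a Pipeline without the TypeError.

    from sklearn.pipeline import Pipeline

    pipe = Pipeline([('binarize', CustomBinarizer())])
    # Three distinct classes -> a 3x3 indicator matrix.
    Xt = pipe.fit_transform(['red', 'green', 'blue'])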

Edit: this is a better solution:
https://github.com/scikit-learn/scikit-learn/pull/7375/files#diff-1e175ddb0d84aad0a578d34553f6f9c6

I see that there have been a lot of negative reactions on this page. I think there has been a long misunderstanding of the purpose of LabelBinarizer and LabelEncoder. These are for targets, not features. Although admittedly they were designed (and poorly named) before my time.
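To make the intended use concrete, a minimal sketch (with toy labels) of the single-argument, target-oriented API these classes were designed for:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    # fit/transform take only y, matching their intended use on targets:
    y = le.fit_transform(['cat', 'dog', 'cat'])   # array([0, 1, 0])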

Although I think users could have been using CountVectorizer (or DictVectorizer with dataframe.to_dict(orient='records') if you're coming from a dataframe) for this purpose for a long time, we have recently merged a CategoricalEncoder (#9151) into master, although this may be rolled into OneHotEncoder and a new OrdinalEncoder before release (#10521).
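As a sketch of the DictVectorizer route mentioned above (the dataframe and column name are made up for illustration):

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    df = pd.DataFrame({'color': ['red', 'green', 'red']})
    vec = DictVectorizer(sparse=False)
    # Each (column, category) pair becomes its own indicator feature,
    # e.g. 'color=green' and 'color=red'.
    X = vec.fit_transform(df.to_dict(orient='records'))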

I hope this satisfies the needs of a clearly disgruntled populace.

I must say that, as someone who has been volunteering enormous quantities of free time to the development of this project for nearly five years now (and has recently been employed to work on it too), seeing the magnitude of negative reactions, rather than constructive contributions to the library, is quite saddening. Admittedly, my response above that you should write a new Pipeline-like thing, rather than a new transformer for categorical inputs, was a misunderstanding on my part (and should/could have been corrected by others), which I hope is understandable given the enormous workload that is maintaining this project.
