Scikit-learn: Rethinking the CategoricalEncoder API?

Created on 23 Jan 2018  ·  63 Comments  ·  Source: scikit-learn/scikit-learn

Based on some discussions we are having here and issues that are opened, we have some doubts that CategoricalEncoder (https://github.com/scikit-learn/scikit-learn/pull/9151) was the right choice of name (and since it is not released yet, we have some room for change).

So, a summary of how it is now:

  • The class name CategoricalEncoder says what type of data it accepts (categorical data)
  • The keyword argument encoding specifies how to encode those data

Currently we already have encoding='onehot'|'onehot-dense'|'ordinal'.
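For readers who haven't followed the PR, here is a minimal sketch of that API as it currently sits in master (CategoricalEncoder is not in any release, so this only runs against a dev build):

    import numpy as np
    from sklearn.preprocessing import CategoricalEncoder  # master only

    X = np.array([['low', 'red'], ['high', 'blue'], ['low', 'blue']],
                 dtype=object)

    # One class; the encoding strategy is selected via a keyword:
    enc = CategoricalEncoder(encoding='onehot-dense')
    print(enc.fit_transform(X))  # one 0/1 column per category per feature
    print(enc.categories_)       # categories inferred from unique values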

But what to do in the following cases:

  • We want to add more encoding options (e.g. binary encoding, mean target encoding, unary encoding, ...). Do we keep adding those as new values for the encoding kwarg in the one big CategoricalEncoder class?
  • We want to add an option specific to one of the encodings (e.g. for 'onehot' encoding, to drop the first (redundant) column; or for 'ordinal' encoding, to base the order of the categories on the frequency, ...). The problem here is that we then need to add additional keyword arguments to CategoricalEncoder that are or are not active depending on what you passed for the encoding kwarg, which is not the nicest API design.

For that last problem, we already had this with the sparse=True/False option, which was only relevant for 'onehot' and not for 'ordinal', and which we solved by having both 'onehot' and 'onehot-dense' encoding options instead of a sparse keyword. But such an approach does not scale.

Related to this, there is a PR to add a UnaryEncoder (https://github.com/scikit-learn/scikit-learn/pull/8652). There was a related discussion on the naming in that PR, as currently the name says how it encodes, not what type of data it gets (in the current design, it accepts already encoded integers, not actual categorical data. In that regard, to be consistent with CategoricalEncoder, it might better be named OrdinalEncoder because it needs ordinal data as input).


What are the options going forward?

1) Keep things as we have them now in master, and be OK with adding some new options to the single class (an important question, which is hard to answer now, is how many new features we will want to add in the future).
2) Switch the naming scheme and have a bunch of 'categorical encoders' where the name says how it encodes (OneHotEncoder, OrdinalEncoder, and later maybe BinaryEncoder, UnaryEncoder, ...)

So it is a bit of a trade-off between a potential build-up in the number of classes vs. the number of keyword arguments in a single class.


One problem with the second approach (and one of the reasons we went with CategoricalEncoder in the first place, even before we added the multiple encoding options) is that there is already a OneHotEncoder, which has a different API than the CategoricalEncoder. And there is not really another good name we could use for the encoder that does one-hot encoding.
However, I think that, with some temporary ugly hacks, we could reuse the name, if we are OK with deprecating the current attributes (and I think we agree they are not the most useful attributes). The idea would be that if you fit the class with string data, you get the new behaviour, and if you fit the class with integer data, you get a deprecation warning indicating that the default behaviour will change (and indicating which keyword to specify to get rid of the warning).

cc @jnothman @amueller @GaelVaroquaux @rth


All 63 comments

Thanks for the summary @jorisvandenbossche. I think I am in favor of option 2: reuse the OneHotEncoder class, deprecate the weird attributes, and add a constructor parameter to select the behavior, with a future warning that says the default behavior will change but makes it easy to silence that warning just by passing a value for that option.

The idea of reverting CategoricalEncoder makes me quite sad, but I think you're right that future users would be less mystified by option 2. My main concern is that we have tried implementing this as a change to OHE for a long time and it never flew. Perhaps it would be good to attempt the modifications to the OneHotEncoder docstring according to that proposed change, so we can see if it looks sane.

+1 to what Joel said

The idea of reverting CategoricalEncoder makes me quite sad

To be clear, it would not be a revert, it would be a refactor / rename that keeps all functionality!
But I also like the "CategoricalEncoder" name, that would indeed be sad.

That said, I will quickly try to do the changes to get an idea of how feasible it is to integrate this in OneHotEncoder.

OK, I opened a PR with a proof of concept: https://github.com/scikit-learn/scikit-learn/pull/10523.
It's not yet complete (no deprecation warnings yet, and the new attributes are not yet calculated in the old behaviour).

The main API question is about the format of the input data.
So as a recap, there are two different ways we currently process the categorical data:

1) As actual, not yet encoded (integer or string), categorical data (how it is done in CategoricalEncoder) -> infer categories from the unique values in the training data
2) As integer, already encoded data (how it is done in the current OneHotEncoder) -> infer categories from the maximum value in the training data

The question is: do we find both cases worth supporting? Thus, in the potentially merged OneHotEncoder, do we keep the ability to do both, or do we fully deprecate and then remove the ability to process ordinal input?

If we want the ability to process both, we can add a boolean keyword to specify the input data type (for now I use encoded_input=False/True, but other ideas are ordinal_input, ...)

For the deprecation period, we have to support both anyway, and also have to introduce a keyword to choose the behaviour (to be able to silence the warning and choose the new behaviour).
So in principle we could just keep the keyword afterwards.

Given that we want to handle both, here is an overview of how OneHotEncoder would work (a rough sketch in code follows the list):

  • for now encoded_input=None, and we infer the default based on the data
  • if the data is int-like (handled before by OneHotEncoder), encoded_input is internally set to True and a deprecation warning is raised. If the user wants to keep the current behaviour, they can manually specify OneHotEncoder(encoded_input=True) to silence the warning.
  • if the input is not int-like, we set encoded_input internally to False and, without warning, use the new behaviour (= the current CategoricalEncoder behaviour)
  • in the future we change the default of encoded_input from None to False (by default the new behaviour, also for int-like data)
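A rough sketch in code of that inference logic (the helper and its name are hypothetical; encoded_input is the provisional keyword from above, not settled API):

    import warnings
    import numpy as np

    def _resolve_encoded_input(X, encoded_input=None):
        # Hypothetical helper mirroring the proposed defaults above.
        if encoded_input is not None:
            return encoded_input  # user chose explicitly -> no warning
        if np.issubdtype(np.asarray(X).dtype, np.integer):
            # int-like data: keep the old behaviour, but warn
            warnings.warn(
                "The default handling of integer data will change; pass "
                "encoded_input=True to keep the current behaviour and "
                "silence this warning.", FutureWarning)
            return True
        return False  # new CategoricalEncoder behaviour, no warning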

I'm still not sure what you're suggesting is the practical difference due to inferring categories from the max value.

@jnothman I suppose you acknowledge there can be a difference in practice? (the output you get depending on the data you have)

But whether this difference is important in practice, I don't know. That's where I would like to see feedback: whether anybody actually wants this "max value"-based method, or whether we are fine with (in the future, after deprecation) only having the "unique values"-based method.

I think I personally would never need this max-value based method, but the OneHotEncoder has been like that for many years (for good reason or not?).

Actually deprecating the max-value based categorization would certainly make the implementation (after deprecation) simpler.
And if we choose that route, I agree the option should be legacy_mode=True/False rather than encoded_input/ordinal_input

Remind me what the actual difference in output is, when n_values='auto', please? I had thought the active_features_ thing made them basically identical, but I'm probably forgetting something.

Aha, that clarifies our misunderstanding :-)
I misunderstood how the current OneHotEncoder is actually working. Suppose you have one feature with values [2, 3, 5, 2]. I thought the current OneHotEncoder would have categories [0, 1, 2, 3, 4, 5] (while current CategoricalEncoder would have categories [2, 3, 5]). But you are right that the active_features_ is also only [2, 3, 5], essentially making them the same with the default value of n_values='auto'.
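To make that concrete with the released (0.19-era) OneHotEncoder (these are exactly the attributes slated for deprecation, so treat this as a sketch of the old behaviour):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    X = np.array([[2], [3], [5], [2]])
    enc = OneHotEncoder(sparse=False).fit(X)  # n_values='auto' by default
    print(enc.active_features_)  # [2 3 5]: columns for unseen 0, 1, 4 dropped
    print(enc.transform(X))      # 3 columns, as if categories were [2, 3, 5]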

So it is only the case where you pass an integer to n_values (like n_values=6 for categories=[0, 1, 2, 3, 4, 5] in the above case) to specify the number of categories that will actually be an API change (deprecated / removed).
And that will be easily replaceable by the user with categories=range(6)

Sorry for the confusion.
In that light, I think we don't even need the legacy_mode option. We can just translate n_values=6 to categories=range(6) internally and raise a warning for that (but I need to check this against the actual tests).

The other difference is the handling of unseen categories. With the current behaviour of the OneHotEncoder, if the unseen values are within range(0, max), it will not raise an error even if handle_unknown='error' (the default). But also that can be solved separately by, in such a case, raising a warning that the user should set handle_unknown='ignore' manually to keep the existing behaviour.

The only feature we would lose is the distinction between unknown categories that are within range(0, max) (currently not regarded as 'unknown' by the OneHotEncoder) and those that are bigger than that (> max, which are currently already regarded as unknown by the OneHotEncoder).
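A small sketch of that distinction with the 0.19-era behaviour (default n_values='auto' and handle_unknown='error'):

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder().fit([[2], [3], [5]])  # max seen value is 5
    enc.transform([[4]])  # within range(0, 6) but unseen: silent all-zero row
    enc.transform([[7]])  # above the seen range: raises ValueError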

no, that is the sort of thing we have tried before and it's just too finicky. Unless there is good reason to maintain current behaviour, we should just have a legacy_mode to slowly bring us to the future.

no, that is the sort of thing we have tried before and it's just too finicky.

Can you clarify to which aspect this "no" refers?
To the fact that I think a legacy_mode is not needed?

yes, to the idea that you can just make something that is both backwards compatible and what we want going forward

yes, to the idea that you can just make something that is both backwards compatible and what we want going forward

That was not what I tried to suggest. I wanted to make clear that I think it is possible to not have a legacy_mode keyword, not by having it magically be both backwards compatible and what we want in the future, but by deprecating the behaviour of the existing keywords.

So to be concrete: a non-default value of n_values can be deprecated and has to be replaced by a categories specification. handle_unknown in the case of integer data should be set explicitly by the user to choose either full ignoring or full erroring instead of the current mix (and otherwise a deprecation warning is raised).

so if I do .fit([[5]]).transform([[4]]), for which values of n_values, categories and handle_unknown will that raise an error?


can we just make it so that during deprecation, categories must be set explicitly, and legacy mode with warnings is otherwise in effect? Is that what you are suggesting?

can we just make it so that during deprecation, categories must be set explicitly, and legacy mode with warnings is otherwise in effect? Is that what you are suggesting?

Yes, it might still be missing a case, but I think this is possible (will check by actually coding it next week).

The different 'legacy' cases:

  • n_values='auto' (the default)

    • handle_unknown='ignore' -> fine, no change in behaviour
    • handle_unknown='error' -> problem: values within the range are still ignored, values above the range error
      • Possible solution: in fit, if the range is consecutive => fine, no change in behaviour (for all people that now combine LabelEncoder with it, which is a typical use case I think); if this is not the case: raise a deprecation warning that they have to set categories explicitly to keep this behaviour (and internally use legacy mode)

  • n_values=value

    • this can be translated to categories=[range(value)] internally, and raise a deprecation warning that the user should do that themselves in the future
    • in this case handle_unknown='error' / 'ignore' work as expected

The deprecation warning in the case of n_values='auto' will only be raised in fit and not upon construction (which is not really ideal), but it is only in fit that we know that the user is passing numeric data and not string data.
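A hypothetical sketch of that internal translation for the n_values=value case (the helper name is illustrative, not the actual implementation):

    import warnings

    def _translate_n_values(n_values):
        # Hypothetical: map the deprecated n_values=<int> spec onto the
        # proposed categories spec, warning the user about the change.
        warnings.warn(
            "Passing an integer as n_values is deprecated; use "
            "categories=[range(n_values)] instead.", DeprecationWarning)
        return [range(n_values)]  # e.g. n_values=6 -> categories=[range(6)]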

we don't usually raise warnings until fit in any case so don't worry about that.

that strategy sounds mostly good.

I'm not actually sure if we should be sniffing for strings in the data, though. You basically want it to be: legacy mode is active if categories is not set and if the data is all integers?

One question: if categories and n_values parameters are their default, do we publish categories_? If n_values is set explicitly, do we publish categories_?


You basically want it to be: legacy mode is active if categories is not set and if the data is all integers?

Yes indeed (in practice it will more or less be the same)

One question: if categories and n_values parameters are their default, do we publish categories_? If n_values is set explicitly, do we publish categories_?

I personally would provide the attributes of the new interface as much as possible, even in legacy mode. So in both cases I would calculate categories_ (even if it is a bit more work)


So I tried to put the above logic in code (will push some updates to the PR), and I have one more question for the case of integer data when n_values or categories is not set (the typical case for 'legacy mode'). The problem lies in the fact that if the inferred categories are simply a consecutive range (0, 1, 2, 3, ..., max), there is no difference between the new and old (legacy) behaviour, and we don't necessarily need to raise a deprecation warning.
Some possibilities to do in this specific case:

1) Detect this case (that the inferred categories are a consecutive range), and in that case don't raise a warning.
- This is possible to detect (with a little bit of extra code complexity) as we are already in fit anyhow
- I think this will be a common case when using OneHotEncoder with integer data, and a case where the user actually does not need to worry about our refactoring, so it would be nice not to bother him/her with a warning
2) Always raise a warning, and indicate in the warning message what to do if you are in such a case (in addition to an explanation of what to do if you don't have a consecutive range):
- If they know they have only consecutive ranges as categories, they will want to ignore the warning, so we can add to the warning message an explanation of how to do this (add a code sample with filterwarnings they can copy-paste; see the sketch after this list)
- A potential advantage of this is that we can also add to the warning message that if they used the LabelEncoder to create the integers, they can now directly use OneHotEncoder (I think this currently is a typical usage pattern). That way, the warning will also go away
3) Always raise a warning but provide a keyword to silence it (e.g. legacy_mode=False)
- If we find the advice to use a filterwarnings statement (see point 2 above) too cumbersome, we could also add a keyword to obtain the same result
- A disadvantage of this is introducing a keyword that will not be needed anymore in a few releases when the deprecations are cleaned up.
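For option 2, the copy-pasteable snippet suggested in the warning message could look something like this (the message text here is of course hypothetical):

    import warnings

    # Silence the deprecation warning when your integer categories form a
    # consecutive range and the refactoring does not change your output:
    warnings.filterwarnings(
        "ignore",
        message="The handling of integer data will change",
        category=FutureWarning,
    )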

I am personally in favor of option 1 or 2. Using the LabelEncoder before OneHotEncoder seems to be a typical pattern (from a quick GitHub search), and in those cases you always have consecutive ranges, so there will never be a change in behaviour with the new implementation, and we shouldn't warn for it. On the other hand, if we warn, we can point out that if they used LabelEncoder, they no longer need to do so; it would be nice to actually give this advice explicitly.
The question is how frequently users have such consecutive integers as categories without having used LabelEncoder as the previous step...

Hmm, one case I forgot is when you have inferred integer categories that are not consecutive (let's say [1, 3, 5]), but you want the new behaviour and not the legacy behaviour (so in that case you cannot just ignore the warning, as that would handle unseen values differently in the transform step, i.e. values in between the range (e.g. 2) will not raise an error).
In case we don't provide the legacy_mode=False keyword, the only way to obtain the new behaviour is by manually passing categories=[1,3,5], which can be a slight inconvenience. That might be a reason to favor option 3 and give up my objection to introducing a temporary keyword legacy_mode=False (but I am also not fully sure it is worth it, as this would be the only case* where such a keyword is actually needed)

* this only case = integer data with inferred categories that are not a consecutive range, and where you cannot / don't want to set the categories manually or set handle_unknown to ignore.
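For completeness, the manual workaround mentioned above, written against the new-style API proposed in the PR (the exact signature may still change):

    from sklearn.preprocessing import OneHotEncoder

    # Opt in to the new behaviour by spelling out the categories:
    enc = OneHotEncoder(categories=[[1, 3, 5]])
    enc.fit([[1], [3], [5]])
    enc.transform([[2]])  # now raises an error: 2 is an unknown category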

Sorry for all the long text, but it's quite complex :)

We're only talking about the case where n_values is unset, right?

I'm fine with 1., and it would not be any more expensive, since auto already needs to examine the set of labels. I could also accept, for simplicity, a variant of 3. that was just "OneHotEncoder running in legacy mode. Set categories='auto' for slightly different behaviour without a warning."

We're only talking about the case where n_values is unset, right?

Yes (the other case can easily be translated into its equivalent categories value, with a nice deprecation warning, and with no difference between the new and legacy behaviour)

a variant of 3. that was just "OneHotEncoder running in legacy mode. Set categories='auto' for slightly different behaviour without a warning."

Ah, that sounds like a good idea! (regardless of whether we detect the consecutive-categories case or not). So we set the default of categories to None in the code (without changing the semantics of its default), so we know if the user set it explicitly; that is a nice way to indicate legacy_mode=False without needing that extra keyword.
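A toy sketch of that sentinel-default idea (heavily simplified and hypothetical; only the categories handling is shown):

    import warnings
    import numpy as np

    class SketchOneHotEncoder:
        def __init__(self, categories=None):  # None = "not set by the user"
            self.categories = categories

        def fit(self, X):
            X = np.asarray(X)
            if self.categories is None and np.issubdtype(X.dtype, np.integer):
                warnings.warn(
                    "OneHotEncoder running in legacy mode. Set "
                    "categories='auto' for slightly different behaviour "
                    "without a warning.", FutureWarning)
            # (category inference and the actual encoding are omitted)
            return self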

Yes, but only if we want to warn every time someone uses it without passing categories. It's the cheap implementation approach, but it might be unnecessarily verbose for the users, which is why I would prefer 1 if it can be done simply.

What fresh hell is this :-/

OR we could name the new one DummyEncoder ;) (though that is a bit conflicting with the DummyClassifier)

@amueller Don't read all of the above!
I was just planning to make a nice summary for new readers of the issue. The above discussion is overly complicated (also because I was still not fully understanding the current complex behaviour of OneHotEncoder ... :-))

OR we could name the new one DummyEncoder ;)

I think @GaelVaroquaux was against that because "one-hot" is known to be this in more fields (and we already use 'Dummy' for other things in scikit-learn ...)

Redoing this for consistency in naming is not worth it imho. We are not consistent in naming anywhere. Could you summarize the discussions that led to this?

I think "dummy" is what statisticians use and it's what pandas uses.

The top post is still accurate and worth a read, and it summarizes the reasoning for not keeping CategoricalEncoder (which does not mean that we need to use OneHotEncoder instead of eg DummyEncoder, that's a separate question)

I read the top post. That's what I referred to when I said "redoing this for consistency is not worth it".

issues that are opened

Can you explain that?

"redoing this for consistency is not worth it"

With consistency, are you pointing to the naming scheme of "what it accepts" vs "what it does"? If so, that was only a minor reason. For me it is mainly a question of scalability in adding more features to a single class.

issues that are opened

We had the issue about how to handle missing values (https://github.com/scikit-learn/scikit-learn/issues/10465), and for this you could want different behaviour for ordinal and one-hot encoding (or not all options are valid for both, ...). We also already have the existing handle_unknown, which is only relevant for one-hot encoding and not for ordinal. And there was https://github.com/scikit-learn/scikit-learn/issues/10518 about feature weighting for one-hot encoding, also not relevant for ordinal (this issue in the end was not a problem, as you can do the weighting with the ColumnTransformer transformer_weights argument). And we also have the feature request to add something like drop_first for one-hot, which is again not relevant for ordinal encoding.

I don't see how the proposed change would help with the missing values that much. And having incompatible options is something that happens often in scikit-learn. Not ideal, but also not a big deal imho.

I don't see how the proposed change would help with the missing values that much.

It does not help as such, but it makes it less complex to have options specifically tailored to the different encoding types.

And having incompatible options is something that happens often in scikit-learn. Not ideal, but also not a big deal imho.

Currently it is certainly still OK, there are not too many incompatible options (but also partly because I moved sparse=True/False into the encoding option). But the question is to what extent we want to expand the encoding functionality in scikit-learn in the future, which is of course a difficult question to answer now.
We already have a PR for 'unary encoding'. Shouldn't this rather be added to CategoricalEncoder instead of adding a new class UnaryEncoder? And what if someone wants to add a 'binary encoding'? Or a '(mean) target encoder'?

The "mean target encoder" is CountTransformer, there's a PR for that ;)

Do you have a link for that? Searching for "CountTransformer" does not give any results

Sorry, CountFeaturizer #9614

It's certainly related, but not exactly a mean target encoding. Also, it adds columns, not replaces, so will not yet work out of the box for string categorical data (but that is more feedback on that PR, not to discuss here).

Why is it not mean target encoding? But yeah let's not divert too much here ;)

So as a summary of the actual questions we need to answer (in this order!):

  1. Do we keep the current CategoricalEncoder? If not, the idea is to split it into different classes, one class for each type of encoding (currently 'onehot' and 'ordinal' encoding).

  2. If we split in multiple classes, we could (ideally?) use OneHotEncoder for the 'onehot' encoding, but this class already exists. So, do we integrate the new 'onehot' encoding (which supports strings and has different parameters) in the existing OneHotEncoder class? Or do we choose another name? (eg DummyEncoder)

  3. If we choose to integrate into the existing OneHotEncoder, are we OK with the following consequences: we deprecate a bunch of the keywords/attributes of OneHotEncoder, and a specific use case (automatically ignoring unseen values within the range of seen values) will not be possible anymore after the deprecation period.

Most of the discussion above was about question 3 (the complex details of how to integrate CategoricalEncoder(encoding='onehot') into OneHotEncoder). But let's first agree on a decision for the first 2 questions.

the other factor for me is that everyone thinks the current auto mode in OneHotEncoder is weird. its implementation converting coo to csr is also weird. it deserves a redesign. and telling people "if you want to one hot encode strings, go to CategoricalEncoder instead" is awkward, because OHE is already intended for categoricals...

hrm. I guess we kept OneHotEncoder because it's more efficient when it can be used.... Ideally we would get rid of all the weird behaviors. I kinda had wanted to deprecate it but then we didn't...

I kinda had wanted to deprecate it but then we didn't...

In my POC PR (https://github.com/scikit-learn/scikit-learn/pull/10523), I deprecated almost everything of OneHotEncoder, except its name ...

It's not much more efficient. And if LabelEncoder had fast paths for ints in range [0, n_values-1], if justified, that would be good enough.

@amueller, are you persuaded by the issue that we ultimately want different additional parameters (e.g. drop_first, nan handling) depending on the encoding, and that justifies having a different discrete encoder for each encoding format?

I'll try to look at this in the spring break in two weeks, ok? not sure if I'll have time before that :-/

I hope this isn't the wrong place to ask, but what does the current implementation do with tables that are mixed categorical and non-categorical within one column? Taking the example from https://github.com/pandas-dev/pandas/issues/17418

Consider the dataframe df = pd.DataFrame([{'apple': 1, 'pear':'a', 'carrot': 1}, {'apple':'a', 'pear':2, 'carrot':3}, {'apple': 2, 'pear':3, 'carrot':1}, {'apple': 3, 'pear':'b', 'carrot': 1}, {'apple': 4, 'pear':4, 'carrot': 1}]) which equals:

  apple  carrot pear
0     1       1    a
1     a       3    2
2     2       1    3
3     3       1    b
4     4       1    4

DictVectorizer gives exactly what I need in this case.

    from sklearn.feature_extraction import DictVectorizer

    enc = DictVectorizer(sparse=False)
    enc.fit_transform(df.to_dict(orient='records'))  # df as defined above

This gives:

array([[ 1.,  0.,  1.,  0.,  1.,  0.],
       [ 0.,  1.,  3.,  2.,  0.,  0.],
       [ 2.,  0.,  1.,  3.,  0.,  0.],
       [ 3.,  0.,  1.,  0.,  0.,  1.],
       [ 4.,  0.,  1.,  4.,  0.,  0.]])

We can see the features names of the columns with:

    enc.feature_names_
    ['apple', 'apple=a', 'carrot', 'pear', 'pear=a', 'pear=b']

It would be great if the new CategoricalEncoder had an option to do the same.

I don't think we intend to handle that kind of mixed case

That's a shame. One simple sub-case is where a column is numerical but has some missing values. A simple solution is to convert the NaNs into empty strings and then use DictVectorizer as in my example above. This effectively creates a new feature for when the value is missing but leaves the numerical values unchanged otherwise. I have found this a very useful technique.
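Roughly what that trick looks like in code (a sketch; the column name 'x' is just for illustration):

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer

    df = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
    enc = DictVectorizer(sparse=False)
    print(enc.fit_transform(df.fillna('').to_dict(orient='records')))
    print(enc.feature_names_)  # ['x', 'x=']: missing values get an indicator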

Will the new CategoricalEncoder be able to do something similar?

we've considered allowing users to have NaN treated as a separate category or similar. but that's not the same as handling arbitrary numeric values as different from strings.

That sounds good.

You are right there are two use cases. Let me explain a particular example of where treating numeric values as different from strings has been useful for me. It may be that there is a better solution.

Say you have an integer numeric feature which takes a large range of values. However you suspect that for some small values, the precise value is significant. For larger values you suspect this isn’t the case. A simple thing to do is to convert all small values to strings, run DictVectorizer as above and then perform feature selection or just use your favorite classifier directly.

So you're using it for a non-linear discretisation? The next release is likely to include a fixed-width discretizer, but following on from a log transform or a quantile transform it should act quite similar to what you want... But the log transform alone might be sufficient in your setting.


@jnothman Yes in a sense except with a twist. Say I suspect that some of the values from 1...1024 are meaningful. That is 22 indicates something specific which is quite different from 21 or 23. Taking logs won't help here. But I want to leave all the values over 1024 as numerical as I don't think those specific values mean much.

It sounds like you know too much about your variable for a generic transform to be the sort of thing you need.


@jnothman To be a little clearer, I don't know that 22 is significant. I just suspect that some values are but I don't know which ones or how many there are. I have found the "convert to a string" and then DictVectorizer method to be very useful for discovering which these are.

@lesshaste For the issue about NaNs as a separate category, see https://github.com/scikit-learn/scikit-learn/issues/10465
If you want to further discuss the specific non-linear discretization or mixed numeric/string encoding, feel free to open a new issue. But I would like to keep this one focused on the original issue, i.e. the naming and organisation into different classes of the CategoricalEncoder/OneHotEncoder.

I'll try to look at this in the spring break in two weeks, ok? not sure if I'll have time before that :-/

@amueller that's fine. I won't have time the coming two weeks to work on the PR that is blocked by this anyway. After that I should also have time again to work on it.

@amueller did you have time to give this a look?

@amueller are you OK with me going ahead with the PR to split CategoricalEncoder into OrdinalEncoder and OneHotEncoder (and with deprecating the current arguments of OneHotEncoder)?

Sorry for being absent. Seems ok, but can you maybe give me two weeks so I can actually review? Thanks!

@amueller no problem, for me the same :-)
But, I am now planning to look at this again. So if you could give this a look, that would be welcome. I have some work to do on the PR (https://github.com/scikit-learn/scikit-learn/pull/10523), so don't review it in detail yet (you can look at it to get an idea of what we propose, however).
I think the main question I want to see answered before I put a lot of time in it, is if you are OK with splitting up CategoricalEncoder into multiple classes, and in that case, if you are OK with re-using OneHotEncoder (which means deprecating some of its current (strange) features). Those questions are summarized in https://github.com/scikit-learn/scikit-learn/issues/10521#issuecomment-363851328 and https://github.com/scikit-learn/scikit-learn/issues/10521#issuecomment-364802471.

(and once we agree on that part, there is still a lot to discuss about the actual implementation in the PR :))

I updated the PR https://github.com/scikit-learn/scikit-learn/pull/10523, ready for review

I'll cautiously say I'm back ;)

IMHO the most important thing is a universal API (i.e. parameters and behavior patterns) for all of the encoders we discuss

P.S. https://github.com/scikit-learn-contrib/categorical-encoding ?

In the category_encoders package, all encoders have a cols argument, similar to the categorical_features in the old OneHotEncoder (although it does not accept exactly the same kind of values). See eg http://contrib.scikit-learn.org/categorical-encoding/onehot.html
So that is related to the current discussion we are having in https://github.com/scikit-learn/scikit-learn/pull/10523 about deprecating categorical_features or not.
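As an illustration of that cols argument (based on the docs linked above; behaviour may differ between category_encoders versions):

    import pandas as pd
    import category_encoders as ce

    df = pd.DataFrame({'pear': ['a', 'b', 'a'], 'weight': [1.0, 2.0, 3.0]})
    enc = ce.OneHotEncoder(cols=['pear'])  # encode only the listed columns
    print(enc.fit_transform(df))           # 'weight' passes through untouched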

For the rest I think there are not really conflicting keywords (they have some others specific to dataframes which we won't add to sklearn at this point). The naming for OneHotEncoder and OrdinalEncoder at least is consistent with the category_encoders package.
