Pandas get_dummies() and n-1 Categorical Encoding Option to avoid Collinearity?

Created on 15 Jan 2016  ·  3 comments  ·  Source: pandas-dev/pandas

When doing linear regression and encoding categorical variables, perfect collinearity can be a problem. To get around this, the suggested approach is to use n-1 columns. It would be useful if pd.get_dummies() had a boolean parameter that returns n-1 for each categorical column that gets encoded.

Example:

>>> df
    Account  Network      Device
0  Account1   Search  Smartphone
1  Account1  Display      Tablet
2  Account2   Search  Smartphone
3  Account3  Display  Smartphone
4  Account2   Search      Tablet
5  Account3   Search  Smartphone
>>> pd.get_dummies(df)
   Account_Account1  Account_Account2  Account_Account3  Network_Display  \
0                 1                 0                 0                0   
1                 1                 0                 0                1   
2                 0                 1                 0                0   
3                 0                 0                 1                1   
4                 0                 1                 0                0   
5                 0                 0                 1                0   

   Network_Search  Device_Smartphone  Device_Tablet  
0               1                  1              0  
1               0                  0              1  
2               1                  1              0  
3               0                  1              0  
4               1                  0              1  
5               1                  1              0 

Instead, I'd like get_dummies() to take a parameter such as drop_first=True that does something like this:

>>> new_df = pd.DataFrame(index=df.index)
>>> for i in df:
...     new_df = new_df.join(pd.get_dummies(df[i]).iloc[:, 1:])


>>> new_df
   Account2  Account3  Search  Tablet
0         0         0       1       0
1         0         0       0       1
2         1         0       1       0
3         0         1       0       0
4         1         0       1       1
5         0         1       1       0
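For reference, the call site with such a flag could look like the sketch below. This is a minimal sketch, assuming drop_first=True drops the first level of each encoded column and that get_dummies keeps its usual prefix_value column naming (a drop_first parameter with this behaviour was in fact added to get_dummies in a later pandas release):

```
import pandas as pd

df = pd.DataFrame({
    'Account': ['Account1', 'Account1', 'Account2', 'Account3', 'Account2', 'Account3'],
    'Network': ['Search', 'Display', 'Search', 'Display', 'Search', 'Search'],
    'Device': ['Smartphone', 'Tablet', 'Smartphone', 'Smartphone', 'Tablet', 'Smartphone'],
})

# Proposed usage: keep n-1 dummies per categorical column by dropping
# the first level of each.
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
# ['Account_Account2', 'Account_Account3', 'Network_Search', 'Device_Tablet']
```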

Sources
http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
http://stackoverflow.com/questions/31498390/how-to-get-pandas-get-dummies-to-emit-n-1-variables-to-avoid-co-lineraity
http://dss.princeton.edu/online_help/analysis/dummy_variables.htm

Label: Reshaping

All 3 comments

Sounds good, interested in submitting a pull request?

:+1:

It would be advantageous to allow dropping a specific value, not just the 'first'.

The omitted category (reference group) influences the interpretation of the coefficients.

For example, one common practice is to omit the 'largest' (most frequent) value as the reference category:

```

hot = df[['vol_k', 'activation']]

cat_vars = list(df.select_dtypes(include=['category']).columns)
for var in cat_vars:
    new = pd.get_dummies(df[var])
    hot = hot.join(new)

    # drop the most frequent value to use it as the reference category
    drop_col = df.groupby([var]).size().idxmax()
    hot.drop(drop_col, axis=1, inplace=True)

    print(var + " dropping " + drop_col)
    print(df.groupby([var]).size())

```
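Following up on that idea: the same logic can be wrapped in a small helper that drops an explicit reference level per column, defaulting to the most frequent one. The helper name and signature below are made up for illustration, a sketch rather than an official API:

```
import pandas as pd

def dummies_with_reference(s, ref=None):
    """Dummy-encode a Series, omitting `ref` as the reference category.
    If `ref` is None, the most frequent value is omitted."""
    if ref is None:
        ref = s.value_counts().idxmax()          # modal value as the reference
    dummies = pd.get_dummies(s, prefix=s.name)
    return dummies.drop(columns=s.name + '_' + str(ref))

network = pd.Series(['Search', 'Display', 'Search', 'Display', 'Search', 'Search'],
                    name='Network')
print(dummies_with_reference(network))           # keeps only Network_Display
```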
