When encoding categorical variables for linear regression, perfect collinearity can be a problem: if the model includes an intercept, the n dummy columns for a variable always sum to 1. The usual fix is to encode only n-1 of the n levels. It would be useful if pd.get_dummies() had a boolean parameter that returns n-1 columns for each categorical column that gets encoded.
Example:
>>> df
    Account  Network      Device
0  Account1   Search  Smartphone
1  Account1  Display      Tablet
2  Account2   Search  Smartphone
3  Account3  Display  Smartphone
4  Account2   Search      Tablet
5  Account3   Search  Smartphone
>>> pd.get_dummies(df)
   Account_Account1  Account_Account2  Account_Account3  Network_Display  \
0                 1                 0                 0                0
1                 1                 0                 0                1
2                 0                 1                 0                0
3                 0                 0                 1                1
4                 0                 1                 0                0
5                 0                 0                 1                0
   Network_Search  Device_Smartphone  Device_Tablet
0               1                  1              0
1               0                  0              1
2               1                  1              0
3               0                  1              0
4               1                  0              1
5               1                  1              0
Instead, I'd like get_dummies() to take a parameter such as drop_first=True that does something like this:
>>> new_df = pd.DataFrame(index=df.index)
>>> for i in df:
...     new_df = new_df.join(pd.get_dummies(df[i]).iloc[:, 1:])
...
>>> new_df
   Account2  Account3  Search  Tablet
0         0         0       1       0
1         0         0       0       1
2         1         0       1       0
3         0         1       0       0
4         1         0       1       1
5         0         1       1       0
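For anyone who wants to try the idea before anything lands in pandas, here is a minimal sketch of the requested behaviour. The names `dummies_drop_first` and `rank_with_intercept` are hypothetical helpers made up for this example, not pandas API. The sketch keeps the column-name prefix (so levels from different columns cannot collide) and uses NumPy's matrix rank to check whether the design matrix with an intercept still has redundant columns.

```python
import numpy as np
import pandas as pd


def dummies_drop_first(frame):
    """Hypothetical helper: one-hot encode every column, dropping its first level."""
    parts = []
    for col in frame.columns:
        # Prefix with the column name so equal level names in two columns stay distinct.
        encoded = pd.get_dummies(frame[col], prefix=col, dtype=int)
        # Levels come out sorted, so the "first" level is the smallest one.
        parts.append(encoded.iloc[:, 1:])
    return pd.concat(parts, axis=1)


def rank_with_intercept(dummies):
    """Hypothetical helper: rank of the design matrix [1 | dummies]."""
    intercept = np.ones(len(dummies))
    design = np.column_stack([intercept, dummies.to_numpy(dtype=float)])
    return np.linalg.matrix_rank(design)


# Same toy data as in the example above.
df = pd.DataFrame({
    "Account": ["Account1", "Account1", "Account2", "Account3", "Account2", "Account3"],
    "Network": ["Search", "Display", "Search", "Display", "Search", "Search"],
    "Device": ["Smartphone", "Tablet", "Smartphone", "Smartphone", "Tablet", "Smartphone"],
})

full = pd.get_dummies(df, dtype=int)   # n columns per variable
reduced = dummies_drop_first(df)       # n-1 columns per variable

print(full.shape[1] + 1, rank_with_intercept(full))        # columns vs. rank
print(reduced.shape[1] + 1, rank_with_intercept(reduced))  # columns vs. rank
```

If the rank is lower than the number of columns (intercept included), some columns are exact linear combinations of the others, which is the collinearity problem described at the top.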
Sources
http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
http://stackoverflow.com/questions/31498390/how-to-get-pandas-get-dummies-to-emit-n-1-variables-to-avoid-co-lineraity
http://dss.princeton.edu/online_help/analysis/dummy_variables.htm
Sounds good, interested in submitting a pull request?
:+1:
It would be advantageous to allow dropping a specific value, not just the 'first'.
The omitted category (the reference group) determines how the coefficients are interpreted, since each one is measured relative to it.
For example, one best practice is to omit the 'largest' (most frequent) value as the reference category:
```python
import pandas as pd

# Assumes `df` has columns 'vol_k' and 'activation' plus category-dtype columns.
hot = df[['vol_k', 'activation']]
cat_vars = list(df.select_dtypes(include=['category']).columns)
for var in cat_vars:
    new = pd.get_dummies(df[var])
    hot = hot.join(new)
    # Drop the most frequent level so it becomes the reference category.
    counts = df.groupby(var).size()
    drop_col = counts.idxmax()
    hot = hot.drop(drop_col, axis=1)
    # format() also works when the level is not a string.
    print("{} dropping {}".format(var, drop_col))
    print(counts)
```
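The snippet above joins unprefixed dummy columns, so two variables that share a level name would collide. Below is a rough, self-contained variant that prefixes each dummy with its column name and reports which level was omitted. `dummies_drop_most_frequent` is a hypothetical helper name, and the toy frame is made up for illustration.

```python
import pandas as pd


def dummies_drop_most_frequent(frame, columns):
    """Hypothetical helper: one-hot encode `columns`, omitting each one's most frequent level."""
    parts = []
    dropped = {}
    for col in columns:
        encoded = pd.get_dummies(frame[col], prefix=col, dtype=int)
        # The most frequent level becomes the reference category (ties: first one found).
        reference = frame[col].value_counts().idxmax()
        dropped[col] = reference
        parts.append(encoded.drop(columns="{}_{}".format(col, reference)))
    return pd.concat(parts, axis=1), dropped


toy = pd.DataFrame({
    "Network": ["Search", "Display", "Search", "Display", "Search", "Search"],
    "Device": ["Smartphone", "Tablet", "Smartphone", "Smartphone", "Tablet", "Smartphone"],
})

encoded, reference_levels = dummies_drop_most_frequent(toy, ["Network", "Device"])
print(reference_levels)
print(encoded)
```

Returning the dropped levels alongside the encoded frame makes it easier to remember which group each coefficient is being compared against.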