Pandas: pandas get_dummies() and n-1 categorical encoding option to avoid collinearity?

Created on 15 Jan 2016 · 3 comments · Source: pandas-dev/pandas

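When running a linear regression on encoded categorical variables, perfect collinearity can become a problem. The recommended way to avoid it is to use n-1 columns for a variable with n categories. It would be useful if pd.get_dummies() had a boolean parameter that returns n-1 columns for each categorical column being encoded.

Example: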

>>> df
    Account  Network      Device
0  Account1   Search  Smartphone
1  Account1  Display      Tablet
2  Account2   Search  Smartphone
3  Account3  Display  Smartphone
4  Account2   Search      Tablet
5  Account3   Search  Smartphone

>>> pd.get_dummies(df)
   Account_Account1  Account_Account2  Account_Account3  Network_Display  \
0                 1                 0                 0                0   
1                 1                 0                 0                1   
2                 0                 1                 0                0   
3                 0                 0                 1                1   
4                 0                 1                 0                0   
5                 0                 0                 1                0   

   Network_Search  Device_Smartphone  Device_Tablet  
0               1                  1              0  
1               0                  0              1  
2               1                  1              0  
3               0                  1              0  
4               1                  0              1  
5               1                  1              0
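To make the collinearity concrete, here is a minimal sketch that rebuilds the example frame and checks the rank of a design matrix with an intercept. The variable names (`X_full`, `X_reduced`) are illustrative only, and the intercept column is an assumption about how the regression would be set up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Account": ["Account1", "Account1", "Account2", "Account3", "Account2", "Account3"],
    "Network": ["Search", "Display", "Search", "Display", "Search", "Search"],
    "Device": ["Smartphone", "Tablet", "Smartphone", "Smartphone", "Tablet", "Smartphone"],
})

# Full (n-column) encoding of a single variable, converted to floats.
account = pd.get_dummies(df["Account"]).to_numpy(dtype=float)

# Every row contains exactly one 1, so the dummy columns add up to
# a column of ones, which is exactly what an intercept column is.
row_sums = account.sum(axis=1)

intercept = np.ones((len(df), 1))
X_full = np.hstack([intercept, account])            # intercept + n dummies
X_reduced = np.hstack([intercept, account[:, 1:]])  # intercept + n-1 dummies

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # 4 columns, rank 3: rank deficient
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3: full column rank
```

With all n dummy columns plus an intercept the matrix loses a rank, which is the perfect collinearity described above. Dropping one column per variable restores full column rank.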

What I would like instead is for get_dummies() to accept a parameter such as drop_first=True, producing something like this:

>>> new_df = pd.DataFrame(index=df.index)
>>> for i in df:
    new_df = new_df.join(pd.get_dummies(df[i]).iloc[:, 1:])


>>> new_df
   Account2  Account3  Search  Tablet
0         0         0       1       0
1         0         0       0       1
2         1         0       1       0
3         0         1       0       0
4         1         0       1       1
5         0         1       1       0
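For readers on a pandas release where `get_dummies` accepts `drop_first` (the parameter was added after this request, in 0.18.0 as far as I know), the loop above can be replaced with a single call. This is a sketch against the example frame, not a statement about how the option is implemented.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["Account1", "Account1", "Account2", "Account3", "Account2", "Account3"],
    "Network": ["Search", "Display", "Search", "Display", "Search", "Search"],
    "Device": ["Smartphone", "Tablet", "Smartphone", "Smartphone", "Tablet", "Smartphone"],
})

# drop_first=True removes the first level of each encoded column,
# leaving n-1 indicator columns per categorical variable.
encoded = pd.get_dummies(df, drop_first=True)
print(list(encoded.columns))
```

Unlike the manual loop, the resulting columns keep the `Column_value` prefix (for example `Network_Search` rather than `Search`), and the indicator dtype depends on the pandas version.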

Sources
http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
http://stackoverflow.com/questions/31498390/how-to-get-pandas-get-dummies-to-emit-n-1-variables-to-avoid-co-lineraity
http://dss.princeton.edu/online_help/analysis/dummy_variables.htm

Reshaping

jaradc

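Most helpful comment

It would be advantageous to allow dropping a specific value, not just the 'first'.

The omitted category (the reference group) affects how the coefficients are interpreted.

For example, one best practice is to omit the "largest" (most frequent) value as the reference category.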

```
import pandas as pd

# df is assumed to hold the columns 'vol_k' and 'activation'
# plus one or more columns of dtype 'category'
hot = df[['vol_k', 'activation']]

cat_vars = list(df.select_dtypes(include=['category']).columns)
for var in cat_vars:
    new = pd.get_dummies(df[var])
    hot = hot.join(new)

    # drop the most frequent value so it becomes the reference category
    drop_col = df.groupby([var]).size().idxmax()
    hot.drop(drop_col, axis=1, inplace=True)

    print(var + " dropping " + drop_col)
    print(df.groupby([var]).size())
```
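A self-contained variant of the same idea, run against the example frame from the issue, might look like the sketch below. The helper name `dummies_drop_most_frequent` is hypothetical and not part of pandas. The `Account` column is left out because its levels are tied in frequency.

```python
import pandas as pd


def dummies_drop_most_frequent(frame, columns):
    """Hypothetical helper: encode `columns`, omitting each one's most frequent level."""
    parts = []
    for col in columns:
        dummies = pd.get_dummies(frame[col], prefix=col)
        # idxmax() on the counts returns the most frequent level,
        # which becomes the reference category.
        reference = frame[col].value_counts().idxmax()
        parts.append(dummies.drop(columns=f"{col}_{reference}"))
    return pd.concat(parts, axis=1)


df = pd.DataFrame({
    "Account": ["Account1", "Account1", "Account2", "Account3", "Account2", "Account3"],
    "Network": ["Search", "Display", "Search", "Display", "Search", "Search"],
    "Device": ["Smartphone", "Tablet", "Smartphone", "Smartphone", "Tablet", "Smartphone"],
})

encoded = dummies_drop_most_frequent(df, ["Network", "Device"])
print(list(encoded.columns))
```

Here `Search` and `Smartphone` are the most frequent levels, so they are dropped and the remaining coefficients would be read relative to them.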

jcress on 9 Nov 2017

👍7

All 3 comments

Interested in submitting a pull request?

TomAugspurger on 15 Jan 2016

:+1:

StephenKappel on 19 Jan 2016