Pandas: get_dummies() and an n-1 categorical encoding option to avoid collinearity?

Created on 15 Jan 2016  ·  3 comments  ·  Source: pandas-dev/pandas

When running a linear regression and encoding categorical variables, perfect collinearity can be a problem: the n dummy columns of a variable always sum to 1, which duplicates the intercept. The suggested way around this is to use n-1 columns. It would be useful if pd.get_dummies() had a boolean parameter that returns n-1 columns for each categorical column it encodes.
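To make the collinearity concrete, here is a minimal sketch that compares the rank of a design matrix built from all n dummy columns with one built from n-1. The one-column frame `network` and the variable names are made up for illustration, and the intercept column is added by hand with NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical one-column frame, used only for illustration.
network = pd.DataFrame(
    {"Network": ["Search", "Display", "Search", "Display", "Search", "Search"]}
)

# Full encoding: one dummy column per category (n columns).
full = pd.get_dummies(network).astype(float)
# Design matrix = intercept column + all dummy columns.
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])

# Reduced encoding: drop the first dummy column (n-1 columns).
reduced = full.iloc[:, 1:]
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 3 columns, rank 2: rank-deficient
print(X_reduced.shape[1], rank_reduced)  # 2 columns, rank 2: full column rank
```

With all dummies present, the Display and Search columns add up to the intercept column, so the matrix loses one rank; dropping one dummy restores full column rank.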

Example:

>>> df
    Account  Network      Device
0  Account1   Search  Smartphone
1  Account1  Display      Tablet
2  Account2   Search  Smartphone
3  Account3  Display  Smartphone
4  Account2   Search      Tablet
5  Account3   Search  Smartphone
>>> pd.get_dummies(df)
   Account_Account1  Account_Account2  Account_Account3  Network_Display  \
0                 1                 0                 0                0   
1                 1                 0                 0                1   
2                 0                 1                 0                0   
3                 0                 0                 1                1   
4                 0                 1                 0                0   
5                 0                 0                 1                0   

   Network_Search  Device_Smartphone  Device_Tablet  
0               1                  1              0  
1               0                  0              1  
2               1                  1              0  
3               0                  1              0  
4               1                  0              1  
5               1                  1              0 

Instead, I'd like get_dummies() to have a parameter such as drop_first=True that works like this:

>>> new_df = pd.DataFrame(index=df.index)
>>> for i in df:
...     new_df = new_df.join(pd.get_dummies(df[i]).iloc[:, 1:])
...
>>> new_df
   Account2  Account3  Search  Tablet
0         0         0       1       0
1         0         0       0       1
2         1         0       1       0
3         0         1       0       0
4         1         0       1       1
5         0         1       1       0
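For readers landing here later: pandas releases after this request do ship a `drop_first` parameter on `get_dummies()`. The sketch below rebuilds the example frame from above and calls it directly. Which level gets dropped follows the sorted order of the values, so treat the choice of reference level as something to check rather than assume.

```python
import pandas as pd

# The example frame from the issue, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    "Account": ["Account1", "Account1", "Account2", "Account3", "Account2", "Account3"],
    "Network": ["Search", "Display", "Search", "Display", "Search", "Search"],
    "Device": ["Smartphone", "Tablet", "Smartphone", "Smartphone", "Tablet", "Smartphone"],
})

# drop_first=True drops the first level of every encoded column,
# leaving n-1 dummy columns per categorical variable.
encoded = pd.get_dummies(df, drop_first=True)

print(list(encoded.columns))
# ['Account_Account2', 'Account_Account3', 'Network_Search', 'Device_Tablet']
```

Unlike the loop above, encoding the whole frame prefixes each dummy column with its source column name, which avoids clashes when two variables share a level name.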

Sources
http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
http://stackoverflow.com/questions/31498390/how-to-get-pandas-get-dummies-to-emit-n-1-variables-to-avoid-co-lineraity
http://dss.princeton.edu/online_help/analysis/dummy_variables.htm

Reshaping

All 3 comments

Interested in submitting a pull request?

:+1:

It would be beneficial to allow dropping a specific value, not just the 'first'.

The omitted category (the reference group) affects how the coefficients are interpreted.

For example, one best practice is to omit the 'largest' (most frequent) value as the reference category.

```
import pandas as pd

# df is assumed to be an existing DataFrame with numeric and category-dtype columns
hot = df[['vol_k', 'activation']]

cat_vars = list(df.select_dtypes(include=['category']).columns)
for var in cat_vars:
    new = pd.get_dummies(df[var])
    hot = hot.join(new)

    # drop the most frequent level so it serves as the reference category
    drop_col = df.groupby(var).size().idxmax()
    hot = hot.drop(columns=drop_col)

    print(var + " dropping " + str(drop_col))
    print(df.groupby(var).size())
```
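The same idea can be wrapped in a small function. This is only a sketch: `dummies_drop_most_frequent` is a hypothetical helper, not part of pandas, and the `vol_k` values are invented. It differs from the snippet above in that it prefixes dummy columns with the source column name and uses `value_counts()` to find the most frequent level.

```python
import pandas as pd

def dummies_drop_most_frequent(frame, columns):
    """Hypothetical helper: one-hot encode `columns`, omitting each
    column's most frequent level so it acts as the reference category."""
    parts = [frame.drop(columns=columns)]  # keep the non-encoded columns
    dropped = {}
    for col in columns:
        dummies = pd.get_dummies(frame[col], prefix=col)
        reference = frame[col].value_counts().idxmax()  # most frequent level
        dropped[col] = reference
        parts.append(dummies.drop(columns=f"{col}_{reference}"))
    return pd.concat(parts, axis=1), dropped

# Illustrative data; the vol_k values are made up.
df = pd.DataFrame({
    "vol_k": [1.0, 2.5, 0.7, 3.1, 1.8, 2.2],
    "Network": ["Search", "Display", "Search", "Display", "Search", "Search"],
    "Device": ["Smartphone", "Tablet", "Smartphone", "Smartphone", "Tablet", "Smartphone"],
})

encoded, dropped = dummies_drop_most_frequent(df, ["Network", "Device"])
print(dropped)                # {'Network': 'Search', 'Device': 'Smartphone'}
print(list(encoded.columns))  # ['vol_k', 'Network_Display', 'Device_Tablet']
```

Returning the dropped levels alongside the frame keeps the reference category explicit, which matters when reading the regression coefficients afterwards.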
