Pandas: Wishlist: make get_dummies() usable for train / test framework

Created on November 28, 2014  ·  21 comments  ·  Source: pandas-dev/pandas

Having get_dummies() in Pandas is really nice, but to be useful for machine learning it would need to be usable in a train / test framework (or "fit_transform" and "transform", in sklearn terminology). Let me know if this needs more explanation.

So I guess this is a wishlist bug report to add that functionality to Pandas. I can even create a pull request if people agree this would be useful to have in Pandas (and are willing to coach a bit and do code review for what would be my first contribution to this project).

Categorical Reshaping Usage Question


All 21 comments

A pseudo-code example with the inputs and outputs from a sample frame would be useful.

@chrish42, an example would be great.

FYI, scikit-learn has the OneHotEncoder class, which fits into their pipeline.

Something like this should work?

import pandas as pd
from sklearn.base import TransformerMixin

class DummyEncoder(TransformerMixin):

    def __init__(self, columns=None):
        self.columns = columns

    def transform(self, X, y=None, **kwargs):
        # Stateless: encodes whatever values are present in X.
        return pd.get_dummies(X, columns=self.columns)

    def fit(self, X, y=None, **kwargs):
        return self

Giving

In [15]: df
Out[15]: 
   A  B  C
0  1  a  a
1  2  b  a

In [16]: DummyEncoder().transform(df)
Out[16]: 
   A  B_a  B_b  C_a
0  1    1    0    1
1  2    0    1    1

Be careful with the ordering of the columns.
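To make the column-ordering caveat concrete, here is a minimal sketch (not part of the original comment) that pins a test encoding to the columns produced for a training frame. The frames `train` and `test` are made up for illustration, and `dtype=int` is only used so the output is 0/1 on every pandas version.

```python
import pandas as pd

# Made-up frames: the test frame lacks B == "a" and has an unseen C == "c".
train = pd.DataFrame({"A": [1, 2], "B": ["a", "b"], "C": ["a", "a"]})
test = pd.DataFrame({"A": [3], "B": ["b"], "C": ["c"]})

train_dummies = pd.get_dummies(train, dtype=int)

# Reindex the test encoding to the training columns: missing columns are
# filled with 0, columns unseen in training (here C_c) are dropped, and the
# column order is taken from the training frame.
test_dummies = pd.get_dummies(test, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_dummies)
```

This only fixes the layout after the fact; the categorical approach discussed later in the thread avoids the mismatch in the first place.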

@TomAugspurger, actually the compatibility with the sklearn processing pipeline itself is not the part that interests me. What I would like is the ability to save the transformation done by get_dummies() on one dataset and then apply that transformation as is (creating exactly the same columns) to a second dataset, even if the second dataset has only a subset of the first one's values in some column, etc. That is what I meant by "usable in a train/test framework". Is this explanation clearer? (I can add an example if someone thinks one is still needed.)

I'm aware of the OneHotEncoder class in sklearn, but it has other limitations.

I stumbled upon the same problem as @chrish42, and get_dummies has been giving me some headaches.

Example of the limitations of the current get_dummies

Let us assume we work with data from the following df_train DataFrame:

```python
df_train = pandas.DataFrame({"car": ["seat", "bmw"], "color": ["red", "green"]})
pandas.get_dummies(df_train)

   car_bmw  car_seat  color_green  color_red
0        0         1            0          1
1        1         0            1          0
```

Then we are provided with

```python
df_test = pandas.DataFrame({"car": ["seat", "mercedes"], "color": ["red", "green"]})
pandas.get_dummies(df_test)

   car_mercedes  car_seat  color_green  color_red
0             0         1            0          1
1             1         0            1          0
```

Since I have never observed a "mercedes" value for the variable "car" in df_train, I would like to be able to get the following one-hot encoding:

```python
   car_bmw  car_seat  color_green  color_red
0        0         1            0          1
1        0         0            1          0
```

Where the column car_mercedes actually never appears.

This could be solved by allowing get_dummies to receive an input dictionary stating the accepted values that we allow for each column.

Returning to the previous example, we could give as input to get_dummies the following dict of sets

```python
accepted_values_per_column = {'car': {'bmw', 'seat'}, 'color': {'green', 'red'}}
```

and we would expect get_dummies to return

```python
get_dummies(df_test, accepted_values_per_column=accepted_values_per_column)

   car_bmw  car_seat  color_green  color_red
0        0         1            0          1
1        0         0            1          0
```

and expect get_dummies(df_test) to return what it already returns.
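The proposed behaviour can be approximated today without a new parameter. The sketch below is only an illustration: `get_dummies_with_accepted_values` is a hypothetical helper, not a pandas API, and it assumes that values outside the accepted set should produce an all-zero row.

```python
import pandas as pd


def get_dummies_with_accepted_values(df, accepted_values_per_column):
    """Hypothetical helper (not part of pandas): one-hot encode `df`,
    creating one column per accepted value and nothing else."""
    df = df.copy()
    for col, accepted in accepted_values_per_column.items():
        # Sorting gives a deterministic column order. Values outside the
        # accepted set become NaN and therefore encode as an all-zero row.
        dtype = pd.CategoricalDtype(categories=sorted(accepted))
        df[col] = df[col].astype(dtype)
    return pd.get_dummies(df, dtype=int)


df_test = pd.DataFrame({"car": ["seat", "mercedes"], "color": ["red", "green"]})
accepted_values_per_column = {"car": {"bmw", "seat"}, "color": {"green", "red"}}

encoded = get_dummies_with_accepted_values(df_test, accepted_values_per_column)
print(encoded)
```

The result has the columns car_bmw, car_seat, color_green and color_red, with the "mercedes" row encoded as zeros in both car columns.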

If you want to specify _possibly_ unobserved values, you simply need to make your variables Categorical. This can be done at creation time or afterwards; see the docs.

In [5]: df_train = pd.DataFrame({"car":Series(["seat","bmw"]).astype('category',categories=['seat','bmw','mercedes']),"color":["red","green"]})

In [6]: df_train
Out[6]: 
    car  color
0  seat    red
1   bmw  green

In [7]: pd.get_dummies(df_train )
Out[7]: 
   car_seat  car_bmw  car_mercedes  color_green  color_red
0         1        0             0            0          1
1         0        1             0            1          0

The original question is not well specified, so closing.
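The session above uses the `astype('category', categories=...)` spelling from the pandas of that time. On recent pandas versions the same idea is usually written with `pd.CategoricalDtype`; the following is a minimal sketch of that equivalent, with `dtype=int` added as an assumption so the result is 0/1 rather than booleans.

```python
import pandas as pd

# Fixed category list, including a value ("mercedes") absent from the data.
car_dtype = pd.CategoricalDtype(categories=["seat", "bmw", "mercedes"])

df_train = pd.DataFrame({
    "car": pd.Series(["seat", "bmw"]).astype(car_dtype),
    "color": ["red", "green"],
})

# Every declared category gets a column, observed or not.
dummies = pd.get_dummies(df_train, dtype=int)
print(dummies.columns.tolist())
```

The column list matches the one in the session: car_seat, car_bmw, car_mercedes, color_green, color_red.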

And when you're going the other way, from the encoding back to a Categorical, you'll use Categorical.from_codes.
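A minimal sketch of that round trip, assuming the category list is known and every encoded row contains exactly one 1 (an all-zero row would wrongly decode to the first category):

```python
import pandas as pd

categories = ["seat", "bmw", "mercedes"]
cars = pd.Series(pd.Categorical(["seat", "bmw", "seat"], categories=categories))

# One column per category, in category order.
encoded = pd.get_dummies(cars, dtype=int).to_numpy()

# The position of the 1 in each row is the category code.
codes = encoded.argmax(axis=1)
decoded = pd.Categorical.from_codes(codes, categories=categories)

print(list(decoded))
```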

One more bit of unsolicited advice: if you care at all about accurate estimates of the coefficients on the categoricals, drop one of the encoded columns, or else you'll have multicollinearity with the intercept (if you have one).
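The point about collinearity can be seen directly: the full set of dummy columns sums to 1 in every row, which duplicates an intercept column. A small sketch using the `drop_first` option of `get_dummies` (the frame is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "car": pd.Categorical(
        ["seat", "bmw", "mercedes"], categories=["seat", "bmw", "mercedes"]
    )
})

full = pd.get_dummies(df, dtype=int)
reduced = pd.get_dummies(df, drop_first=True, dtype=int)

# Each row of the full encoding sums to 1, i.e. it reproduces the intercept.
print(full.sum(axis=1).tolist())
# drop_first removes the column of the first category ("seat").
print(reduced.columns.tolist())
```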


@TomAugspurger @jreback I think I have run into the same problem lately, and I would like to give an example:

train_a = pd.DataFrame({"IsBadBuy": [0, 1, 0], "Make": ['Toyota', 'Mazda', 'BMW']})
print(pd.get_dummies(train_a, columns=['Make']))

   IsBadBuy  Make_BMW  Make_Mazda  Make_Toyota
0         0         0           0            1
1         1         0           1            0
2         0         1           0            0

test_a = pd.DataFrame({"Make": ['Toyota', 'BMW']})
print(pd.get_dummies(test_a, columns=['Make']))

   Make_BMW  Make_Toyota
0         0            1
1         1            0

Ideally the Make_Mazda column should be preserved here, because the ML algorithm expects the same number of features and the values in the test set will be a subset of those in the training set.

Use a Categorical. That will expand to the correct number of columns. I gave a talk about this if you're interested: https://m.youtube.com/watch?v=KLPtEBokqQ0


Thanks, @TomAugspurger.

The PyData Chicago 2016 talk given by @TomAugspurger was really well done. He did a fantastic job of illustrating all the reasons why this issue/request should not be closed. IMHO either his DummyEncoder class or some reasonable equivalent should be included in Pandas proper. Yes, I can go to his GitHub and copy/emulate his class, but it would be much nicer to have it supported within the library.

I think there's a need for a library that goes early in the data-modelling pipeline and works well with pandas and scikit-learn. But pandas doesn't depend on scikit-learn, and vice versa. I think there's room for another library built on top of both.


Here's a little solution some of us worked on that may be helpful for some here: dummy variables with fit/transform capabilities.

https://github.com/joeddav/get_smarties

Feedback and contributions would be helpful!

This appears related to #14017.

I have created a solution that may be helpful for exactly this problem: one-hot encoding categorical variables in a train/test framework. It can also handle cases where the dataset is too large to fit in machine memory.

https://github.com/yashu-seth/dummyPy

There is also a small tutorial on this.

@TomAugspurger This code doesn't work. When I go to transform my single-record production data, it only gives me the one-hot encoded column for the single value that is present.
What am I missing?

import pyodbc
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.pipeline import make_pipeline

class DummyEncoder(TransformerMixin):
    def fit(self, X, y=None):
        self.index_ = X.index
        self.columns_ = X.columns
        self.cat_columns_ = X.select_dtypes(include=['category']).columns
        self.non_cat_columns_ = X.columns.drop(self.cat_columns_)

        self.cat_map_ = {col: X[col].cat for col in self.cat_columns_}

        left = len(self.non_cat_columns_)
        self.cat_blocks_ = {}
        for col in self.cat_columns_:
            right = left + len(X[col].cat.categories)
            self.cat_blocks_[col], left = slice(left, right), right
        return self

    def transform(self, X, y=None):
        return np.asarray(pd.get_dummies(X))

    def inverse_transform(self, X):
        non_cat = pd.DataFrame(X[:, :len(self.non_cat_columns_)],
                               columns=self.non_cat_columns_)
        cats = []
        for col, cat in self.cat_map_.items():
            slice_ = self.cat_blocks_[col]
            codes = X[:, slice_].argmax(1)
            series = pd.Series(pd.Categorical.from_codes(
                codes, cat.categories, ordered=cat.ordered
            ), name=col)
            cats.append(series)
        df = pd.concat([non_cat] + cats, axis=1)[self.columns_]
        return df

# Get data from SQL into a pandas DataFrame

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER={XXXXX};DATABASE={ML_Learn_Taxi};UID={XXXX};PWD={XXXX}')
sql = """
SELECT top 1 CONVERT(int, [order_key]) order_key
, CONVERT(int, [service_date_key]) service_date_key
, [order_source_desc]
, 1 as 'return_flag'
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc] = 'Online'
UNION
SELECT top 2 CONVERT(int, [order_key])
, CONVERT(int, [service_date_key])
, [order_source_desc]
, 2
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc] = 'Inbound Call'
UNION
SELECT top 1 CONVERT(int, [order_key])
, CONVERT(int, [service_date_key])
, [order_source_desc]
, 1
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc] = 'Outbound Call'
"""

prod_sql = """
SELECT top 1 CONVERT(int, [order_key]) order_key
, CONVERT(int, [service_date_key]) service_date_key
, [order_source_desc]
, 1 as 'return_flag'
FROM [ML_Return_Customer].[dbo].[return_customers_test_set]
WHERE [order_source_desc] = 'Online'
"""

InputDataSet = pd.read_sql(sql, cnxn)
ProdDataSet = pd.read_sql(prod_sql, cnxn)

print("***** Data *****")
print(InputDataSet)

print("***** Category Columns Info *****")
columns = ['order_source_desc']
InputDataSet[columns] = InputDataSet[columns].apply(lambda x: x.astype('category'))

InputDataSet.info()

print("***** Linear Regression *****")

X = InputDataSet.drop('return_flag', axis=1)
y = InputDataSet['return_flag']

A = ProdDataSet.drop('return_flag', axis=1)
B = ProdDataSet['return_flag']

enc = DummyEncoder()
enc.fit(X)

rain = enc.transform(X)

Prod = enc.transform(A)

print(Prod)

Output:

***** Data *****
   order_key  service_date_key order_source_desc  return_flag
0   10087937          20151214            Online            1
1   10088174          20151201      Inbound Call            2
2   10088553          20151217      Inbound Call            2
3663478               20160806     Outbound Call            1
***** Category Columns Info *****
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
order_key            4 non-null int64
service_date_key     4 non-null int64
order_source_desc    4 non-null category
return_flag          4 non-null int64
dtypes: category(1), int64(3)
memory usage: 284.0 bytes
***** Linear Regression *****
[[10087937 20151214 1]]
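A likely cause of the single-column output above, assuming the production frame comes straight from `read_sql`: `order_source_desc` in `ProdDataSet` is never cast to the categories seen in training, and `transform` simply calls `get_dummies` on whatever values are present. The sketch below uses small made-up frames in place of the SQL results (the `order_key` values are placeholders) and reuses the training dtype for the production record.

```python
import pandas as pd

# Stand-in for the training query result; order_key values are placeholders.
train = pd.DataFrame({
    "order_key": [1, 2, 3, 4],
    "order_source_desc": ["Online", "Inbound Call", "Inbound Call", "Outbound Call"],
})
train["order_source_desc"] = train["order_source_desc"].astype("category")

# Stand-in for the single production record.
prod = pd.DataFrame({"order_key": [5], "order_source_desc": ["Online"]})

# Reuse the dtype learned from the training frame so the single record
# expands to the same three dummy columns as the training data.
prod["order_source_desc"] = prod["order_source_desc"].astype(
    train["order_source_desc"].dtype
)

prod_encoded = pd.get_dummies(prod, dtype=int)
print(prod_encoded)
```

With the cast in place the production row has one column per training category instead of a single column for the value that happens to be present.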

So I think this thread is a bit messy, so I will try to summarize a simple solution here and show how this is already possible. I will demonstrate on one column, but you can generalize it to many.

So in the "fit" call you just do:

categories = sorted(training_data.iloc[:, column_index].value_counts(dropna=True).index)

You store categories in the state you learn during fitting.

And then in "transform" you do:

import pandas
from pandas.api import types as pandas_types

categorical_data = testing_data.iloc[:, [column_index]].astype(
    pandas_types.CategoricalDtype(categories=categories),
)
one_hot_encoded = pandas.get_dummies(categorical_data)

This always performs the one-hot encoding with the same mapping of values. If some categorical value was not present during training, it will be seen as NaN during testing. If some value is not seen during testing, its column is still created but contains only zeros.
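Generalized to several columns, the fit/transform recipe above might look like the following sketch. `DummiesEncoder` is a hypothetical name chosen for illustration, not an existing pandas or scikit-learn class, and the frames reuse the Make example from earlier in the thread with an extra unseen value ("Audi").

```python
import pandas as pd


class DummiesEncoder:
    """Hypothetical helper: learn categories in fit, reuse them in transform."""

    def __init__(self, columns):
        self.columns = list(columns)

    def fit(self, df):
        # Remember the sorted, non-null values of each column as its categories.
        self.dtypes_ = {
            col: pd.CategoricalDtype(categories=sorted(df[col].dropna().unique()))
            for col in self.columns
        }
        return self

    def transform(self, df):
        df = df.copy()
        for col, dtype in self.dtypes_.items():
            # Values not seen during fit become NaN and encode as all zeros.
            df[col] = df[col].astype(dtype)
        return pd.get_dummies(df, columns=self.columns, dtype=int)


train_a = pd.DataFrame({"IsBadBuy": [0, 1, 0], "Make": ["Toyota", "Mazda", "BMW"]})
test_a = pd.DataFrame({"Make": ["Toyota", "BMW", "Audi"]})

encoder = DummiesEncoder(columns=["Make"]).fit(train_a)
encoded_test = encoder.transform(test_a)
print(encoded_test)
```

The test frame keeps the Make_Mazda column even though no test row uses it, and the unseen "Audi" row is all zeros.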

That's very nice. I just wish everybody who wants to do this didn't have to discover it anew. ;-)

The approach suggested by @mitar is a nice, short example. For a longer exploration of this issue, here is a notebook that might be helpful: https://nbviewer.jupyter.org/github/SuperCowPowers/scp-labs/blob/master/notebooks/Categorical_Encoding_Dangers.ipynb

I saw the code below in an exercise of the Kaggle XGBoost tutorial. It does the trick.

X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
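One detail worth noting about this trick: `align` with `join='left'` fills columns that the other frame lacks with NaN unless a `fill_value` is given. A small sketch with made-up frames, passing `fill_value=0` so the missing dummy columns read as "category absent":

```python
import pandas as pd

X_train = pd.get_dummies(pd.DataFrame({"Make": ["Toyota", "Mazda", "BMW"]}), dtype=int)
X_test = pd.get_dummies(pd.DataFrame({"Make": ["Toyota", "Audi"]}), dtype=int)

# join="left" keeps the training columns and drops Make_Audi from the test
# frame; fill_value=0 replaces the NaN that align would otherwise insert
# into the columns the test frame lacks.
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)

print(X_test)
```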

๋‚˜๋Š” ๋˜ํ•œ ๊ฐ™์€ ๋ฌธ์ œ์— ์—ฌ๋Ÿฌ ๋ฒˆ ์ง๋ฉดํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‚˜๋Š” ๋‚˜๋ฅผ ์œ„ํ•ด ์ผ์„ ๋” ์‰ฝ๊ฒŒ ๋งŒ๋“ค์–ด์ฃผ๋Š” ์ˆ˜์—… (์ด ํ† ๋ก ์—์„œ ์•„์ด๋””์–ด๋ฅผ ์–ป์Œ)์„ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

import pandas
from sklearn.preprocessing import LabelEncoder

class CategoryEncoder:
    '''
    labelEncoding : boolean -> True If the categorical columns are to be label encoded
    oneHotEncoding : boolean -> True If the categorical columns are to be one hot encoded (using pandas.get_dummies method)
    dropFirst : boolean -> True if first column is to be dropped (usually to avoid multi-collinearity) post one hot encoding
                           Doesn't matter if oneHotEncoding = False

    df : pandas.DataFrame() -> dataframe object that needs to be encoded
    catCols : list -> list of the categorical columns that need to be encoded
    '''
    def __init__(self,labelEncoding=True,oneHotEncoding=False,dropFirst=False):
        self.labelEncoding = labelEncoding
        self.oneHotEncoding = oneHotEncoding
        self.dropFirst = dropFirst
        self.labelEncoder = {}
        self.oneHotEncoder = {}

    def fit(self,df,catCols=[]):
        df1 = df.copy()
        if self.labelEncoding:
            for col in catCols:
                labelEncoder = LabelEncoder()
                # Cast to str so that fit and transform see the same value type.
                df1[col] = labelEncoder.fit_transform(df1[col].astype(str))
                self.labelEncoder[col] = labelEncoder.classes_

        if self.oneHotEncoding:
            for col in catCols:
                cats = sorted(df1[col].value_counts(dropna=True).index)
                self.oneHotEncoder[col] = cats
        return self

    def transform(self,df,catCols=[]):
        df1 = df.copy()
        if self.labelEncoding:
            for col in catCols:
                # Map each class seen during fit to its integer code;
                # values not seen during fit become NaN.
                mapping = {v:i for i,v in enumerate(self.labelEncoder[col].tolist())}
                df1[col] = df1[col].astype(str).map(mapping)

        if self.oneHotEncoding:
            for col in catCols:
                cats = self.oneHotEncoder[col]
                # Assign with df1[col] rather than .loc so the column
                # actually takes the categorical dtype.
                df1[col] = df1[col].astype(pandas.CategoricalDtype(categories=cats))
            df1 = pandas.get_dummies(df1,columns=catCols,drop_first=self.dropFirst)

        return df1

It is easy to create and use an instance of the encoder, too.

enc1 = CategoryEncoder(True,False)     # Will label encode but not one-hot encode
enc2 = CategoryEncoder(False,True,True)     # Will one-hot encode but not label encode
enc3 = CategoryEncoder(True,True,True)     # Will label encode first and then one-hot encode

# List of categorical columns you want to encode
categorical_columns = ['col_1', 'col_2']

enc1.fit(train_df, categorical_columns)
enc1.transform(test_df, categorical_columns) # Returns the dataframe encoded columns

Note: this does not handle error cases such as passing a column name that does not exist in the dataframe.
