Pandas: Adding a Series to a DataFrame with a different index turns the Series into all NaNs.

Created on 6 Dec 2011  ·  9 comments  ·  Source: pandas-dev/pandas

Case in point:

>>> df
               RP/Rsum  P.value
ID                             
A_23_P42353    17.8     0      
A_23_P369994   15.91    0      
A_33_P3262440  436.7    0.0005 
A_32_P199429   18.97    0      
A_23_P256724   22.24    0      
A_33_P3394689  24.24    0      
A_33_P3403117  27.14    0      
A_24_P252364   28.56    0      
A_23_P99515    31.82    0      
A_24_P261750   31.46    0 

>>> df.dtypes
RP/Rsum    float64
P.value    float64

>>> ids = pandas.Series(['51513', '9201', np.nan, np.nan, '8794', '6530', '7025', '4897', '84935', '11081'])
>>> df["test"] = ids
>>> df
               RP/Rsum  P.value  test
ID                                   
A_23_P42353    17.8     0        NaN 
A_23_P369994   15.91    0        NaN 
A_33_P3262440  436.7    0.0005   NaN 
A_32_P199429   18.97    0        NaN 
A_23_P256724   22.24    0        NaN 
A_33_P3394689  24.24    0        NaN 
A_33_P3403117  27.14    0        NaN 
A_24_P252364   28.56    0        NaN 
A_23_P99515    31.82    0        NaN 
A_24_P261750   31.46    0        NaN 
>>> df.dtypes
RP/Rsum    float64
P.value    float64
test       object

This also happens with float objects, etc. I am not sure what the trigger is.

Most helpful comment

A Series gets an implicit 0, ..., N-1 index when you do not supply one, so this is exactly the behavior I would expect. It would not happen if data were a raw ndarray or a list. In fact, when you do:

df[col] = series

the Series is conformed exactly to the index of df. This is a feature, not a bug. :) So

df['test'] = ids.values

will work fine in your example.
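
The difference between label-aligned and positional assignment can be seen in a small sketch. It reuses three of the IDs and values from the report; the column names "aligned" and "positional" are made up for illustration, and the modern `import pandas as pd` spelling is assumed.

```python
import numpy as np
import pandas as pd

# Trimmed stand-in for the frame in the report: string labels as the index.
df = pd.DataFrame(
    {"P.value": [0.0, 0.0, 0.0005]},
    index=["A_23_P42353", "A_23_P369994", "A_33_P3262440"],
)

# No index is supplied, so the Series is labelled 0, 1, 2.
ids = pd.Series(["51513", "9201", np.nan])

# Assigning the Series aligns on labels; none match, so every row is NaN.
df["aligned"] = ids

# Assigning the raw values is positional; only the length has to match.
df["positional"] = ids.values

print(df)
```

Building the Series with `index=df.index` in the first place should give the same result as the positional assignment, since the labels then match.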

All 9 comments

I wonder whether this is related to another problem I ran into this morning.

>>> df = pandas.DataFrame(index=[1,2,3,4])
>>> df["test"] = pandas.Series(["B", "fdf", "344", np.nan])
>>> df["test2"] = ["B", "fdf", "344", np.nan]
>>> df
   test  test2
1  fdf   B    
2  344   fdf  
3  NaN   344  
4  NaN   nan  

It looks like some kind of off-by-one error to me.

Digging further points to the call to Series.reindex, made when the item is set, as the culprit.

>>> data 
0    B
1    fdf
2    344
3    NaN

>>> df.index = ["A", "B", "C", "D"]
>>> data.reindex(df.index).values
array([nan, nan, nan, nan], dtype=object)

More digging leads to reindex being called on the index attribute, which gives odd results.

>>> data.index.reindex(df.index)
(Index([A, B, C, D], dtype=object), array([-1, -1, -1, -1], dtype=int32))

These -1 values get turned into NaN.
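
The -1 values are the usual "label not found" marker in an indexer rather than a miscalculation. A minimal sketch, assuming a current pandas where `Index.get_indexer` exposes the same lookup that the transcript above reaches through `Index.reindex`:

```python
import numpy as np
import pandas as pd

data = pd.Series(["B", "fdf", "344", np.nan])  # implicit labels 0..3
target = pd.Index(["A", "B", "C", "D"])

# For each target label: its position in data.index, or -1 when it is absent.
indexer = data.index.get_indexer(target)
print(indexer)

# reindex takes values at those positions and fills the -1 slots with NaN.
print(data.reindex(target))
```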

I have updated the bug title with a more accurate description.

In that case, this should be documented somewhere if it is not already. In the meantime I will adjust my own code as you suggested. Thanks.

http://pandas.sourceforge.net/dsintro.html#column-selection-addition-deletion

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame's index:

In [180]: df['one_trunc'] = df['one'][:2]

In [181]: df
Out[181]: 
   one  flag   foo  one_trunc
a  1    False  bar  1        
b  2    False  bar  2        
c  3    True   bar  NaN      
d  NaN  False  bar  NaN      

You can insert raw ndarrays but their length must match the length of the DataFrame's index.

What is the idea behind conforming a Series to the DataFrame's index when it is inserted with a different index?

When a DataFrame is created from Series, the resulting index covers all of the individual Series indexes. So why isn't the same idea applied for df['new_column'] = series?
So I try to add data, but every value that does not match the DataFrame index is ignored?
If _index extension_ existed, couldn't you always do df['new_column'] = series.reindex(df.index) when you do not want the index extended (the current behavior)?

In [256]: df = pandas.DataFrame({'A': pandas.Series(['foo', 'bar'], index=['a', 'b']),
   .....:                        'B': pandas.Series([10, 20], index=['b', 'c'])})

In [257]: df
Out[257]:
   A    B
a  foo  NaN
b  bar  10.000
c  NaN  20.000

In [258]: df['C'] = pandas.Series(range(3), index=['a', 'c', 'd'])

In [259]: df
Out[259]:
   A    B       C
a  foo  NaN     0.000
b  bar  10.000  NaN
c  NaN  20.000  1.000

In the example above, I would expect a row 'd' in the DataFrame.

Well, the basic idea is that a DataFrame is a "fixed-length dict-like container of Series". When you construct a DataFrame from a dict of Series without an explicit index, there is no obvious index other than the union of all of them.

I can see the argument for extending the index implicitly, but there are trade-offs either way.
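
For readers who do want row 'd', one way to opt in to index extension is to reindex the frame to the union of both indexes before assigning. This is only a sketch built on the example from this thread; the names `new`, `kept`, and `extended` are illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Series(["foo", "bar"], index=["a", "b"]),
    "B": pd.Series([10, 20], index=["b", "c"]),
})
new = pd.Series(range(3), index=["a", "c", "d"])

# Current behavior: 'd' is not in df.index, so its value is dropped.
kept = df.copy()
kept["C"] = new

# Opt-in extension: grow the frame to the union of both indexes first.
extended = df.reindex(df.index.union(new.index))
extended["C"] = new

print(extended)
```

The extra `reindex` makes the change in length explicit at the call site, which fits the fixed-length view of a DataFrame described above.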
