Pandas: Deprecation of relabeling dicts in groupby.agg brings many issues

Created on 19 Nov 2017  ·  37 comments  ·  Source: pandas-dev/pandas

This issue is created based on the discussion from #15931 following the deprecation of relabeling dicts in groupby.agg. A lot of what is summarized below was already discussed in the previous discussion. I would recommend in particular https://github.com/pandas-dev/pandas/pull/15931#issuecomment-336139085, where the problems are also clearly stated.

The motivation behind the deprecation in #15931 was mostly related to bringing a consistent interface for agg() between Series and DataFrame (see also #14668 for context).

The relabeling functionality with a nested dict has been described by some as too complex and/or inconsistent, and was therefore deprecated.

However, this comes at a price: the impossibility of aggregating and renaming at the same time leads to very annoying issues and to some backward incompatibility for which no sensible workaround is available:

  • _[annoying]_ You no longer have control over the names of the resulting columns.
  • _[annoying]_ You need to find a way to rename the MultiIndex _after_ performing the aggregation, which requires keeping track of the column order in two places in the code. This is not practical at all and is sometimes downright impossible (see the cases below).
  • ⚠️ _[breaking]_ You cannot apply more than one callable with the same internal name to the same input column. This results in two sub-cases:

    • _[breaking]_ You can no longer apply two or more lambda aggregators to the same column.

    • _[breaking]_ You can no longer apply two or more aggregators built from partial functions unless you alter their hidden __name__ attribute.

Example

_(Please note: this is a crafted example intended to demonstrate the problem in as little code as possible, but every issue demonstrated here has bitten me in real life since the change, in situations less simple than this one.)_

Input DataFrame

import pandas as pd

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)
  cat  distance  energy
0   A      1.20    1.80
1   A      1.50    1.95
2   A      1.74    2.04
3   B      0.82    1.25
4   B      1.01    1.60
5   C      0.60    1.01

Before:

Easy to write, easy to read, and works as expected:

import numpy as np
import statsmodels.robust as smrb
from functools import partial

# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)

# renaming and specifying the aggregators at the same time
# note that I want to choose the resulting column names myself
# for example "total_xxxx" instead of just "sum"
mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,   # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
})

results in

          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000

and all that is left is:

# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

Happy dance praising pandas 💃 🕺 !

After

import numpy as np
import statsmodels.robust as smrb
from functools import partial

# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)

# no way of choosing the destination's column names...
mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        'sum',
        lambda x: np.percentile(x, 98), # lambda
        lambda x: np.percentile(x, 17), # lambda
    ],
    'distance': [
        'sum',
        'mean',
        smrb.mad, # original function
        mad_c1,   # partial function wrapping the original function
    ],
})

The above breaks because the lambda functions all produce columns named <lambda>, which results in

SpecificationError: Function names must be unique, found multiple named <lambda>

Backward-incompatible regression: one can no longer apply two different lambdas to the same original column.

If one removes lambda x: np.percentile(x, 98) from the code above, the same issue arises with the partial function, which inherits its function name from the original function:

SpecificationError: Function names must be unique, found multiple named mad

Finally, after overwriting the __name__ attribute of the partial (for example with mad_c1.__name__ = 'mad_c1'), we get:

    energy          distance
       sum <lambda>      sum   mean       mad mad_c1
cat
A     5.79   1.8510     4.44  1.480  0.355825  0.240
B     2.85   1.3095     1.83  0.915  0.140847  0.095
C     1.01   1.0100     0.60  0.600  0.000000  0.000
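
The two name collisions described above come down to how Python names callables. Here is a minimal sketch using only the standard library. The function mad below is a hypothetical stand-in for smrb.mad, not the statsmodels implementation, and the lambda bodies are placeholders. Every lambda reports the same __name__, while a functools.partial object carries no __name__ of its own until one is assigned, so pandas presumably has to fall back on the wrapped function's name.

```python
import statistics
from functools import partial


def mad(values, c=0.6745):
    """Hypothetical stand-in for smrb.mad: median absolute deviation divided by c."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values) / c


# Placeholder bodies; only the names of the callables matter here.
p98 = lambda x: max(x)
p17 = lambda x: min(x)

# Both lambdas carry the same name, hence "found multiple named <lambda>".
lambda_names = (p98.__name__, p17.__name__)

mad_c1 = partial(mad, c=1)
# A partial object has no __name__ of its own; only the wrapped function does.
had_name_before = hasattr(mad_c1, "__name__")
wrapped_name = mad_c1.func.__name__

# The workaround from the post: give the partial an explicit, unique name.
mad_c1.__name__ = "mad_c1"
```

The assignment at the end is exactly the manual patching the post objects to: the name that ends up in the output column has to be set on the function object rather than next to the aggregation.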

which still leaves

  • one column missing (the 98th percentile)
  • the handling of the MultiIndex columns
  • the renaming of the columns

to deal with in separate steps.

No control over the column names is possible after aggregation. The best we can get in an automated way is some combination of the original column name and the _aggregate function's name_, like this:

mydf_agg.columns = ['_'.join(col) for col in mydf_agg.columns]

which results in:

     energy_sum  energy_<lambda>  distance_sum  distance_mean  distance_mad distance_mad_c1
cat
A          5.79           1.8510          4.44          1.480      0.355825           0.240
B          2.85           1.3095          1.83          0.915      0.140847           0.095
C          1.01           1.0100          0.60          0.600      0.000000           0.000
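
As a self-contained illustration of the flattening step, the sketch below rebuilds a small part of the aggregated frame by hand (values copied from the output above; the variable name agg is only for illustration) and joins the two column levels. It also shows that renaming stays a second, separate step.

```python
import pandas as pd

# Hypothetical miniature of the aggregated frame above (values copied from it).
columns = pd.MultiIndex.from_tuples(
    [("energy", "sum"), ("energy", "<lambda>"), ("distance", "sum"), ("distance", "mean")]
)
agg = pd.DataFrame(
    [[5.79, 1.8510, 4.44, 1.480],
     [2.85, 1.3095, 1.83, 0.915],
     [1.01, 1.0100, 0.60, 0.600]],
    index=pd.Index(["A", "B", "C"], name="cat"),
    columns=columns,
)

# Join the two levels of each column tuple into one flat label.
agg.columns = ["_".join(col) for col in agg.columns]

# Renaming is a separate step that has to mirror the aggregation spec by hand.
agg = agg.rename(columns={"energy_sum": "total_energy", "energy_<lambda>": "energy_p17"})
```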

And if you really need different names, you can do it like this:

# columns= is required here; a bare mapping would rename the index labels instead
mydf_agg.rename(columns={
    "energy_sum": "total_energy",
    "energy_<lambda>": "energy_p17",
    "distance_sum": "total_distance",
    "distance_mean": "average_distance"
    }, inplace=True)

But this means that you need to be careful to keep the renaming code (which must now live in a different place in the code) in sync with the code where the aggregation is defined...

Sad pandas user 😢 (who still loves pandas, of course)


I am all in for consistency, and at the same time I deeply regret the deprecation of the _aggregate and rename_ functionality. I hope the examples above make the pain points clear.


Possible solutions

  • Un-deprecate the dict-of-dict relabeling functionality
  • Provide another API that can do the same (but why should there be two methods for the same main purpose, namely aggregation?)
  • ??? (open to suggestions)

_Optional read:_

With respect to the aforementioned discussion in the pull request, which has already been going on for a few months, I only recently realized one of the reasons why I am so bothered by this deprecation: "aggregate and rename" is a natural thing to do with GROUP BY aggregations in SQL, since in SQL you usually provide the destination column name directly next to the aggregation expression, e.g. SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var FROM mytable GROUP BY col1 .
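
To make the SQL comparison concrete, here is a small sketch using Python's built-in sqlite3 module with the example data from this post. The table name mytable comes from the query above; the column names and the choice of SUM and AVG are assumptions made for the illustration, since SQLite has no built-in stddev. Each output name sits directly next to its aggregation expression.

```python
import sqlite3

# Same data as the example DataFrame: (cat, energy, distance)
rows = [("A", 1.8, 1.2), ("A", 1.95, 1.5), ("A", 2.04, 1.74),
        ("B", 1.25, 0.82), ("B", 1.6, 1.01), ("C", 1.01, 0.6)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (cat TEXT, energy REAL, distance REAL)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)

# Aggregate and name the result columns in one place.
cursor = conn.execute(
    """
    SELECT cat,
           SUM(energy)   AS total_energy,
           AVG(distance) AS average_distance
    FROM mytable
    GROUP BY cat
    ORDER BY cat
    """
)
column_names = [d[0] for d in cursor.description]
result = cursor.fetchall()
conn.close()
```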

I'm _not_ saying that pandas should necessarily provide the same functionality as SQL, of course. But the examples above demonstrate why the dict-of-dict API was, in my opinion, a clean and simple solution to many use cases.

(* I don't personally agree that the dict-of-dict approach is complex.)

API Design Groupby

All 37 comments

@zertrin : Thanks for putting this together. I saw that there was a lot of discussion about this back in #15931. As I haven't been able to read it in full, I cannot comment at the moment. Nevertheless, I'll send out a ping.

I agree that renaming with the current agg implementation is very clunky and broken in this example. The nested dicts are somewhat complex, but writing them as you did makes it very clear what is happening.

I suppose there could be a names parameter added to agg that takes a dictionary mapping the aggregated columns to their new names. You could even add another parameter, drop_index, as a boolean to determine whether to keep the upper index level.

So the syntax would turn into:

agg_dict = {'energy': ['sum',
                       lambda x: np.percentile(x, 98), # lambda
                       lambda x: np.percentile(x, 17), # lambda
                      ],
            'distance': ['sum',
                         'mean',
                         smrb.mad, # original function
                         mad_c1,   # partial function wrapping the original function
                        ]
           }

name_dict = {'energy':['energy_sum', 'energy_p98', 'energy_p17'],
             'distance':['distance_sum', 'distance_mean', 'distance_mad', 'distance_mad_c1']}


mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)

Or you could even create an entirely new method, agg_assign, which would work similarly to DataFrame.assign:

mydf.groupby('cat').agg_assign(energy_sum=lambda x: x.energy.sum(),
                               energy_p98=lambda x: np.percentile(x.energy, 98),
                               energy_p17=lambda x: np.percentile(x.energy, 17),
                               distance_sum=lambda x: x.distance.sum(),
                               distance_mean=lambda x: x.distance.mean(),
                               distance_mad=lambda x: smrb.mad(x.distance),
                               distance_mad_c1=lambda x: mad_c1(x.distance))

I actually like this option much better.

For what it's worth, I am also strongly in favour of not deprecating the functionality.

A big reason for me is that there is something deeply odd about mixing Python's function namespace (something tied to the particular implementation) with the column names of the data (something that should surely know nothing about the implementation). The fact that we are seeing columns (potentially multiple columns) named '<lambda>' causes me severe cognitive dissonance.

The renaming approach grates because there is an intermediate step in which unnecessary (and exposed) column names are carried around. Furthermore, they are difficult to rename reliably and systematically, because there are potential dependencies on the implementation.

Aside from that, the nested dict functionality is admittedly complex, but it is a complex operation that is being performed.

TL;DR Please don't deprecate. :)

My contribution is motivated by two things.

  1. I'm aware of, and agree with, the motivation to reduce the bloated API of Pandas. Even if I'm misguided with regard to the perceived motivation to reduce "bloated" API elements, it is still my opinion that Pandas' API could be streamlined.
  2. I think it is better to have a good cookbook with good recipes than to provide an API that satisfies everyone's wants and desires. I'm not claiming that renaming via nested dictionaries qualifies as satisfying whims, since it already existed and we are discussing its deprecation. But it does lie on the spectrum between a streamlined API and something else.

Also, Pandas Series and DataFrame objects have pipe methods to facilitate pipelining. This section of the docs explains that pipe can be used to proxy for methods instead of subclassing. In the same spirit, we could use the new GroupBy.pipe to play a similar role and build proxy methods for groupby objects.

I'll use @zertrin's example

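import numpy as np
import pandas as pd
import statsmodels.robust as smrb
from functools import partial

# Median absolute deviation as a partial function (as defined in the original post)
mad_c1 = partial(smrb.mad, c=1)

# The DataFrame offered up above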
mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)

# Identical dictionary passed to `agg`
funcs = {
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,   # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
}

# Write a proxy method to be passed to `pipe`
def agg_assign(gb, fdict):
    data = {
        (cl, nm): gb[cl].agg(fn)
        for cl, d in fdict.items()
        for nm, fn in d.items()
    }
    return pd.DataFrame(data)

# All the API we need already exists with `pipe`
mydf.groupby('cat').pipe(agg_assign, fdict=funcs)

results in

            distance                                                 energy                        
    average_distance distance_mad distance_mad_c1 total_distance energy_p17 energy_p98 total_energy
cat                                                                                                
A              1.480     0.355825           0.240           4.44     1.8510     2.0364         5.79
B              0.915     0.140847           0.095           1.83     1.3095     1.5930         2.85
C              0.600     0.000000           0.000           0.60     1.0100     1.0100         1.01

The pipe method makes adding new API unnecessary in many cases. It also provides a means of replacing the deprecated functionality we are discussing. Therefore, I'd be inclined to go forward with the deprecation.

I really like tdpetrou's idea of using names=name_dict.

This could make everyone happy. It would let us easily rename columns as we wish.

Not really. As mentioned in my initial post, this would not solve the issue of decoupling the place where the aggregate operation is defined from the name of the resulting column, so an extra effort is needed to make sure both stay "synchronized".

I'm not saying that it's a bad solution (after all, it solves the other issues), but it wouldn't be as easy and clear as the dict of dict approach. What I mean is that at writing time you need to keep both dicts of lists synchronized, and when reading the source, the reader has to make an effort to match the names in the second dict of lists with the aggregate definitions in the first dict of lists. That's twice the effort in each case.

"The nested dicts are somewhat complex, but writing them as you did makes it very clear what is happening."

I still do not understand why everyone seems to say that dict of dict is complex. To me, that's the clearest way of doing it.

That said, if the names keyword is the only solution the pandas team is comfortable with, that would still be an improvement over the current situation.

@pirsquared that's an interesting solution with the current API, although in my opinion it is not quite easy to grasp (I don't really understand how it works :confused: )

I started a thread on the datascience subreddit - What do you hate about pandas?. Someone brought up their contempt for the MultiIndex returned after a groupby and pointed to the dplyr do verb, which is implemented in plydata. It happens to work exactly like agg_assign, so that was quite interesting.

@zertrin agg_assign would be superior to your dict of dict approach and would be identical to SQL aggregations, as well as allowing multiple columns to interact with one another within the aggregation. It would also work identically to DataFrame.assign.

Any thoughts @jreback @TomAugspurger ?

...
mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)

Though this solves the problem, one needs to align keys and values in two places. I think an API that does not require such book-keeping code (as suggested for .agg_assign) is less error-prone.

There is also the issue of the clean-up code after using the API. When groupby operations return a MultiIndex dataframe, in most cases the user undoes the MultiIndex. The straightforward, declarative way of using .agg_assign suggests no hierarchy, no MultiIndex output, and no clean-up afterwards.

Based on usage patterns, I think multi-index outputs should be strictly opt-in rather than opt-out.

I was initially skeptical about the agg_assign proposal, but the last two comments have convinced me that this could be a good solution.

I am thinking especially of the possibility of using it in the form agg_assign(**relabeling_dict), which would let me define my relabeling_dict like this:

relabeling_dict = {
    'energy_sum': lambda x: x.energy.sum(),
    'energy_p98': lambda x: np.percentile(x.energy, 98),
    'energy_p17': lambda x: np.percentile(x.energy, 17),
    'distance_sum': lambda x: x.distance.sum(),
    'distance_mean': lambda x: x.distance.mean(),
    'distance_mad': lambda x: smrb.mad(x.distance),
    'distance_mad_c1': lambda x: mad_c1(x.distance)
}

That would be quite flexible and would solve all the issues mentioned in my OP.

@zertrin @has2k1

I was thinking about this a bit more, and this functionality already exists with apply. You simply return a Series with the new column names as the index and the aggregations as the values. This allows for spaces in the names and gives you the ability to order the columns exactly how you wish:

def my_agg(x):
    data = {'energy_sum': x.energy.sum(),
            'energy_p98': np.percentile(x.energy, 98),
            'energy_p17': np.percentile(x.energy, 17),
            'distance sum' : x.distance.sum(),
            'distance mean': x.distance.mean(),
            'distance MAD': smrb.mad(x.distance),
            'distance MAD C1': mad_c1(x.distance)}
    # index fixes the column order; here it simply follows the dict's insertion order
    return pd.Series(data, index=list(data))

mydf.groupby('cat').apply(my_agg)

So there might not be a need for a new method, just a better example in the docs.

@tdpetrou, you are right. I had forgotten how apply works, because I use my own version on account of the double execution in the fast/slow path selection process.

Hmm, indeed. However, there is no chance I would have thought of using it in an aggregation context just by reading the docs...
Moreover, I still find the solution with apply a bit too convoluted. The agg_assign approach seemed more straightforward and understandable.

Since there was never really a statement about it: is the dict-of-dict approach (which, although currently deprecated, is already implemented and also solves all these issues) really, definitely out of the question?

Apart from the agg_assign approach, dict-of-dict still seems to be the simplest option, and it needs no coding, just un-deprecating.

The benefit and the drawback of the agg_assign approach is that it pushes the column selection into the aggregation method. In all the examples, the x passed to the lambda is something like self.get_group(group) for each group in self, a DataFrameGroupBy object. This is nice because it cleanly separates the naming, which is in **kwargs, from the selection, which is in the function.

The drawback is that your nice, generic aggregation functions now have to be concerned with column selection. There's no free lunch! That means you'll end up with many helpers like lambda x: x[col].min. You'll also need to be careful with things like np.min, which reduces over all dimensions, vs. pd.DataFrame.min, which reduces over axis=0. That's why something like agg_assign wouldn't be equivalent to apply. apply still operates column-wise for certain methods.
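
The difference between the two reductions can be seen in a short sketch. It uses a hand-built frame standing in for one group (the "A" rows of the example data). Because calling np.min directly on a DataFrame may behave differently across pandas versions, the sketch applies it to the underlying array instead, which is an assumption made to keep the result unambiguous.

```python
import numpy as np
import pandas as pd

# Stand-in for the sub-table of one group (the "A" rows of the example data).
group = pd.DataFrame({"energy": [1.8, 1.95, 2.04], "distance": [1.2, 1.5, 1.74]})

# On the raw array, np.min with no axis collapses every dimension to one scalar.
overall_min = np.min(group.to_numpy())

# DataFrame.min reduces along axis=0 by default: one value per column.
per_column_min = group.min()

# Inside a table-wise aggregator the column therefore has to be picked explicitly.
energy_min = group["energy"].min()
```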

I'm not sure about these tradeoffs vs. the dict-of-dicts method, but I'm curious to hear other people's thoughts. Here's a rough sketch of agg_assign, which I've called agg_table to emphasize that the functions are passed tables, not columns:

from collections import defaultdict

import pandas as pd
import numpy as np
from pandas.core.groupby import DataFrameGroupBy

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)


def agg_table(self, **kwargs):
    output = defaultdict(dict)
    for group in self.groups:
        for k, v in kwargs.items():
            output[k][group] = v(self.get_group(group))

    return pd.concat([pd.Series(output[k]) for k in output],
                     keys=list(output),
                     axis=1)

DataFrameGroupBy.agg_table = agg_table

Usage

>>> gr = mydf.groupby("cat")
>>> gr.agg_table(n=len,
                 foo=lambda x: x.energy.min(),
                 bar=lambda y: y.distance.min())

   n   foo   bar
A  3  1.80  1.20
B  2  1.25  0.82
C  1  1.01  0.60

I suspect we could do a bit to make the performance of this less awful, but not as much as .agg does...

Could someone from the Pandas Core Team please explain the main reason for deprecating relabeling dicts in groupby.agg?

I could easily understand it if maintaining the code causes too many problems, but if it is about complexity for the end user, I would also opt for bringing it back, as it is pretty clear compared to the workarounds that are needed...

Thank you!

"Could someone from the Pandas Core Team please explain the main reason for deprecating relabeling dicts in groupby.agg?"

Did you see https://github.com/pandas-dev/pandas/pull/15931/files#diff-52364fb643114f3349390ad6bcf24d8fR461?

The primary reason was that dict keys were overloaded to do two things. For Series / SeriesGroupBy, they're for naming. For DataFrame / DataFrameGroupBy, they're for selecting a column.

In [32]: mydf.aggregate({"distance": "min"})
Out[32]:
distance    0.6
dtype: float64

In [33]: mydf.aggregate({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[33]:
     distance
foo       0.6

In [34]: mydf.distance.agg({"foo": "min"})
Out[34]:
foo    0.6
Name: distance, dtype: float64

In [35]: mydf.groupby("cat").agg({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py:4201: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[35]:
    distance
         foo
cat
A       1.20
B       0.82
C       0.60

In [36]: mydf.groupby("cat").distance.agg({"foo": "min"})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[36]:
      foo
cat
A    1.20
B    0.82
C    0.60

This probably isn't the most confusing thing in pandas, so perhaps we could revisit it :) I'm presumably missing some edge cases. But even if we do remove dict-of-dicts aggregations, we still have the inconsistency between naming and column selection:

For Series / SeriesGroupBy, the dictionary keys are always for naming the output.

For DataFrame / DataFrameGroupby, the dict keys are always for selection. With dict-of-dicts we select a column, and then the inner dict is for naming the output, just like Series / SeriesGroupBy.
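
The overloading of dict keys can be reproduced with the two non-groupby calls from the session above. This is a minimal sketch that assumes a pandas version in which Series.agg still accepts a flat dict of this form. The same kind of dict means "select" on a DataFrame and "name" on a Series.

```python
import pandas as pd

mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

# DataFrame: the key "distance" selects the column to aggregate.
selected = mydf.agg({"distance": "min"})

# Series: the key "foo" is only the label of the output entry.
named = mydf["distance"].agg({"foo": "min"})
```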

We discussed this briefly before (somewhere in the long discussion about the deprecation), and I proposed something similar here: https://github.com/pandas-dev/pandas/pull/14668#issuecomment-274508089. But in the end only the deprecation was implemented, and not the ideas for making the other use of dicts (the 'renaming' function) easier.

The problem was that dicts were used both for 'selection' (which column do you want this function applied to) and for 'renaming' (what should the resulting column be called when this function is applied). An alternative syntax, apart from dicts, could be keyword arguments, as discussed here in the agg_assign proposal.
I am still in favor of exploring this possibility, whether in agg itself or in a new method like agg_assign.

What I proposed back then was something similar to agg_assign, but using a dict per keyword instead of a lambda function. Translated to the example here, this would be something like:

mydf.groupby('cat').agg(
    energy_sum={'energy': 'sum'},
    energy_p98={'energy': lambda x: np.percentile(x, 98)},
    energy_p17={'energy': lambda x: np.percentile(x, 17)},
    distance_sum={'distance': 'sum'},
    distance_mean={'distance': 'mean'},
    distance_mad={'distance': smrb.mad},
    distance_mad_c1={'distance': mad_c1})

I am not sure this is necessarily more readable or easier to write than the version with all the lambdas, but it could potentially be more performant, since pandas can still use the optimized implementations of sum, mean, etc. for those columns where you do not have a lambda or user-specified function.

A big question with this approach is what df.groupby('cat').agg(foo='mean') would mean. Logically, it would apply 'mean' to all columns, since you didn't make any selection (similar to {'col1' : {'foo': 'mean'}, 'col2': {'foo':'mean'}, 'col3': ...} before). But that would result in multi-indexed columns, whereas in the example above I think it would be nice not to end up with MI columns.

I think the above can be done in a backward-compatible way inside the existing agg, but the question is whether this is needed.
I also think this would extend nicely to the series case, like:

mydf.groupby('cat').distance.agg(
    distance_sum='sum',
    distance_mean='mean',
    distance_mad=smrb.mad,
    distance_mad_c1=mad_c1)

(and you could even consider doing the above once for 'distance' and once for 'energy' and concatenating the results, if you don't like all the dicts / lambdas)
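
For comparison only (this is not part of the proposal in this thread): pandas releases from 0.25 onward accept a closely related keyword form, often called named aggregation, where each keyword is the output name and the value is a (column, function) pair rather than a dict. The sketch below assumes such a pandas version and uses the example data from the original post; the SeriesGroupBy call mirrors the series case shown above.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

# DataFrameGroupBy: keyword = output name, value = (input column, function).
table = mydf.groupby("cat").agg(
    total_energy=("energy", "sum"),
    energy_p98=("energy", lambda x: np.percentile(x, 98)),
    energy_p17=("energy", lambda x: np.percentile(x, 17)),
    average_distance=("distance", "mean"),
)

# SeriesGroupBy: the column is already selected, so keywords map straight to functions.
distance = mydf.groupby("cat")["distance"].agg(
    distance_sum="sum",
    distance_mean="mean",
)
```

In this form the result has flat columns, and two lambdas on the same input column no longer collide, because the output names come from the keywords rather than from the function names.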

@TomAugspurger In your simple example implementation of agg_table, wouldn't it be better to iterate over the different functions to be applied instead of iterating over the groups, and in the end concatenate the new columns with axis=1 instead of concatenating the newly formed rows with axis=0 ?
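
One way to read this suggestion is sketched below: the outer loop runs over the named functions, each function yields one column, and the columns are placed side by side with axis=1. The helper name agg_table_by_function is hypothetical, and the sketch only reorders the loops of agg_table; it makes no claim about performance.

```python
import pandas as pd

mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})


def agg_table_by_function(gb, **kwargs):
    """Hypothetical variant of agg_table: loop over functions, then over groups."""
    columns = {
        # One Series per function, indexed by group key.
        name: pd.Series({key: func(sub) for key, sub in gb})
        for name, func in kwargs.items()
    }
    # Each function contributes one column; align them side by side.
    return pd.concat(columns, axis=1)


result = agg_table_by_function(
    mydf.groupby("cat"),
    n=len,
    foo=lambda x: x.energy.min(),
    bar=lambda y: y.distance.min(),
)
```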

BTW, @zertrin @tdpetrou @smcateer @pirsquared and others, thanks a lot for raising this issue and giving such detailed feedback. Such feedback and community involvement is very important!

I actually really like the pattern suggested by @tdpetrou (using apply with a function that returns a Series) - probably even better than the dict of dicts.

If the function returns pd.Series(data, index=data.keys()), are we guaranteed to get the indices in the right order? (Just thinking about how best to implement the pattern in my code - at the risk of drifting off topic.)

Edit: sorry, I misunderstood the point of the index argument (it is optional here and only needed if you want to specify the order of the columns; returning pd.Series(data) does the job for me).

Would @tdpetrou's example work with first & last aggregations?

I had to resort to head/tail, like this:

def agg_funcs(x):
    data = {'start':x['DATE_TIME'].head(1).values[0],
           'finish':x['DATE_TIME'].tail(1).values[0],
           'events':len(x['DATE_TIME'])}
    return pd.Series(data, index = list(data.keys()))

results = df.groupby('col').apply(agg_funcs)
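
A possible variation on the head/tail workaround is to use positional indexing with .iloc inside the function. The sketch below uses made-up data (the values in df are hypothetical; only the column names col and DATE_TIME come from the comment above) and selects the DATE_TIME column explicitly before calling apply. Note that .iloc[0] and .iloc[-1] are purely positional, which may differ from how a dedicated first/last aggregation treats missing values.

```python
import pandas as pd

# Hypothetical data; only the column names are taken from the comment above.
df = pd.DataFrame({
    "col": ["a", "a", "b"],
    "DATE_TIME": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
})


def agg_funcs(x):
    data = {
        "start": x["DATE_TIME"].iloc[0],    # first row of the group (positional)
        "finish": x["DATE_TIME"].iloc[-1],  # last row of the group (positional)
        "events": len(x),
    }
    return pd.Series(data)


# Selecting the column explicitly keeps the grouping column out of each sub-table.
results = df.groupby("col")[["DATE_TIME"]].apply(agg_funcs)
```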

I would still like to address this, but I don't think it will be done for 0.23.

Could @tdpetrou's approach work without defining a function that we will never use again in our code? Coming from the Q/Kdb+ world (similar to SQL), I am confused about why we need to create a temporary variable or function for a simple select statement.

OP here.

Honestly, after all this time and plenty of discussion in #15931 and here, I am still not convinced that deprecating relabeling dicts is a good idea.

In the end, none of the alternatives proposed here is more intuitive to users than the current relabeling dict approach, IMHO. When it was in the documentation, a single example was enough to make clear how it works, and it is very flexible.

Of course the pandas developers may still think otherwise; I'm just chiming in with the point of view of a user.

Even the relabeling dict approach is not very intuitive. In my opinion the syntax should be similar to SQL - func(column_name) as new_column_name. In Python we can do this with a three-item tuple: (func, column_name, new_column_name). This is how dexplo does groupby aggregation.

dexplo

@zertrin do you have feedback on my proposal above: https://github.com/pandas-dev/pandas/issues/18366/#issuecomment-349089667
In the end, it more or less inverts the order of the dict: instead of "{col: {name: func}}" it would be something like "**{name: {col: func}}".

@jorisvandenbossche I have considered your approach. The thing is, I don't really see what additional advantages it brings over the current approach.

To put it more bluntly, given the following choices:

  1. Undeprecate the current behavior, which works well (remove a few lines of deprecation code and re-add the piece of documentation that was removed).
  2. Implement your proposal (significant changes to the code, continuing the deprecation of the current approach, and every user has to adapt their code).

I don't see why we should choose 2 unless it brings meaningful and tangible advantages from a developer and user perspective.

To address some of the points in your proposal above:

The problem is that dicts were used both for 'selection' (on which column do you want this function to be applied) and for 'renaming' (what should the resulting column name be when applying this function).

Since it was nicely documented before, I don't believe it was an issue for users. Personally, I got the point immediately when looking at the examples in the documentation. (EDIT: and I thought: _"yay! very useful construct, it exactly matches what I was looking for. Nice."_)

An alternative syntax, apart from dicts, could be keyword arguments.

One of the attractive things about the dict-of-dict approach is that users can easily generate it dynamically with other code. As you pointed out in the comment just above this one, moving to keyword arguments as in your proposal would still allow this via the **{name: {col: func}} construct. So I'm not against your proposal. I just don't see the added value and necessity of such a change when the currently implemented system already achieves the same level of functionality.

In the end, your proposal would be _okay_ if the pandas core devs have strong feelings against the current approach. I just don't see any benefit as a _user_. (In fact, I see the drawback of having to change all existing user code to make it work again with the new proposal.)

@zertrin we discussed this yesterday with some of the core devs, but didn't get around to writing a summary here. So I will do that now, reflecting only our thoughts from yesterday, before answering your comment.


So first, the notion that a basic functionality like the SQL "SELECT avg(col2) as col2_avg" should work and be easy is something we completely agree on, and we really want to have a solution for it.

Apart from the original reasons we decided to deprecate this (which may or may not be that strong), the current (deprecated) dict of dicts is also not ideal, because it creates a MultiIndex that you actually never want:

In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': range(3), 'C': [.1, .2, .3]})

In [3]: gr = df.groupby('A')

In [4]: gr.agg({'B': {'b_sum': 'sum'}, 'C': {'c_mean': 'mean', 'c_count': 'count'}})
Out[4]: 
        C            B
  c_count c_mean b_sum
A                     
a       2    0.2     2
b       1    0.2     1

In the above, the first level of the MultiIndex is superfluous, as you already specifically renamed the columns (in the example in the OP, this is also directly followed by dropping the first level of the columns).
It is however hard to change this, because you can also do things like gr.agg(['sum', 'mean']) or (mixed) gr.agg({'B': ['sum', 'mean'], 'C': {'c_mean': 'mean', 'c_count': 'count'}}), where the MultiIndex is needed and makes sense.

So one of the proposals mentioned in the discussion above was to have a way to specify the final column names separately (e.g. https://github.com/pandas-dev/pandas/issues/18366#issuecomment-346683449).
Adding, for example, an extra keyword to aggregate to specify the column names, like

gr.agg({'B': 'sum', 'C': ['mean', 'count']}, columns=['b_sum', 'c_mean', 'c_count'])

would be possible.
However, if we split the column/function specification from the new column names, we can also make this more generic than a new keyword, and do something like:

gr.agg({'B': 'sum', 'C': ['mean', 'count']}).rename(columns=['b_sum', 'c_mean', 'c_count'])

This needs https://github.com/pandas-dev/pandas/issues/14829 to be solved (something we want to do for 0.24.0).
(Important note: for this we do need to fix the duplicate-names problem of lambda functions, so we should do some kind of automatic deduplication of the names if we want to support this solution.)
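
As a rough illustration of the "rename afterwards" route with what pandas already offers, the sketch below aggregates with lists (so the result has MultiIndex columns) and then flattens the two levels into single names. It does not use the columns= keyword or the list-based rename proposed above, which are only proposals in this thread; the "_".join naming scheme is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': range(3), 'C': [.1, .2, .3]})
gr = df.groupby('A')

# Lists of functions give MultiIndex columns such as ('B', 'sum') and ('C', 'mean').
out = gr.agg({'B': ['sum'], 'C': ['mean', 'count']})

# Flatten each (column, function) pair into one name, e.g. ('C', 'mean') -> 'c_mean'.
out.columns = ["_".join(pair).lower() for pair in out.columns]
```

Deriving the names from the column pairs avoids keeping a second list in sync, but it gives no control over the names, and several lambdas on one column would all share a function name such as `<lambda>`.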


Then, we still like the keyword-argument way of renaming. Reasons for this are:

  • it is similar to how assign works in pandas, and also consistent with how groupby().aggregate() works in ibis (and similar to how it looks in e.g. dplyr in R)
  • it directly gives you the non-hierarchical column names that you would want (no MultiIndex)
  • for the simple cases (e.g. also the Series case), I think it is simpler than the dict of dicts

We still had a bit of discussion about how it could look. What I proposed above was (using the same column/function selection as in my first examples):

gr.agg(b_sum={'B': 'sum'}, c_mean={'C': 'mean'}, c_count={'C': 'count'})

You can still build this specification as a dict of dicts, but with the inner and outer levels swapped compared to the current (deprecated) version:

gr.agg(**{'b_sum': {'B': 'sum'}, 'c_mean': {'C': 'mean'}, 'c_count': {'C': 'count'}})

(we could provide an example helper function that converts the existing dicts of dicts to this version)
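
A sketch of what such a converter might look like. The name invert_relabeling_spec is hypothetical and is not a pandas function; it only rearranges plain dicts from the deprecated {col: {name: func}} layout into the inverted {name: {col: func}} layout discussed here.

```python
def invert_relabeling_spec(spec):
    """Convert {col: {name: func}} into {name: {col: func}} (hypothetical helper)."""
    inverted = {}
    for col, renames in spec.items():
        for name, func in renames.items():
            if name in inverted:
                # Output names must be unique once the column level is dropped.
                raise ValueError(f"duplicate output name: {name!r}")
            inverted[name] = {col: func}
    return inverted

old_spec = {'B': {'b_sum': 'sum'}, 'C': {'c_mean': 'mean', 'c_count': 'count'}}
new_spec = invert_relabeling_spec(old_spec)
```

The result could then be unpacked as keyword arguments with **new_spec, under the proposal as written in this comment.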

However, each dict is always only a single {col: func}, and those multiple single-element dicts look a bit strange. So an alternative we thought of is to use tuples:

gr.agg(b_sum=('B', 'sum'), c_mean=('C', 'mean'), c_count=('C', 'count'))

This looks a bit better, but on the other hand the {'B': 'sum'} dict is consistent with the other APIs for specifying the column on which to apply the function.


Both suggestions above (the easier renaming afterwards, and the keyword-based naming) are in principle orthogonal, but it could be nice to have both (or still something else, based on further discussion).

Thanks for forwarding the current thoughts from the devs here 😃

I acknowledge the (in my opinion, only) drawback of the deprecated dict-of-dict approach: the resulting MultiIndex. It could be flattened if the user passes an additional option (yeah, YAO :-/ ).

As mentioned, I am not against the second version, as long as it stays possible to:

  • generate things dynamically somehow and unpack them (thanks to the **{} construct, yay Python!)
  • keep the renaming and the aggregation specification close together (having to keep track of two lists such that their order remains the same is plain annoying as a user, IMHO)
  • use lambda or partial functions without needing workarounds because of (potentially missing or conflicting) function names.

As such, the last suggestion (with dicts or tuples for the col>func mapping) is okay, I think.

The first proposition in the previous comment can be implemented if you really want to, but my feedback is that, as a user, I wouldn't choose it over the second alternative because of the pain of keeping two lists in sync.

Discussed at the dev meeting today.

Short summary

  1. @jorisvandenbossche will try to implement gr.agg(b_sum=("B", "sum"), ...), i.e. when no arg is passed to *GroupBy.agg, interpret kwargs as <output_name>=(<selection>, <aggfunc>).
  2. Orthogonal to this issue, we'd like to implement MultiIndex.flatten and provide a flatten=True keyword to .agg.

Maybe this helps: my workaround for the deprecation is the following pair of helper functions, which replace the alias->aggr maps with lists of correctly named functions:

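def aliased_aggr(aggr, name):
    # Wrap an aggregation (a string such as 'sum', or a callable) in a
    # function whose __name__ is the desired output column name.
    if isinstance(aggr, str):
        def f(data):
            return data.agg(aggr)
    else:
        def f(data):
            return aggr(data)
    f.__name__ = name
    return f

def convert_aggr_spec(aggr_spec):
    # Turn {col: {alias: aggr}} into {col: [functions named after each alias]}.
    return {
        col: [aliased_aggr(aggr, alias) for alias, aggr in aggr_map.items()]
        for col, aggr_map in aggr_spec.items()
    }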

which gives the old behavior with:

mydf_agg = mydf.groupby('cat').agg(convert_aggr_spec({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),  # lambda
        'energy_p17': lambda x: np.percentile(x, 17),  # lambda
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,   # original function
        'distance_mad_c1': mad_c1,  # partial function wrapping the original function
    },
}))

which is the same as

mydf_agg = mydf.groupby('cat').agg({
    'energy': [ 
        aliased_aggr('sum', 'total_energy'),
        aliased_aggr(lambda x: np.percentile(x, 98), 'energy_p98'),
        aliased_aggr(lambda x: np.percentile(x, 17), 'energy_p17')
    ],
    'distance': [
         aliased_aggr('sum', 'total_distance'),
         aliased_aggr('mean', 'average_distance'),
         aliased_aggr(smrb.mad, 'distance_mad'),
         aliased_aggr(mad_c1, 'distance_mad_c1'),
    ]
})

This works for me, but it probably won't work in some corner cases ...

Update: I found out that renaming is not necessary, because tuples in an aggregation specification are interpreted as (alias, aggr). So the aliased_aggr function is not needed, and the conversion becomes:

def convert_aggr_spec(aggr_spec):
    return {
        col : [ 
           (alias,aggr) for alias, aggr in aggr_map.items() 
        ]  
        for col, aggr_map in aggr_spec.items() 
    }

I just want to chime in here as yet another user who is really, really missing the ability to aggregate a column with any function and immediately rename it on the same line. I have _never_ found myself using the MultiIndex returned by pandas. I either flatten it immediately, or I actually want to specify my column names manually because they mean something specific.

I would be happy with any of the approaches proposed here: SQL-like syntax (I actually find myself using .query() a lot in pandas already), reverting to the deprecated behavior, or any of the other suggestions. The current approach has already brought me ridicule from colleagues who use R.

I recently even found myself using PySpark instead of pandas, even though it wasn't necessary, just because I like the syntax so much more:

df.groupby("whatever").agg(
    F.max("col1").alias("my_max_col"),
    F.avg("age_col").alias("average_age"),
    F.sum("col2").alias("total_yearly_payments")
)

Also, PySpark is way more convoluted to write than pandas in most cases, yet this just looks so much cleaner! So I definitely appreciate that work on this is still being done :-)

I think we have an agreed syntax for this functionality; we need someone to implement it.

(Sent by email in reply to Thomas Kastl's comment of Wed, Mar 27, 2019, 9:01 AM.)

๋‚˜๋Š” ๋‹จ์ง€ ์ •๋ง๋กœ, ์ •๋ง๋กœ
๋ชจ๋“  ํ•จ์ˆ˜์—์„œ ์—ด์„ ์ง‘๊ณ„ํ•˜๋Š” ๊ธฐ๋Šฅ์ด ๋ˆ„๋ฝ๋œ ๊ฒฝ์šฐ
์ฆ‰์‹œ ๊ฐ™์€ ํ–‰์—์„œ ์ด๋ฆ„์„ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค. ๋‚˜๋Š” ๋‚˜ ์ž์‹ ์„ ๋ฐœ๊ฒฌํ•œ ์ ์ด ์—†๋‹ค
ํŒฌ๋”๊ฐ€ ๋ฐ˜ํ™˜ํ•œ MultiIndex ์‚ฌ์šฉ - ์ฆ‰์‹œ ๋ณ‘ํ•ฉํ•˜๊ฑฐ๋‚˜,
๋˜๋Š” ์‹ค์ œ๋กœ ์—ด ์ด๋ฆ„์„ ์ˆ˜๋™์œผ๋กœ ์ง€์ •ํ•˜๊ณ  ์‹ถ์Šต๋‹ˆ๋‹ค.
์‹ค์ œ๋กœ ํŠน์ •ํ•œ ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

์—ฌ๊ธฐ์— ์ œ์•ˆ๋œ ์ ‘๊ทผ ๋ฐฉ์‹ ์ค‘ ํ•˜๋‚˜์— ๋งŒ์กฑํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. SQL๊ณผ ์œ ์‚ฌํ•œ ๊ตฌ๋ฌธ
(์‹ค์ œ๋กœ ์ด๋ฏธ ํŒ๋‹ค์—์„œ .query()๋ฅผ ๋งŽ์ด ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค),
๊ฐ๊ฐ€ ์ƒ๊ฐ ๋œ ํ–‰๋™์œผ๋กœ ๋˜๋Œ๋ฆฌ๊ธฐ, ๋‹ค๋ฅธ ์ œ์•ˆ. ๊ทธ๋งŒํผ
ํ˜„์žฌ์˜ ์ ‘๊ทผ ๋ฐฉ์‹์€ ์ด๋ฏธ R์„ ์‚ฌ์šฉํ•˜๋Š” ๋™๋ฃŒ๋“ค๋กœ๋ถ€ํ„ฐ ์กฐ๋กฑ์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค.

๋‚˜๋Š” ์ตœ๊ทผ์— Pandas ๋Œ€์‹  PySpark๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์Œ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค.
๊ตฌ๋ฌธ์ด ํ›จ์”ฌ ๋” ๋งˆ์Œ์— ๋“ค๊ธฐ ๋•Œ๋ฌธ์— ํ•„์š”ํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

df.groupby("๋ฌด์—‡์ด๋“ ").agg( F.max("col1").alias("my_max_col"),
F.avg("age_col").alias("average_age"),
F.sum("col2").alias("total_yearly_payments") )

๋˜ํ•œ PySpark๋Š” ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ ํŒฌ๋”๋ณด๋‹ค ์“ฐ๊ธฐ์— ํ›จ์”ฌ ๋” ๋ณต์žกํ•ฉ๋‹ˆ๋‹ค.
์ด๊ฒƒ์€ ํ›จ์”ฌ ๋” ๊นจ๋—ํ•ด ๋ณด์ž…๋‹ˆ๋‹ค! ๊ทธ๋ž˜์„œ ๋‚˜๋Š” ํ™•์‹คํžˆ ๊ทธ ์ž‘์—…์— ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค
์ด๊ฒƒ์€ ์—ฌ์ „ํžˆ โ€‹โ€‹์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค :-)

โ€”
๋‹น์‹ ์ด ์–ธ๊ธ‰๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ด๊ฒƒ์„ ๋ฐ›๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
์ด ์ด๋ฉ”์ผ์— ์ง์ ‘ ๋‹ต์žฅํ•˜๊ณ  GitHub์—์„œ ํ™•์ธํ•˜์„ธ์š”.
https://github.com/pandas-dev/pandas/issues/18366#issuecomment-477168767 ,
๋˜๋Š” ์Šค๋ ˆ๋“œ ์Œ์†Œ๊ฑฐ
https://github.com/notifications/unsubscribe-auth/ABQHIkCYYsah5siYA4_z0oop_ufIB3h8ks5va3nJgaJpZM4QjSLL
.

๋‚˜๋Š” 0.25.0์— ๋Œ€ํ•ด ์ด๊ฒƒ์„ ์–ป์œผ๋ ค๊ณ  ๋…ธ๋ ฅํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

I've put up a PR at https://github.com/pandas-dev/pandas/pull/26399. The basic idea is to allow this mixture of renaming and column-specific aggregation by using **kwargs, with the understanding that the values should be tuples of (selection, aggfunc).

In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                    'height': [9.1, 6.0, 9.5, 34.0],
   ...:                    'weight': [7.9, 7.5, 9.9, 198.0]})

In [3]: df
Out[3]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [4]: df.groupby('kind').agg(min_height=('height', 'min'), max_weight=('weight', 'max'))
Out[4]:
      min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0

This has a few limitations

  • It's somewhat peculiar compared to the rest of pandas. The syntax (output_name=(selection, aggfunc)) doesn't really appear anywhere else (though .assign does use the output_name=... pattern)
  • The spelling for output names that aren't Python identifiers is ugly: .agg(**{'output name': (col, func)})
  • It's Python 3.6+ only, or we need some ugly hacks for 3.5 and earlier, since the order of **kwargs wasn't previously preserved
  • The aggfunc needs to be a unary function. If your custom aggfunc needs additional arguments, you'll need to partially apply it first

And there's an implementation detail: multiple lambda aggfuncs for the same column aren't supported yet, though that can be fixed later.
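
To make the second limitation concrete, here is a small sketch of the **{...} spelling for output names that are not valid Python identifiers, reusing the kind/height/weight frame from the example above. It assumes a pandas version that includes this keyword-based aggregation (0.25.0 or later); pd.NamedAgg is shown only as an equivalent, more explicit way to write the same (column, aggfunc) pair.

```python
import pandas as pd

df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]})

# Output names containing spaces cannot be written as keywords, so unpack a dict.
out = df.groupby('kind').agg(**{
    'min height': ('height', 'min'),
    'max weight': pd.NamedAgg(column='weight', aggfunc='max'),
})
```

Because the specification is an ordinary dict, it can also be built dynamically before being unpacked, which is the property requested earlier in the thread.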


I suspect that most people subscribed here would be supportive of some alternative to the deprecated behavior. What do people think of this one specifically?

cc @WillAyd in case I missed any of your concerns.

Hi @TomAugspurger,

Thanks for moving this forward.

This has a few limitations

  • It's somewhat peculiar compared to the rest of pandas. The syntax (output_name=(selection, aggfunc)) doesn't really appear anywhere else (though .assign does use the output_name=... pattern)

I can't help but feel that this kind of argument is quite similar to the one that motivated deprecating the existing implementation in the first place.

Could you share why, with respect to that particular argument, we benefit more from this new way than from the old way?

One benefit I can already think of is that (for py3.6+) we can choose the output order of the columns individually.

  • The spelling for output names that aren't Python identifiers is ugly: .agg(**{'output name': (col, func)})

In that regard, the old way was better. But as I said earlier, as long as it is possible to use the **{...} construct to build the aggregation dynamically, I would be happy enough.

  • It's Python 3.6+ only, or we need some ugly hacks for 3.5 and earlier, since the order of **kwargs wasn't previously preserved

How did it work before (with the existing dict-of-dict feature)? Was the order guaranteed in some way?

  • The aggfunc needs to be a unary function. If your custom aggfunc needs additional arguments, you'll need to partially apply it first

Just to confirm my understanding: the aggfunc can be any callable that returns a valid value, right? (In addition to the "often used" string aggfuncs like 'min', 'max', etc.) Is there any difference compared to before? (I.e. wasn't the unary limitation already present?)

And there's an implementation detail: multiple lambda aggfuncs for the same column aren't supported yet, though that can be fixed later.

Yeah, that one is kind of annoying, but as long as it is just a temporary limitation and fixing it is on the table, that could work.

I suspect that most people subscribed here would be supportive of some alternative to the deprecated behavior. What do people think of this one specifically?

Well, in any case I think that aggregating and renaming in one step is really important to keep. If the old behavior is really not an option, then this alternative could do.

Could you share why, with respect to that particular argument, we benefit more from this new way than from the old way?

I may be misremembering, but I believe SeriesGroupby.agg and DataFrameGroupby.agg give different meanings to the outer key of a dictionary (is it a column selection or an output name?). With this syntax, we can consistently have the keyword mean the output name.

In that regard, the old way was better.

Is the difference just the **? Otherwise, I think the same limitations are shared.

How did it work before (with the existing dict-of-dict feature)? Was the order guaranteed in some way?

By sorting the keys, which is what I'm doing in the PR now.

Just to confirm my understanding: the aggfunc can be any callable that returns a valid value, right?

Here's the difference:

In [21]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})

In [22]: def aggfunc(x, myarg=None):
    ...:     print(myarg)
    ...:     return sum(x)
    ...:

In [23]: df.groupby("A").agg({'B': {'foo': aggfunc}}, myarg='bar')
/Users/taugspurger/sandbox/pandas/pandas/core/groupby/generic.py:1308: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super().aggregate(arg, *args, **kwargs)
None
Out[23]:
    B
  foo
A
a   3

With the alternative proposal, we're reserving **kwargs for output column names. So you would need to use functools.partial(aggfunc, myarg='bar').
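
A minimal sketch of that partial-application step, using the same frame and aggfunc as above (without the print). It assumes a recent pandas release; the output name foo is taken from the deprecated example, and whether partial objects are accepted may depend on the pandas version.

```python
from functools import partial

import pandas as pd

df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})

def aggfunc(x, myarg=None):
    # myarg stands in for any extra argument a custom aggregation might need.
    return sum(x)

# Keywords are reserved for output names, so bind myarg before passing the function.
result = df.groupby("A").agg(foo=('B', partial(aggfunc, myarg='bar')))
```

Binding the argument up front keeps the aggregation unary, which is what the keyword-based form expects.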

Ok, thanks. I think the proposed approach is 👍 for a first iteration (and will really be okay as a replacement as soon as the multiple-lambda implementation limitation is removed).
