This issue was created based on the discussion in #15931, following the deprecation of relabeling dicts in groupby.agg. Much of what is summarized below was already covered in that discussion; start in particular with https://github.com/pandas-dev/pandas/pull/15931#issuecomment-336139085.
The motivation behind the deprecation in #15931 was mostly to provide a consistent interface for agg() between Series and DataFrame (see also #14668 for context).
The relabeling functionality with a nested dict was described as too complex and/or inconsistent, and was therefore deprecated.
However, this comes at a price: being unable to aggregate and rename at the same time causes very annoying problems, and it breaks backward compatibility wherever no sensible workaround exists.
In particular, two or more aggregators built from partial functions can no longer be applied unless their __name__ attribute is altered.
_(Please note: this is a crafted example meant to show the problem in as little code as possible, but every issue demonstrated here has bitten me in real life since the change, in situations not as simple as this one.)_
mydf = pd.DataFrame(
{
'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
},
index=range(6)
)
cat distance energy
0 A 1.20 1.80
1 A 1.50 1.95
2 A 1.74 2.04
3 B 0.82 1.25
4 B 1.01 1.60
5 C 0.60 1.01
Easy to write and read, and it works as expected:
import numpy as np
import statsmodels.robust as smrb
from functools import partial
# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)
# renaming and specifying the aggregators at the same time
# note that I want to choose the resulting column names myself
# for example "total_xxxx" instead of just "sum"
mydf_agg = mydf.groupby('cat').agg({
'energy': {
'total_energy': 'sum',
'energy_p98': lambda x: np.percentile(x, 98), # lambda
'energy_p17': lambda x: np.percentile(x, 17), # lambda
},
'distance': {
'total_distance': 'sum',
'average_distance': 'mean',
'distance_mad': smrb.mad, # original function
'distance_mad_c1': mad_c1, # partial function wrapping the original function
},
})
The result is:
energy distance
total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A 5.79 2.0364 1.8510 4.44 1.480 0.355825 0.240
B 2.85 1.5930 1.3095 1.83 0.915 0.140847 0.095
C 1.01 1.0100 1.0100 0.60 0.600 0.000000 0.000
All that is left is:
# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)
Happy dance praising pandas!
import numpy as np
import statsmodels.robust as smrb
from functools import partial
# median absolute deviation as a partial function
# in order to demonstrate the issue with partial functions as aggregators
mad_c1 = partial(smrb.mad, c=1)
# no way of choosing the destination's column names...
mydf_agg = mydf.groupby('cat').agg({
'energy': [
'sum',
lambda x: np.percentile(x, 98), # lambda
lambda x: np.percentile(x, 17), # lambda
],
'distance': [
'sum',
'mean',
smrb.mad, # original function
mad_c1, # partial function wrapping the original function
],
})
The lambda functions all produce columns named <lambda>, which results in:
SpecificationError: Function names must be unique, found multiple named <lambda>
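Backward-incompatible regression: two different lambdas can no longer be applied to the same original column.
If you remove lambda x: np.percentile(x, 98) from the above, the same problem occurs with the partial function, which inherits its function name from the original function: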
SpecificationError: Function names must be unique, found multiple named mad
Finally, after overwriting the __name__ attribute of the partial (for example, mad_c1.__name__ = 'mad_c1'), we get:
energy distance
sum <lambda> sum mean mad mad_c1
cat
A 5.79 1.8510 4.44 1.480 0.355825 0.240
B 2.85 1.3095 1.83 0.915 0.140847 0.095
C 1.01 1.0100 0.60 0.600 0.000000 0.000
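The __name__ workaround can be sketched in isolation. This is a minimal sketch, not the author's code: spread is a hypothetical aggregator standing in for smrb.mad so that the example needs only pandas. A functools.partial object carries no __name__ of its own, which is why a distinct one has to be assigned by hand before both callables can be used on the same column.

```python
from functools import partial

import pandas as pd

df = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})


def spread(values, scale=1.0):
    """Hypothetical aggregator: scaled range (max - min) of one group."""
    return (values.max() - values.min()) * scale


half_spread = partial(spread, scale=0.5)
# Assign a distinct name by hand so the two aggregators do not clash
half_spread.__name__ = "half_spread"

result = df.groupby("cat")["distance"].agg([spread, half_spread])
print(result)
```

The output columns take their names from the callables, so the data ends up labeled by an implementation detail of the code that produced it.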
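The renaming still has to be dealt with in a separate step.
There is no control over the column names after aggregation. The best that can be obtained in an automated way is some combination of the original column name and the _aggregate function's name_, like this: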
mydf_agg.columns = ['_'.join(col) for col in mydf_agg.columns]
This results in:
energy_sum energy_<lambda> distance_sum distance_mean distance_mad distance_mad_c1
cat
A 5.79 1.8510 4.44 1.480 0.355825 0.240
B 2.85 1.3095 1.83 0.915 0.140847 0.095
C 1.01 1.0100 0.60 0.600 0.000000 0.000
If you really need different names, you can do it like this:
mydf_agg.rename(columns={
"energy_sum": "total_energy",
"energy_<lambda>": "energy_p17",
"distance_sum": "total_distance",
"distance_mean": "average_distance"
}, inplace=True)
However, this means you must take care to keep the renaming code (which now has to live somewhere else in the code) in sync with the code where the aggregation is defined...
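The whole aggregate, flatten, rename workaround can be put together as follows. This is a minimal sketch that uses only string aggregators so it stays self-contained. Note that DataFrame.rename treats a bare dict as a mapping for the row index, so the mapping is passed as columns= here.

```python
import pandas as pd

mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

# Step 1: aggregate; the result has two-level (column, function) columns
agg = mydf.groupby("cat").agg({"energy": ["sum"], "distance": ["sum", "mean"]})

# Step 2: flatten the two levels into "<column>_<function>" strings
agg.columns = ["_".join(col) for col in agg.columns]

# Step 3: rename; this mapping must be kept in sync with step 1 by hand
agg = agg.rename(columns={
    "energy_sum": "total_energy",
    "distance_sum": "total_distance",
    "distance_mean": "average_distance",
})
print(agg)
```

The keys in step 3 are derived from the function names in step 1, which is exactly the coupling described above.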
Sad pandas user (who still loves pandas, of course).
I am all for consistency, and at the same time I deeply regret the deprecation of the _aggregate and rename_ functionality. I hope the examples above make the pain points clear.
_Optional read:_
Regarding the discussion in the pull request mentioned above, which has already been going on for several months, I only recently realized one of the reasons this deprecation bothers me so much: "aggregate and rename" is a natural thing to do with GROUP BY aggregations in SQL, because in SQL you usually give the destination column name right next to the aggregation expression, e.g. SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var FROM mytable GROUP BY col1.
I am _not_ saying that pandas must necessarily provide the same functionality as SQL, of course. But the examples above show why I consider the dict-of-dict API a clean and simple solution for many use cases.
(* I personally do not agree that the dict-of-dict approach is complex.)
@zertrin: Thanks for putting this together. I saw that there was a lot of discussion about this in #15931. I have not been able to read it completely, so I cannot comment at the moment. Still, let me ping some people:
@jreback @jorisvandenbossche @TomAugspurger @chris-b1
I agree that, in this example, renaming with the current agg implementation is very clunky and broken. The nested dicts are somewhat complex, but writing them the way you did makes it very clear what is happening.
I suppose a names parameter could be added to agg, taking a dictionary that maps the aggregated columns to their new names. Another parameter, drop_index, could be added as a boolean to decide whether the upper index level is kept.
The syntax would then look like this:
agg_dict = {'energy': ['sum',
lambda x: np.percentile(x, 98), # lambda
lambda x: np.percentile(x, 17), # lambda
],
'distance': ['sum',
'mean',
smrb.mad, # original function
mad_c1, # partial function wrapping the original function
]
}
name_dict = {'energy':['energy_sum', 'energy_p98', 'energy_p17'],
'distance':['distance_sum', 'distance_mean', 'distance_mad', 'distance_mad_c1']}
mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)
Or, an entirely new method agg_assign could be created, which would work similarly to DataFrame.assign:
mydf.groupby('cat').agg_assign(energy_sum=lambda x: x.energy.sum(),
energy_p98=lambda x: np.percentile(x.energy, 98),
energy_p17=lambda x: np.percentile(x.energy, 17),
distance_sum=lambda x: x.distance.sum(),
distance_mean=lambda x: x.distance.mean(),
distance_mad=lambda x: smrb.mad(x.distance),
distance_mad_c1=lambda x: mad_c1(x.distance))
I actually like this option much better.
For what it's worth, I am also strongly in favor of not deprecating the functionality.
A big reason for me is that there is something very odd about mixing Python's function namespace (something tied to a particular implementation) with the data in the column names (something that surely should not know about the implementation). The fact that we are seeing columns (potentially several columns) named '<lambda>' causes me severe cognitive dissonance.
The renaming approach grates because it has an intermediate step in which unnecessary (and exposed) column names are carried around. Furthermore, those names are difficult to rename reliably and systematically, because they potentially depend on the implementation.
Aside from that, the nested dict functionality is admittedly complex, but it is a complex operation that is being performed.
TL;DR Please don't deprecate. :)
My contribution is motivated by two things.
First, pandas Series and DataFrame objects have a pipe method to facilitate pipelining. Its documentation explains that pipe can be used to proxy for methods in lieu of subclassing. In the same spirit, the new GroupBy.pipe can play a similar role and let us build proxy methods for groupby objects.
I'll use @zertrin's example:
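import numpy as np
import pandas as pd
import statsmodels.robust as smrb
from functools import partial
# median absolute deviation as a partial function, as in the original example
mad_c1 = partial(smrb.mad, c=1)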
# The DataFrame offered up above
mydf = pd.DataFrame(
{
'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
},
index=range(6)
)
# Identical dictionary passed to `agg`
funcs = {
'energy': {
'total_energy': 'sum',
'energy_p98': lambda x: np.percentile(x, 98), # lambda
'energy_p17': lambda x: np.percentile(x, 17), # lambda
},
'distance': {
'total_distance': 'sum',
'average_distance': 'mean',
'distance_mad': smrb.mad, # original function
'distance_mad_c1': mad_c1, # partial function wrapping the original function
},
}
# Write a proxy method to be passed to `pipe`
def agg_assign(gb, fdict):
data = {
(cl, nm): gb[cl].agg(fn)
for cl, d in fdict.items()
for nm, fn in d.items()
}
return pd.DataFrame(data)
# All the API we need already exists with `pipe`
mydf.groupby('cat').pipe(agg_assign, fdict=funcs)
This results in:
distance energy
average_distance distance_mad distance_mad_c1 total_distance energy_p17 energy_p98 total_energy
cat
A 1.480 0.355825 0.240 4.44 1.8510 2.0364 5.79
B 0.915 0.140847 0.095 1.83 1.3095 1.5930 2.85
C 0.600 0.000000 0.000 0.60 1.0100 1.0100 1.01
The pipe method makes adding new API unnecessary in many cases. It also provides a way to replace the deprecated functionality we are discussing. I would therefore be inclined to go ahead with the deprecation.
I really like tdpetrou's idea of using names=name_dict.
This could make everyone happy. It lets us rename the columns easily as needed.
Actually, as I said in my initial post, this does not solve the problem of decoupling the place where the aggregate operation is defined from the name of the resulting column, and it requires extra effort to make sure both stay "synchronized".
I am not saying it is a bad solution (after all, it solves the other problems), but it would not be as easy and clear as the dict-of-dict approach. At writing time you need to keep both dicts of lists synchronized, and when reading the source, the reader has to make an effort to match the names in the second dict of lists with the aggregate definitions in the first dict of lists. That is twice the effort in each case.
The nested dicts are somewhat complex, but writing them the way you did makes it very clear what is happening.
I still don't understand why everyone seems to say that a dict of dicts is complex. To me it is the clearest way of doing it.
That said, if the names keyword is the only solution the pandas team is comfortable with, it would still be an improvement over the current situation.
@pirsquared: an interesting solution using the current API, although in my opinion it is not easy to grasp (I don't really understand how it works :confused:).
I started a thread on the data science subreddit: "What do you hate about pandas?". Someone brought up their contempt for the MultiIndex returned after a groupby and pointed to the dplyr do verb, which is implemented in plydata. It happens to work exactly like agg_assign, which was quite interesting.
@zertrin agg_assign would be superior to the dict-of-dict approach: it would be identical to SQL aggregations and would also allow multiple columns to interact with one another within the aggregation. It would also work the same way as DataFrame.assign.
Any thoughts, @jreback @TomAugspurger?
...
mydf.groupby('cat').agg(agg_dict, names=name_dict, drop_index=True)
This solves the problem, but the keys and values have to be aligned in two places. I think an API that does not require such book-keeping code (as suggested for .agg_assign) is less error prone.
There is also the issue of the clean-up code after using the API. When a groupby operation returns a DataFrame with a MultiIndex, in most cases the user undoes the MultiIndex. The straightforward, declarative way of using .agg_assign suggests no hierarchy, no MultiIndex output, and no clean-up afterwards.
Based on usage patterns, I think multi-index output should be strictly opt-in, not opt-out.
I was initially skeptical about the agg_assign proposal. In particular, consider the possibility of using it in the form agg_assign(**relabeling_dict), with relabeling_dict defined like this:
relabeling_dict = {
'energy_sum': lambda x: x.energy.sum(),
'energy_p98': lambda x: np.percentile(x.energy, 98),
'energy_p17': lambda x: np.percentile(x.energy, 17),
'distance_sum': lambda x: x.distance.sum(),
'distance_mean': lambda x: x.distance.mean(),
'distance_mad': lambda x: smrb.mad(x.distance),
'distance_mad_c1': lambda x: mad_c1(x.distance)
}
This would be very flexible and would solve all the issues mentioned in my OP.
@zertrin @has2k1
I have been thinking about this a bit more, and this functionality already exists with apply. You simply return a Series whose index holds the new names and whose values are the aggregations. This allows spaces in the names and lets you order the columns exactly as you wish:
def my_agg(x):
data = {'energy_sum': x.energy.sum(),
'energy_p98': np.percentile(x.energy, 98),
'energy_p17': np.percentile(x.energy, 17),
'distance sum' : x.distance.sum(),
'distance mean': x.distance.mean(),
'distance MAD': smrb.mad(x.distance),
'distance MAD C1': mad_c1(x.distance)}
return pd.Series(data, index=list_of_column_order)
mydf.groupby('cat').apply(my_agg)
So a new method may not be needed; what is needed instead is a better example in the documentation.
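A self-contained variant of this pattern might look like the sketch below. It drops the statsmodels aggregators and the list_of_column_order variable from the snippet above so that it runs with pandas and NumPy alone, and it selects the value columns before calling apply so that the grouping column is not handed to the function.

```python
import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})


def my_agg(group):
    # Keys become the output column names; spaces are allowed
    return pd.Series({
        "energy_sum": group["energy"].sum(),
        "energy_p98": np.percentile(group["energy"], 98),
        "distance mean": group["distance"].mean(),
    })


# Each group is passed to my_agg as a DataFrame holding the selected columns
result = mydf.groupby("cat")[["energy", "distance"]].apply(my_agg)
print(result)
```

Every aggregation here runs as plain Python per group, so this trades the optimized paths of agg for naming flexibility.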
@tdpetrou, you are right. I had forgotten how apply works, because I use my own version due to the double execution in the fast-slow path selection process.
Indeed, there is no chance I would have thought of using it in an aggregation context just from reading the documentation...
Moreover, I still find the apply solution a bit too convoluted. The agg_assign approach seemed simpler and easier to understand.
Since there has never really been a statement about it: is the dict-of-dict approach (which is currently deprecated, but already implemented, and which solves all these problems) really out of the question?
Apart from the agg_assign approach, dict-of-dict still seems to be the simplest approach, and it requires no coding, only un-deprecating.
The benefit and the drawback of the agg_assign approach is that it pushes the column selection into the aggregation function. In all the examples, the x passed to the lambda is something like self.get_group(group) for each group in self, a DataFrameGroupBy object. This is nice because it cleanly separates the naming, which lives in **kwargs, from the selection, which lives in the function.
The drawback is that your nice, generic aggregation functions now have to be concerned with column selection. There is no free lunch! It means you will end up with many helpers like lambda x: x[col].min. You also need to be careful with things like np.min, which reduces over all dimensions, versus pd.DataFrame.min, which reduces over axis=0. That is why something like agg_assign would not be equivalent to apply: apply still operates column-wise for certain methods.
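The difference between the two kinds of reduction can be seen on a single group. This is a minimal sketch; to keep the comparison unambiguous it reduces the underlying NumPy array directly instead of calling np.min on the DataFrame.

```python
import pandas as pd

# One group of the example data (cat == "B"), as an aggregation function would see it
group = pd.DataFrame({"energy": [1.25, 1.6], "distance": [0.82, 1.01]})

# Reducing the underlying array collapses every dimension into a single scalar
overall_min = group.to_numpy().min()

# DataFrame.min reduces along axis=0 and returns one value per column
per_column_min = group.min()

print(overall_min)
print(per_column_min)
```

A table-level aggregation function therefore has to select its column first, for example lambda x: x["energy"].min(), to get a per-column result.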
I am not sure how these tradeoffs compare with the dict-of-dicts method, but I am curious to hear other people's thoughts. Here is a rough sketch of agg_assign, which I have called agg_table to emphasize that the functions are passed tables, not columns:
from collections import defaultdict
import pandas as pd
import numpy as np
from pandas.core.groupby import DataFrameGroupBy
mydf = pd.DataFrame(
{
'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
},
index=range(6)
)
def agg_table(self, **kwargs):
output = defaultdict(dict)
for group in self.groups:
for k, v in kwargs.items():
output[k][group] = v(self.get_group(group))
return pd.concat([pd.Series(output[k]) for k in output],
keys=list(output),
axis=1)
DataFrameGroupBy.agg_table = agg_table
䜿çšæ³
>>> gr = mydf.groupby("cat")
>>> gr.agg_table(n=len,
foo=lambda x: x.energy.min(),
bar=lambda y: y.distance.min())
n foo bar
A 3 1.80 1.20
B 2 1.25 0.82
C 1 1.01 0.60
I suspect we could do a bit to make the performance of this less awful, though it would not be as fast as .agg...
Could someone from the pandas core team explain the main reason for deprecating the relabeling dicts in groupby.agg?
I could easily understand it if the code were too troublesome to maintain, but if it is about complexity for the end user, I would opt for bringing it back, since it is quite clear compared with the workarounds that are needed...
Thank you!
Could someone from the pandas core team explain the main reason for deprecating the relabeling dicts in groupby.agg?
Did you see https://github.com/pandas-dev/pandas/pull/15931/files#diff-52364fb643114f3349390ad6bcf24d8fR461?
The primary reason was that dict keys were overloaded to do two things. For Series / SeriesGroupBy, they are for naming. For DataFrame / DataFrameGroupBy, they are for selecting a column.
In [32]: mydf.aggregate({"distance": "min"})
Out[32]:
distance 0.6
dtype: float64
In [33]: mydf.aggregate({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
#!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[33]:
distance
foo 0.6
In [34]: mydf.distance.agg({"foo": "min"})
Out[34]:
foo 0.6
Name: distance, dtype: float64
In [35]: mydf.groupby("cat").agg({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py:4201: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[35]:
distance
foo
cat
A 1.20
B 0.82
C 0.60
In [36]: mydf.groupby("cat").distance.agg({"foo": "min"})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
#!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[36]:
foo
cat
A 1.20
B 0.82
C 0.60
This is probably not the most confusing thing in pandas, so perhaps we could revisit it :) I am presumably missing some edge cases. But even if we remove dict-of-dicts aggregations, we still have an inconsistency between naming and column selection:
For Series / SeriesGroupBy, the dictionary keys are always for naming the output.
For DataFrame / DataFrameGroupBy, the dict keys are always for selection. With dict-of-dicts we select a column, and then the inner dict is for naming the output, just like Series / SeriesGroupBy.
We discussed this briefly before (somewhere in the long discussion about the deprecation). https:
The problem was that dicts were being used both for "selection" (which column the function should be applied to) and for "renaming" (what the resulting column should be called when the function is applied). An alternative syntax, apart from dicts, could be keyword arguments, as discussed in the agg_assign proposal.
I am still in favor of exploring this possibility, whether it ends up in agg itself or in a new method like agg_assign.
What I proposed back then was similar to agg_assign, but with a dict per keyword instead of a lambda function. Translated to the example here, it would look like this:
mydf.groupby('cat').agg(
energy_sum={'energy': 'sum'},
energy_p98={'energy': lambda x: np.percentile(x, 98)},
energy_p17={'energy': lambda x: np.percentile(x, 17)},
distance_sum={'distance': 'sum'},
distance_mean={'distance': 'mean'},
distance_mad={'distance': smrb.mad},
distance_mad_c1={'distance': mad_c1})
I am not sure this is necessarily more readable or easier to write than the version with all the lambdas, but it could perform better, because pandas can still use its optimized implementations of sum, mean, etc. on the columns where no lambda or user-specified function is involved.
A big question with this approach is what df.groupby('cat').agg(foo='mean') would mean. Logically it would apply 'mean' to all columns, since no selection was made (similar to {'col1' : {'foo': 'mean'}, 'col2': {'foo':'mean'}, 'col3': ...} before). But that would produce multi-indexed columns, whereas in the example above I think it would be nice not to end up with MultiIndex columns.
I think the above can be done in a backward-compatible way inside the existing agg, but the question is whether it is needed.
I also think this would extend nicely to the series case, like this:
mydf.groupby('cat').distance.agg(
distance_sum='sum',
distance_mean='mean',
distance_mad=smrb.mad,
distance_mad_c1=mad_c1)
(And you could even consider doing the above once for 'distance' and once for 'energy', and concatenating the results, if you don't like all the dicts / lambdas.)
@TomAugspurger In your simple example implementation of agg_table, wouldn't it be better to iterate over the different functions to be applied instead of iterating over the groups, and in the end concatenate the new columns along axis=1 instead of concatenating the newly formed rows along axis=0?
By the way, @zertrin @tdpetrou @smcateer @pirsquared and others: thank you for raising this issue and for such detailed feedback. Feedback and community involvement like this is very important!
I actually really like the pattern suggested by @tdpetrou.
When the function returns pd.Series(data, index=data.keys()), are we guaranteed to get the index in the right order? (I am just thinking about the best way to implement the pattern in my code, at the risk of drifting off topic.)
Edit: sorry, I misunderstood the point of the index argument (it is optional here, and only needed if you want to specify the order of the columns; returning pd.Series(data) works fine).
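The ordering question can be checked directly. This is a minimal sketch, assuming a Python version in which dicts keep insertion order (3.7 and later) and a pandas version that preserves that order when building a Series from a dict.

```python
import pandas as pd

data = {"start": 3, "finish": 1, "events": 2}

# Without an explicit index, the Series follows the dict's insertion order
s = pd.Series(data)

# An explicit index is only needed to impose a different column order
reordered = pd.Series(data, index=["events", "start", "finish"])

print(list(s.index))
print(list(reordered.index))
```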
Would @tdpetrou's example work with first and last aggregations?
I had to resort to head/tail like this:
def agg_funcs(x):
data = {'start':x['DATE_TIME'].head(1).values[0],
'finish':x['DATE_TIME'].tail(1).values[0],
'events':len(x['DATE_TIME'])}
return pd.Series(data, index = list(data.keys()))
results = df.groupby('col').apply(agg_funcs)
I would still like to address this, but I don't think it will be done for 0.23.
Could @tdpetrou's approach work without defining a function that we will never use again in our code? Coming from the Q / Kdb+ world (similar to SQL), I don't understand why a temporary variable or function has to be created for a simple select statement.
OP here.
Honestly, after all this time and the many discussions in #15931 and here, I am still not convinced that deprecating the relabeling dicts is a good idea.
In the end, none of the alternatives proposed here is more intuitive for users than the current relabeling dict approach, IMHO. When it was in the documentation, a single example was enough to make clear how it works, and it is very flexible.
Of course, the pandas developers may still think otherwise; I am just chiming in with a user's point of view.
Even the relabeling dict approach is not very intuitive. In my opinion, the syntax should resemble SQL: func(column_name) as new_column_name. In Python this can be done with a three-item tuple: (func, column_name, new_column_name). This is how dexplo does groupby aggregation.
@zertrin do you have any feedback on my proposal above? https:
In the end, it inverts the order of the dict: instead of "{col: {name: func}}" it becomes something like "**{name: {col: func}}".
@jorisvandenbossche I have considered your approach. The thing is, I don't really see what additional advantages it brings over the current approach.
To put it more bluntly, we have the following choices.
I don't see why we should choose option 2 unless it brings meaningful, tangible advantages from both the developer and the user perspective.
To address some of the points in the proposal above:
The problem was that dicts were being used both for "selection" (which column the function should be applied to) and for "renaming" (what the resulting column should be called when the function is applied).
Since it was well documented before, I don't think it was a problem for users.
An alternative syntax, apart from dicts, could be keyword arguments.
One of the attractive things about the dict-of-dict approach is that users can easily generate it dynamically from other code. As you pointed out in the comment just above this one, moving to keyword arguments as you propose would still allow this via the **{name: {col: func}} construct. So I am not against your proposal. I just don't see the added value or the need for such a change when the currently implemented system already achieves the same level of functionality.
In the end, your proposal would be _okay_ if the pandas core devs have strong feelings against the current approach. I just don't see any benefit as a _user_. (In fact, I see the drawback of having to change all existing user code to make it work again with the new proposal.)
@zertrin we discussed this yesterday with some of the core developers.
So, first of all: the notion that a basic operation like the SQL "SELECT avg(col2) as col2_avg" should work and be easy is something we completely agree on, and we really need a solution for it.
Apart from the original reasons for which we decided to deprecate this (which may or may not be that strong), the current (deprecated) dicts of dicts are not ideal either, because they create a MultiIndex that you actually never want:
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': range(3), 'C': [.1, .2, .3]})
In [3]: gr = df.groupby('A')
In [4]: gr.agg({'B': {'b_sum': 'sum'}, 'C': {'c_mean': 'mean', 'c_count': 'count'}})
Out[4]:
C B
c_count c_mean b_sum
A
a 2 0.2 2
b 1 0.2 1
In the above, the first level of the MultiIndex is superfluous, since you have already renamed the columns specifically (in the example in the OP, the first level of the columns is dropped right afterwards).
It is hard to change this, however, because you can also do things like gr.agg(['sum', 'mean']) or (mixed) gr.agg({'B': ['sum', 'mean'], 'C': {'c_mean': 'mean', 'c_count': 'count'}}), where the MultiIndex is needed and makes sense.
So one of the proposals mentioned in the discussion above was to have a way to specify the final column names separately (e.g. https://github.com/pandas-dev/pandas/issues/18366#issuecomment-346683449).
For example, adding an extra keyword to aggregate to specify the column names, like this:
gr.agg({'B': 'sum', 'C': ['mean', 'count']}, columns=['b_sum', 'c_mean', 'c_count'])
would be possible.
However, if we split the column/function specification from the new column names, we can also make this more generic than a new keyword and do something like:
gr.agg({'B': 'sum', 'C': ['mean', 'count']}).rename(columns=['b_sum', 'c_mean', 'c_count'])
This requires https://github.com/pandas-dev/pandas/issues/14829 to be solved (something we want to do for 0.24.0).
(Important note: for this we need to fix the duplicate-names problem of lambda functions, so if we want to support this solution we should do some kind of automatic de-duplication of the names.)
Still, we like the keyword-argument way of renaming. The reasons are as follows.
It is similar to how assign works in pandas, and it is also consistent with how groupby().aggregate() works in ibis (and similar to how it looks in, for example, dplyr in R). There is still a bit of discussion about what it could look like. What I proposed above was (to use the column/function selection equivalent to my first examples):
gr.agg(b_sum={'B': 'sum'}, c_mean={'C': 'mean'}, c_count={'C': 'count'})
You can still build this specification as a dict of dicts, but with the inner and outer levels swapped compared with the current (deprecated) version:
gr.agg(**{'b_sum': {'B': 'sum'}, 'c_mean': {'C': 'mean'}, 'c_count': {'C': 'count'}})
(We could have an example helper function that converts an existing dict of dicts to this version.)
However, the dict is always a single {col: func}, and those multiple single-element dicts look a bit strange. So an alternative we thought of is to use tuples:
gr.agg(b_sum=('B', 'sum'), c_mean=('C', 'mean'), c_count=('C', 'count'))
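This looks a bit better, but on the other hand the {'B': 'sum'} dict is consistent with the other APIs for specifying the column on which a function is applied.
The two suggestions above (easier renaming afterwards, and keyword-based naming) are in principle orthogonal, but it could be nice to have both (or something else again, depending on further discussion).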
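A conversion helper for the tuple form could be sketched as follows. The name to_named_agg is hypothetical, and the sketch assumes a pandas version that accepts output_name=(column, aggfunc) keywords in groupby.agg (the form that was later implemented for 0.25).

```python
import pandas as pd


def to_named_agg(spec):
    """Hypothetical helper: turn {col: {name: func}} into {name: (col, func)}."""
    return {
        name: (col, func)
        for col, names in spec.items()
        for name, func in names.items()
    }


mydf = pd.DataFrame({
    "cat": ["A", "A", "A", "B", "B", "C"],
    "energy": [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    "distance": [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
})

# A relabeling dict in the old (deprecated) dict-of-dict layout
old_spec = {
    "energy": {"total_energy": "sum"},
    "distance": {"total_distance": "sum", "average_distance": "mean"},
}

result = mydf.groupby("cat").agg(**to_named_agg(old_spec))
print(result)
```

Because the output names become dict keys of a single flat mapping, two columns can no longer share an output name, which the nested layout allowed.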
Thank you for forwarding the developers' current thinking here.
I acknowledge the (in my opinion, only) drawback of the deprecated dict-of-dict approach, namely the resulting MultiIndex. It could be flattened if the user passes an additional option (yes, YAO :-/).
As I said before, I am not against the second version, as long as it remains possible to build the specification dynamically (thanks to the **{} construct, yay Python!). As such, I think the last suggestion (with dicts or tuples for the col > func mapping) is okay.
The first proposal in the previous comment could be implemented if you really want it, but my feedback is that, as a user, I would not choose it over the second alternative, because keeping two lists in sync is a pain.
Discussed at the dev meeting today.
Short summary:
- We will try to implement gr.agg(b_sum=("B", "sum"), ...): when no arg is passed to *GroupBy.agg, kwargs are interpreted as <output_name>=(<selection>, <aggfunc>).
- We would like to provide a flatten=True keyword to .agg.
Maybe this helps: my workaround for the deprecation is a pair of helper functions that replace the alias -> aggr maps with lists of correctly named functions:
def aliased_aggr(aggr, name):
if isinstance(aggr,str):
def f(data):
return data.agg(aggr)
else:
def f(data):
return aggr(data)
f.__name__ = name
return f
def convert_aggr_spec(aggr_spec):
return {
col : [
aliased_aggr(aggr,alias) for alias, aggr in aggr_map.items()
]
for col, aggr_map in aggr_spec.items()
}
With these, the old behavior can be reproduced like this:
mydf_agg = mydf.groupby('cat').agg(convert_aggr_spec({
'energy': {
'total_energy': 'sum',
'energy_p98': lambda x: np.percentile(x, 98), # lambda
'energy_p17': lambda x: np.percentile(x, 17), # lambda
},
'distance': {
'total_distance': 'sum',
'average_distance': 'mean',
'distance_mad': smrb.mad, # original function
'distance_mad_c1': mad_c1, # partial function wrapping the original function
},
}))
which is the same as:
mydf_agg = mydf.groupby('cat').agg({
'energy': [
aliased_aggr('sum', 'total_energy'),
aliased_aggr(lambda x: np.percentile(x, 98), 'energy_p98'),
aliased_aggr(lambda x: np.percentile(x, 17), 'energy_p17')
],
'distance': [
aliased_aggr('sum', 'total_distance'),
aliased_aggr('mean', 'average_distance'),
aliased_aggr(smrb.mad, 'distance_mad'),
aliased_aggr(mad_c1, 'distance_mad_c1'),
]
})
This works for me, but it probably won't work in some corner cases...
Update: I found out that the renaming is unnecessary, because tuples in an aggregation specification are interpreted as (alias, aggr). So the aliased_aggr function is not needed, and the conversion becomes:
def convert_aggr_spec(aggr_spec):
return {
col : [
(alias,aggr) for alias, aggr in aggr_map.items()
]
for col, aggr_map in aggr_spec.items()
}
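I just want to chime in here as yet another user who is really missing the ability to aggregate a column with a function and immediately rename it in the same line. I have never found myself using the MultiIndex that pandas returns: I either flatten it immediately, or I actually want to specify my column names manually because they mean something specific.
I would be happy with any of the approaches proposed here: SQL-like syntax (I actually find myself using .query() a lot in pandas already), reverting to the deprecated behavior, or any of the other suggestions. The current approach has already drawn ridicule from my colleagues who use R.
I recently even found myself using PySpark instead of pandas, even though it wasn't necessary, simply because I like the syntax so much: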
df.groupby("whatever").agg(
F.max("col1").alias("my_max_col"),
F.avg("age_col").alias("average_age"),
F.sum("col2").alias("total_yearly_payments")
)
Also, PySpark is far more convoluted to write than pandas in most cases, yet this looks so much cleaner. So I definitely appreciate that work on this is still being done :-)
I think we have an agreed syntax for this functionality; we need someone to implement it.
I am going to try to get to this for 0.25.0.
I have put up a PR at https://github.com/pandas-dev/pandas/pull/26399. The basic idea is to allow this mixture of renaming and column-specific aggregation by using **kwargs, with the understanding that the values must be (selection, aggfunc) tuples.
In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
...: 'height': [9.1, 6.0, 9.5, 34.0],
...: 'weight': [7.9, 7.5, 9.9, 198.0]})
In [3]: df
Out[3]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [4]: df.groupby('kind').agg(min_height=('height', 'min'), max_weight=('weight', 'max'))
Out[4]:
min_height max_weight
kind
cat 9.1 9.9
dog 6.0 198.0
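This has a few limitations:
- It is somewhat peculiar compared with the rest of pandas. The syntax (output_name=(selection, aggfunc)) doesn't really appear anywhere else (though .assign does use the output_name=... pattern).
- The spelling for output names that are not Python identifiers is ugly: .agg(**{'output name': (col, func)})
- It is Python 3.6+ only, or we need some ugly hacks for 3.5 and earlier, since the order of **kwargs was not previously preserved.
- The aggfunc needs to be a unary function. If your custom aggfunc needs additional arguments, you will need to partially apply it first.
There is also an implementation detail: multiple lambda aggfuncs for the same column are not yet supported, though that can be fixed later.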
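The two spellings mentioned in the list can be combined in a single call. This is a minimal sketch, assuming a pandas version that ships named aggregation (0.25 or later), where pd.NamedAgg is available as a named-tuple alternative to the plain (column, aggfunc) tuple.

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9.1, 6.0, 9.5, 34.0],
    "weight": [7.9, 7.5, 9.9, 198.0],
})

result = df.groupby("kind").agg(
    # Output name as a regular keyword, value as a NamedAgg
    min_height=pd.NamedAgg(column="height", aggfunc="min"),
    # Output name that is not a Python identifier, passed by unpacking a dict
    **{"max weight": ("weight", "max")},
)
print(result)
```

The unpacked dict can also be built dynamically, which covers the use case of generating aggregation specifications from other code.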
I suspect that most people subscribed here would support some alternative to the deprecated behavior. What do people think of this one specifically?
cc @WillAyd in case I missed any of your concerns.
Hi @TomAugspurger,
Thank you for moving this forward.
This has a few limitations
- It is somewhat peculiar compared with the rest of pandas. The syntax
(output_name=(selection, aggfunc))
doesn't really appear anywhere else (though .assign
does use the output_name=...
pattern)
I can't help feeling that this kind of argument is very similar to the one that motivated deprecating the existing implementation in the first place.
Could you share why we benefit more from this new way than from the old way, _with respect to that particular argument_?
One benefit I can already think of is that (for py3.6+) the output order of the columns can be chosen individually.
- The spelling for output names that are not Python identifiers is ugly:
.agg(**{'output name': (col, func)})
Somehow, the old way was better in this respect. But as I said before, as long as the **{...} construct can be used to build the aggregation dynamically, I am happy enough.
- It is Python 3.6+ only, or
**kwargs
ordering needs some ugly hacks for 3.5 and earlier, since it was not previously preserved.
How did it work before (with the existing dict-of-dict feature)? Was the order guaranteed in some way?
- The aggfunc needs to be a unary function. If your custom aggfunc needs additional arguments, you will need to partially apply it first.
Just to confirm my understanding: the aggfunc can be any callable that returns a valid value, right? (In addition to the commonly used string aggfuncs such as 'min',
'max',
etc.) Is there any difference from before? (That is, wasn't the unary limitation already present?)
There is also an implementation detail: multiple
lambda
aggfuncs for the same column are not yet supported, though that can be fixed later.
Yes, that one is a bit annoying, but as long as it is only a temporary limitation and can be fixed, this could work.
I suspect that most people subscribed here would support some alternative to the deprecated behavior. What do people think of this one specifically?
In any case, I think it is very important to keep the ability to aggregate and rename in one step. If the old behavior is really not an option, this alternative will do.
Could you share why we benefit more from this new way than from the old way, with respect to that particular argument?
I may be misremembering, but I believe SeriesGroupby.agg and DataFrameGroupby.agg give different meanings to the outer keys of the dictionary (is it a column selection or an output name?). With this syntax, the keyword can consistently mean the output name.
Somehow, the old way was better in this respect.
Is the difference just the **? Otherwise, I think the same limitations are shared.
How did it work before (with the existing dict-of-dict feature)? Was the order guaranteed in some way?
By sorting the keys, which is what I am doing in the PR now.
Just to confirm my understanding: the aggfunc can be any callable that returns a valid value, right?
Here is the difference:
In [21]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})
In [22]: def aggfunc(x, myarg=None):
...: print(myarg)
...: return sum(x)
...:
In [23]: df.groupby("A").agg({'B': {'foo': aggfunc}}, myarg='bar')
/Users/taugspurger/sandbox/pandas/pandas/core/groupby/generic.py:1308: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super().aggregate(arg, *args, **kwargs)
None
Out[23]:
B
foo
A
a 3
With the alternative proposal, **kwargs is reserved for the output column names. So you would need to use functools.partial(aggfunc, myarg='bar').
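Partially applying the extra argument might look like the sketch below. It assumes a pandas version with named aggregation (0.25 or later); aggfunc here is a hypothetical aggregator that adds its extra argument to the group sum, so that the effect of the bound argument is visible in the result.

```python
from functools import partial

import pandas as pd

df = pd.DataFrame({"A": ["a", "a"], "B": [1, 2], "C": [3, 4]})


def aggfunc(x, myarg=0):
    # Hypothetical aggregator that takes one extra argument
    return x.sum() + myarg


# Keywords passed to agg are read as output names, so myarg is bound beforehand
result = df.groupby("A").agg(foo=("B", partial(aggfunc, myarg=10)))
print(result)
```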
Ah, OK. I think the proposed approach is fine for a first iteration (and it will really be OK as a replacement as soon as the limitation on multiple lambdas is removed).