Pandas: BUG: _nsorted for frame with duplicated values index

Created on 9 Jun 2016  ·  5 comments  ·  Source: pandas-dev/pandas

The function below is implemented incorrectly. If the frame has an index with duplicated values, the result contains more than n rows and is not properly sorted. As a result, nsmallest and nlargest for DataFrame do not return the correct frame in this particular case.

def _nsorted(self, columns, n, method, keep):
    if not com.is_list_like(columns):
        columns = [columns]
    columns = list(columns)
    ser = getattr(self[columns[0]], method)(n, keep=keep)
    ascending = dict(nlargest=False, nsmallest=True)[method]
    return self.loc[ser.index].sort_values(columns, ascending=ascending,
                                           kind='mergesort')
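
To see where the extra rows come from, here is a minimal sketch that repeats the label-based lookup from `_nsorted` by hand. The frame is an assumed example with a duplicated index; only the lookup step is reproduced, not the whole method.

```python
import pandas as pd

# Assumed example frame: each index label occurs twice.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])

# Series.nlargest returns exactly n elements, carrying their index labels.
top1 = df['a'].nlargest(1)
top2 = df['a'].nlargest(2)

# The lookup is by label, so every row that shares a selected label comes back.
rows_for_top1 = df.loc[top1.index]
rows_for_top2 = df.loc[top2.index]

print(len(top1), len(rows_for_top1))
print(len(top2), len(rows_for_top2))
```

The Series step is fine; the row count only grows once the labels are handed back to `.loc` on a non-unique index.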

All 5 comments

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
   a  b
1  4  1
1  3  2

In [73]: df.nlargest(2, 'a')
Out[73]:
   a  b
1  4  1
1  4  1
1  3  2
1  3  2

(Side note for future reference, @Tux1: it is always good to provide a small reproducible example when opening an issue.)
Would you be interested in submitting a PR to fix this?

Yes, I will fix it soon.
Sorry about the example.

My fix is not very elegant, but I have no other solution that handles both a MultiIndex and an index with duplicated values.
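
For illustration only, one way to avoid the label lookup altogether is to sort the whole frame with a stable sort and keep the first n rows by position. This is a sketch under that assumption, not the fix submitted for this issue; `nlargest_by_position` is a hypothetical helper name, and it sorts every row, so it is slower than a partial selection on large frames.

```python
import pandas as pd


def nlargest_by_position(frame, n, columns):
    """Hypothetical helper: top-n rows without any label-based lookup."""
    if isinstance(columns, str):
        columns = [columns]
    # Sort all rows in descending order, then slice positionally.
    # head() never consults the index, so duplicated labels or a
    # MultiIndex cannot multiply the rows.
    ordered = frame.sort_values(list(columns), ascending=False, kind='mergesort')
    return ordered.head(n)


# Assumed example frame with a duplicated index.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])
top1 = nlargest_by_position(df, 1, 'a')
top2 = nlargest_by_position(df, 2, 'a')
print(top2)
```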

Sum seems to work fine in .19.2, but with count the result does not make sense: the df is repeated "n" times. Is that a bug, or am I doing something wrong?

df.groupby(['a']).agg({'b':'count'}).nlargest(2, 'b')

a   
1   1
2   1
3   1
4   1
1   1
2   1
3   1
4   1
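
A small script to check whether an installed pandas version shows this repetition. The input frame is not given in the comment above, so the one here is an assumption chosen to produce four groups with a count of 1 each, i.e. all ties:

```python
import pandas as pd

# Assumed input: four distinct values of 'a', so every group has count 1.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]})

counts = df.groupby(['a']).agg({'b': 'count'})
result = counts.nlargest(2, 'b')

# nlargest(2, ...) should never return more than 2 rows, ties or not.
print(len(counts), len(result))
```

If the second number printed is larger than 2, the installed version still repeats rows for tied values.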

@shankararul see: https://github.com/pandas-dev/pandas/issues/15297

์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰