The function below is implemented incorrectly. If the frame has an index with duplicated values, the result has more than n rows and is not properly sorted. So nsmallest and nlargest for DataFrame do not return a correct frame in this particular case.
def _nsorted(self, columns, n, method, keep):
    if not com.is_list_like(columns):
        columns = [columns]
    columns = list(columns)
    ser = getattr(self[columns[0]], method)(n, keep=keep)
    ascending = dict(nlargest=False, nsmallest=True)[method]
    return self.loc[ser.index].sort_values(columns, ascending=ascending,
                                           kind='mergesort')
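To make the failure mode concrete, here is a minimal sketch of the two steps the function performs, run on a small frame with a duplicated index. The frame and the variable names `ser` and `picked` are only illustrative. The Series step appears to select the right value; the extra rows seem to come from the label-based lookup `self.loc[ser.index]`, which matches every row carrying that label.

```python
import pandas as pd

# Illustrative frame: labels 0 and 1 each appear twice in the index.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])

# Step 1: the Series method picks exactly one value (a == 4, labelled 1).
ser = df['a'].nlargest(1)

# Step 2: looking that label up in the frame matches BOTH rows labelled 1,
# so the frame-level result ends up with more than n rows.
picked = df.loc[ser.index]

print(len(ser), len(picked))
```

With a unique index the two lengths would agree, which is probably why the problem only shows up with duplicated labels.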
Indeed:
In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
   a  b
1  4  1
1  3  2

In [73]: df.nlargest(2, 'a')
Out[73]:
   a  b
1  4  1
1  4  1
1  3  2
1  3  2
(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?
Yes, I will fix that soon.
Sorry about the missing example.
(In reply to Joris Van den Bossche's comment of 9 June 2016, 23:30.)
My fix is not very elegant, but I don't see any other way to deal with both a MultiIndex and an index with duplicated values.
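One direction that sidesteps labels entirely is to select by position. The following is only a sketch under stated assumptions, not the fix from the PR: `nsorted_by_position` is a hypothetical helper, it handles a single column only, and it relies on `reset_index(drop=True)` turning the labels into row positions so that `iloc` can be used for the lookup.

```python
import pandas as pd


def nsorted_by_position(df, column, n, method='nlargest', keep='first'):
    """Hypothetical helper: n largest/smallest rows of df by one column.

    Works on row positions, so duplicated labels or a MultiIndex on df
    do not affect which rows are returned.
    """
    # Dropping the index makes the Series labels equal to row positions.
    ser = df[column].reset_index(drop=True)
    # The Series method returns the selected values already in sorted order.
    picked = getattr(ser, method)(n, keep=keep)
    # Positional lookup returns exactly one row per selected position.
    return df.iloc[picked.index.to_numpy()]


df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])
largest = nsorted_by_position(df, 'a', 2)
smallest = nsorted_by_position(df, 'a', 2, method='nsmallest')
print(largest)
print(smallest)
```

Sorting by several columns, as the original function does with `sort_values(columns, ...)`, would need an extra step that this sketch leaves out.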
Sum seems to work fine in 0.19.2, but with count the result doesn't make sense: the frame gets repeated as many times as "n". Is that a bug, or am I doing something wrong?
df.groupby(['a']).agg({'b':'count'}).nlargest(2, 'b')
   b
a
1  1
2  1
3  1
4  1
1  1
2  1
3  1
4  1
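As a possible workaround on an affected version, a stable sort followed by `head(n)` should return exactly n rows even when every count is tied. This is a sketch only: the input frame is an assumption (the comment above does not show how `df` was built), and `counts` and `top2` are illustrative names.

```python
import pandas as pd

# Assumed input: four distinct values of 'a', so every group has count 1.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]})

counts = df.groupby(['a']).agg({'b': 'count'})

# Stable descending sort, then keep the first two rows.
top2 = counts.sort_values('b', ascending=False, kind='mergesort').head(2)
print(top2)
```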
@shankararul see: https://github.com/pandas-dev/pandas/issues/15297