Pandas: BUG: in _nsorted for frame with duplicated values index

Created on 9 Jun 2016  ·  5 Comments  ·  Source: pandas-dev/pandas

The function below is implemented incorrectly. If the frame's index contains duplicated values, the result has more than n rows and is not properly sorted. As a result, nsmallest and nlargest on a DataFrame do not return a correct frame in this case.

def _nsorted(self, columns, n, method, keep):
    if not com.is_list_like(columns):
        columns = [columns]
    columns = list(columns)
    ser = getattr(self[columns[0]], method)(n, keep=keep)
    ascending = dict(nlargest=False, nsmallest=True)[method]
    return self.loc[ser.index].sort_values(columns, ascending=ascending,
                                           kind='mergesort')
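
A minimal sketch of why the label-based lookup in the last statement can return extra rows, assuming the small frame used in the first comment of this thread. It only illustrates how `.loc` behaves with duplicated labels; it is not the pandas source.

```python
import pandas as pd

# Frame whose index labels are duplicated (0 and 1 each appear twice).
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])

# Series.nlargest picks exactly one row, the one with the largest 'a'.
ser = df['a'].nlargest(1)

# Its index label (1) is shared by two rows of the frame, so selecting
# by label brings both of them back instead of just the one we wanted.
selected = df.loc[ser.index]

print(len(ser))       # number of rows chosen by the Series method
print(len(selected))  # number of rows returned by the label lookup
```

The Series step is correct on its own; the extra rows appear only when the chosen labels are mapped back onto the frame.
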
Bug

All 5 comments

Indeed:

In [71]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[4,3,2,1]}, index=[0,0,1,1])

In [72]: df.nlargest(1, 'a')
Out[72]:
   a  b
1  4  1
1  3  2

In [73]: df.nlargest(2, 'a')
Out[73]:
   a  b
1  4  1
1  4  1
1  3  2
1  3  2

(@Tux1 side note for future reference, it is always nice to provide a small reproducible example when opening an issue)
Interested in doing a PR to fix this?

Yes, I will fix that soon.
Sorry about the missing example.

(Reply sent by email, quoting Joris Van den Bossche's comment above, written on 9 June 2016 at 23:30.)

My fix is not very elegant, but I don't see any other way to handle both a MultiIndex and an index with duplicated values.
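
For illustration only, here is one way to sidestep duplicated labels and MultiIndex handling: select rows by position rather than by label. This is a sketch under stated assumptions, not the fix referred to in the comment above. The helper name `nsorted_by_position` is hypothetical, and it handles a single column only.

```python
import pandas as pd


def nsorted_by_position(df, column, n, method="nlargest", keep="first"):
    """Hypothetical helper: n largest/smallest rows, selected by position."""
    # Give every row a unique positional label (0..len-1) so the result
    # of Series.nlargest / Series.nsmallest identifies rows unambiguously.
    positional = df[column].reset_index(drop=True)
    positions = getattr(positional, method)(n, keep=keep).index.to_numpy()
    # iloc selects by position, so duplicated or multi-level index
    # labels on the original frame do not matter here.
    return df.iloc[positions]


df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])
largest = nsorted_by_position(df, 'a', 2)
smallest = nsorted_by_position(df, 'a', 2, method="nsmallest")
print(largest)
print(smallest)
```

Because the Series methods already return their result in sorted order, selecting by position in that order keeps the rows sorted without a second sort.
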

Sum seems to work fine in 0.19.2, but with count the result doesn't make sense: the frame gets repeated as many times as n. Is that a bug, or am I doing something wrong?

df.groupby(['a']).agg({'b':'count'}).nlargest(2, 'b')

    b
a   
1   1
2   1
3   1
4   1
1   1
2   1
3   1
4   1
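
A possible workaround, sketched under the assumption that `df` is the frame from the first comment (the question does not show how it was built): sort explicitly and take the first n rows. The pandas documentation describes `nlargest` as equivalent to this, only faster.

```python
import pandas as pd

# Assumed input: the frame from the first comment in this thread.
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1]}, index=[0, 0, 1, 1])

counts = df.groupby(['a']).agg({'b': 'count'})

# Sort by the count in descending order, then keep the first two rows.
# A stable sort (mergesort) keeps tied rows in their original order.
top2 = counts.sort_values('b', ascending=False, kind='mergesort').head(2)
print(top2)
```

Here every group has a count of 1, so all rows tie; the point is only that the result has exactly two rows.
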