Numpy: BUG: numpy.percentile output is not sorted

Created on 12 Oct 2019 · 16 Comments · Source: numpy/numpy

The output of numpy.percentile is not always sorted

Reproducing code example:

import numpy as np

data = np.array([0, 1, 1, 2, 2, 3, 3, 4, 5, 5, 1, 1, 9, 9, 9, 8, 8, 7]) * 0.1
q = np.arange(0, 1, 0.01) * 100
percentile = np.percentile(data, q)
# Compare element-wise against a sorted copy; any False marks an out-of-order value.
equals_sorted = np.sort(percentile) == percentile
print(equals_sorted)
assert equals_sorted.all()

Error message:

[ True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True True True True True True True True
True True True True True False False True True True True False
True True True False]
AssertionError Traceback (most recent call last)
in
1 q = np.percentile(np.array([0, 1, 1, 2, 2, 3, 3 , 4, 5, 5, 1, 1, 9, 9 ,9, 8, 8, 7]) * 0.1, np.arange(0, 1, 0.01) * 100)
2 equals_sorted = np.sort(q) == q
----> 3 assert equals_sorted.all()

AssertionError:

Numpy/Python version information:

1.17.2 3.6.8 (v3.6.8:3c6b436a57, Dec 24 2018, 02:04:31)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]

00 - Bug numpy.lib good first issue

Source

A4Vision

Most helpful comment

Hey, there seems to have been an update to one of the stackexchange answers provided by @eric-wieser with a good alternative interpolation.
The thread includes a proof of monotonicity, and the proposed fix appears to address all of the issues mentioned.
If this would make sense for the issue, I would be willing to implement this as a first commit, or someone else could try it.

arthertz on 9 Dec 2019

👍4

All 16 comments

Why would you expect it to be sorted? Percentile is elementwise - the outputs are in the order of the inputs.

eric-wieser on 12 Oct 2019

Hi !
Indeed, percentile is element-wise with respect to q, which in our case is
np.arange(0, 1, 0.01) * 100.
I expect the output to be sorted because q is sorted.

A4Vision on 12 Oct 2019

👍2

There are some numerical errors within a single ULP, that differ for different inputs with the same output value. I doubt there is anything to be done about that.

seberg on 12 Oct 2019

A slightly reduced failing case:

In [40]: np.percentile(np.array([0, 1, 1, 2, 2, 3, 3 , 4, 5, 5, 1, 1, 9, 9 ,9, 8, 8, 7]) * 0.1, [89, 90, 95, 96, 98, 99])
Out[40]: array([0.9, 0.9, 0.9, 0.9, 0.9, 0.9])

In [41]: np.diff(_)
Out[41]:
array([-1.11022302e-16,  2.22044605e-16, -1.11022302e-16,  1.11022302e-16,
       -1.11022302e-16])

here showing non-sorted-ness via the diff.
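To put the size of these differences in context, the following minimal sketch expresses them as multiples of the float64 spacing near 0.9. The `diffs` values are copied from the printed output above, so they carry only the printed precision; `np.spacing` is used here as a convenient way to get the gap to the next representable value.

```python
import numpy as np

# Differences copied from the np.diff output above (printed precision only).
diffs = np.array([-1.11022302e-16, 2.22044605e-16, -1.11022302e-16,
                  1.11022302e-16, -1.11022302e-16])

# Gap between 0.9 and the next representable float64 (one ULP at 0.9).
ulp = np.spacing(0.9)

# Express each difference as a whole number of ULPs.
steps = np.rint(diffs / ulp).astype(int)
print(ulp, steps)
```

Each difference comes out as one or two ULPs, which is consistent with the individual results being off by at most one ULP from 0.9.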

I think there probably is something we can do about this. I think this comes down to the stability of these lines, which perform a lerp operation (essentially add(v_below*weights_below, v_above*weights_above)):

https://github.com/numpy/numpy/blob/b9fa88eec62e34e906689408096beb2450830d9a/numpy/lib/function_base.py#L3907-L3908

https://github.com/numpy/numpy/blob/b9fa88eec62e34e906689408096beb2450830d9a/numpy/lib/function_base.py#L3928-L3929

https://github.com/numpy/numpy/blob/b9fa88eec62e34e906689408096beb2450830d9a/numpy/lib/function_base.py#L3939-L3942

There are a bunch of tradeoffs to be made when linearly interpolating floating point values, but I suspect that there's a "correct" choice here, and we just haven't made it.

Some more background here: https://math.stackexchange.com/questions/907327/accurate-floating-point-linear-interpolation
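The weighted-sum lerp described above can be tried in isolation. This is a minimal sketch, not NumPy's code: `lerp_weighted` is a hypothetical helper that mirrors add(v_below*weights_below, v_above*weights_above), and the grid of weights is an arbitrary choice. It interpolates between two identical endpoints and counts how often the result is not exactly that endpoint.

```python
def lerp_weighted(v_below, v_above, weight_above):
    """Weighted-sum form: v_below*(1 - w) + v_above*w (illustrative only)."""
    weight_below = 1.0 - weight_above
    return v_below * weight_below + v_above * weight_above


value = 0.9
weights = [i / 1000 for i in range(1000)]  # arbitrary grid of weights in [0, 1)

# Interpolating between two identical endpoints should ideally return them.
results = [lerp_weighted(value, value, w) for w in weights]
mismatches = [w for w, r in zip(weights, results) if r != value]
max_error = max(abs(r - value) for r in results)

print(f"{len(mismatches)} of {len(weights)} weights do not return {value} exactly")
print(f"largest deviation: {max_error}")
```

Any deviations are on the order of the float64 rounding error, but that is enough to break the ordering of the outputs.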

eric-wieser on 14 Oct 2019

Yeah, I agree, +1 on reorganizing the operations so that it is strictly monotonic (numerically). Would be good if it is also no worse, or at least almost identical precision wise. I am sure we really do not have to worry about a few extra operations/speed here.

EDIT: Marked as good first issue. This is _only_ a good first issue if you are willing to dive into the intricacies of IEEE floating point numbers. But after that, this is probably a fairly straightforward reorganization within Python code.

seberg on 14 Oct 2019

I would be interested in taking on this issue. I was looking at some of the failing cases and noticed that they all involved linearly interpolating between the same number. For example, in Eric's reduced case, all of the listed percentiles fall between two 9s (0.9 after the 0.1 scaling). The linear interpolation between them must therefore be exactly that value, correct? Fixing the case of interpolating between two identical numbers seems like it would deal with the issues presented in this bug without a noticeable hit in performance. If, however, we want to ensure that the linear interpolation is always monotonic, we can do that, but it will require a piecewise function, which I would expect to decrease performance.

ngonzo95 on 16 Oct 2019

@ngonzo95 there should be a way to spell the arithmetic of the interpolation differently to achieve this, i.e. change/rearrange the formula that is used for the calculation (so that it is mathematically identical, but numerically guarantees monotonicity). No piecewise calculation should be necessary.

seberg on 16 Oct 2019

No piecewise calculation should be necessary.

It depends what your requirements on lerp are. Some that we may or may not care about:

monotonic ((lerp(a, b, t1) - lerp(a, b, t0)) * (b - a) * (t1 - t0) >= 0)
bounded (a <= lerp(a, b, t) <= b)
symmetric (lerp(a, b, t) == lerp(b, a, 1-t))

(0 <= t <= 1)
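These three requirements can be written as small predicates and checked empirically. The sketch below is only an illustration: all function names are hypothetical, `lerp_weighted` stands in for the weighted-sum form discussed earlier in the thread, the bounded check assumes a <= b, and a grid search can find violations but cannot prove their absence.

```python
def lerp_weighted(a, b, t):
    """Weighted-sum form a*(1 - t) + b*t (illustrative, not NumPy's code)."""
    return a * (1.0 - t) + b * t


def is_monotonic(lerp, a, b, t0, t1):
    """(lerp(a, b, t1) - lerp(a, b, t0)) * (b - a) * (t1 - t0) >= 0"""
    return (lerp(a, b, t1) - lerp(a, b, t0)) * (b - a) * (t1 - t0) >= 0


def is_bounded(lerp, a, b, t):
    """a <= lerp(a, b, t) <= b, assuming a <= b."""
    return a <= lerp(a, b, t) <= b


def is_symmetric(lerp, a, b, t):
    """lerp(a, b, t) == lerp(b, a, 1 - t)"""
    return lerp(a, b, t) == lerp(b, a, 1.0 - t)


def check_properties(lerp, a, b, steps=1000):
    """Count grid points of t in [0, 1] that violate each property."""
    ts = [i / steps for i in range(steps + 1)]
    return {
        "monotonic": sum(not is_monotonic(lerp, a, b, t0, t1)
                         for t0, t1 in zip(ts, ts[1:])),
        "bounded": sum(not is_bounded(lerp, a, b, t) for t in ts),
        "symmetric": sum(not is_symmetric(lerp, a, b, t) for t in ts),
    }


violations = check_properties(lerp_weighted, 0.1, 0.7)
print(violations)
```

Passing a different candidate formulation as `lerp` makes it easy to compare the tradeoffs between them on the same grid.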

eric-wieser on 16 Oct 2019

Oh OK, I did not expect piecewise to be necessary, but I guess I do not know the intricacies of this well enough.

seberg on 16 Oct 2019

Looking into it more, I discovered that the function a + (b-a)*t is both monotonic (as defined above) and consistent (lerp(a, a, t) == a). I believe this should be sufficient for the function's requirements. It seems one of the main drawbacks of this function is that lerp(a, b, 1) != b. However, I think the way we are calculating weights ensures that 0 <= t < 1.
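A quick way to try the a + (b-a)*t formulation is sketched below. `lerp_offset` is a hypothetical name, and the endpoints and weight grid are arbitrary choices; this checks the two claimed properties on one example and is not a proof.

```python
def lerp_offset(a, b, t):
    """Offset form: a + (b - a) * t."""
    return a + (b - a) * t


weights = [i / 1000 for i in range(1000)]  # increasing weights in [0, 1)

# Consistency: with identical endpoints, b - a is exactly 0.0, so the
# product is 0.0 and the result is the endpoint itself.
value = 0.9
consistent = all(lerp_offset(value, value, w) == value for w in weights)

# Monotonicity: for increasing weights the outputs never decrease.
a, b = 0.8, 0.9
results = [lerp_offset(a, b, w) for w in weights]
is_sorted = all(x <= y for x, y in zip(results, results[1:]))

print(consistent, is_sorted)
```

The consistency case is the one that matters for the failing examples in this issue, since they all interpolate between two equal values.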

ngonzo95 on 16 Oct 2019

It seems one of the main drawbacks of this function is that lerp(a, b, 1) != b. However, I think the way we are calculating weights ensures that 0 <= t < 1.

Note that unfortunately lerp(a, b, 1-eps) > b is possible with that formulation.

eric-wieser on 17 Oct 2019

I am new to open source.
I wanted to solve this as my good first issue. How can I contribute? Are there any prerequisites?

anshulshankar on 12 Nov 2019

I was looking at some of the failing cases and noticed that they all involved linearly interpolating between the same number

In scikit-learn, we recently stumbled in this issue: https://github.com/scikit-learn/scikit-learn/issues/15733

Since we expect q to be strictly increasing, we can apply np.maximum.accumulate to force the resulting array to be non-decreasing. However, it would be great if we could solve the issue in NumPy directly. Is there anywhere we can dig in to find a good fix?
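The workaround mentioned above could look roughly like the following. This is a sketch using the data from this issue, not the actual scikit-learn code; it assumes q is increasing, and the variable names are illustrative.

```python
import numpy as np

data = np.array([0, 1, 1, 2, 2, 3, 3, 4, 5, 5, 1, 1, 9, 9, 9, 8, 8, 7]) * 0.1
q = np.arange(0, 1, 0.01) * 100  # increasing percentiles

percentiles = np.percentile(data, q)

# Running maximum: each entry becomes the largest value seen so far,
# so any small dip caused by rounding is flattened out.
monotone = np.maximum.accumulate(percentiles)

is_sorted = bool(np.all(np.diff(monotone) >= 0))
max_change = float(np.max(np.abs(monotone - percentiles)))
print(is_sorted, max_change)
```

The running maximum only ever raises a value to match an earlier one, so the adjustment stays at the scale of the rounding error it is papering over.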