Numpy: Restructure percentile methods

Created on 12 Mar 2018 · 53 Comments · Source: numpy/numpy

As exemplified in the Wikipedia page: https://en.wikipedia.org/wiki/Percentile#The_nearest-rank_method

Labels: 00 - Bug, 01 - Enhancement, high

All 53 comments

I think this already exists? Using the Wikipedia example:

>>> np.percentile([15, 20, 35, 40, 50], [5, 30, 40, 50, 100], interpolation='lower')
array([15, 20, 20, 35, 50])

It does not. Look at example 2 on the Wikipedia page:

>>> np.percentile([3, 6, 7, 8, 8, 10, 13, 15, 16, 20], [25,50,75,100], interpolation='lower')
array([ 7,  8, 13, 20])

When it should be [7, 8, 15, 20].

It similarly fails in the third example.

"Nearest-rank" sounds a lot like "nearest"? Though there is always the additional question of how exactly the boundaries work.
EDIT: That is, where exactly are 0 and 100 considered to be, at the data point or before the data point? (That is IIRC; anyway, there are a lot of annoying complexities here.)

I don't want to read the whole page, but I think the difference might be the C parameter further down, so if someone who knows this wants to add it....

Frankly, I think adding the C parameter would likely be a real improvement. But mostly, better documentation would be nice, and someone who really knows this stuff is needed....

I don't know if this has anything to do with the C-parameter, although I agree that the option of choosing it could be desirable.

I have found another thread that incidentally brought up this issue (Dec. 2016). It seems that the algorithm I am looking for (which Wikipedia calls nearest-rank) is mentioned in this commonly cited paper by Hyndman & Fan (H&F) as being the oldest and most studied definition of percentile (it was the one I learned in my stats course). It is a discontinuous function, so I think the parameter C does not apply here (I may be wrong).
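
For concreteness, that nearest-rank rule is just "rank = ceil(p/100 * N), 1-based". A minimal sketch on the second Wikipedia example:

import math

data = sorted([3, 6, 7, 8, 8, 10, 13, 15, 16, 20])
for p in (25, 50, 75, 100):
    rank = math.ceil(p / 100 * len(data))   # ranks 3, 5, 8, 10
    print(p, data[rank - 1])                # -> 7, 8, 15, 20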

Here is how it looks against the other options provided by numpy that intuitively seem to compute a similar thing (i.e., 'lower', 'nearest'):

[Image: nearest-rank (H&F) percentiles plotted against numpy's 'lower' and 'nearest' interpolation options]

At first sight this looks exactly like the C parameter to me: the 'nearest' curve is more stretched than the H&F curve, which is expected since numpy uses C=1 and apparently H&F uses C=0.

If you want proof, repeat the whole thing with the same values repeated 1000 times; my guess is they will converge.
EDIT: Or maybe not, I don't have the patience or time to really think about it. But I still think it is the C parameter Wikipedia mentions, so please prove me wrong :)

A graph like that would be a great addition to the percentile docs

edit: preferably one showing the open/closedness of the discontinuities

Note to readers: To keep this thread manageable, I've marked all discussions below about adding this graph to the docs as "resolved". The graph is now at the bottom of https://numpy.org/devdocs/reference/generated/numpy.percentile.html.

@eric-wieser I don't mind making that graph. I will come back with something later today, should I post it in here?

@seberg I will be honest here, I don't know how the interpolation is being calculated based on the C-parameter. What makes me think it is not related is that the C-parameter is only discussed in the linear interpolation section (Wikipedia), and both the Wikipedia page and the Hyndman & Fan paper discuss the algorithm I requested in separate sections from the interpolation ones.

I don't know if there are any interpolation parameters that always give the same results as the algorithm I am interested in.

Even if there are, should this be the way to get to it? Changing a 'strange' parameter to get the most common definition of percentile does not seem like the best way to implement it, imho.

@ricardoV94, maybe, but you can't just change the defaults, no matter how bad they are. We could expose something like method="H&F" to override both parameters at once.

The C parameter is where you define 0% and 100% to be with respect to the data points (on the data point or not, etc.). The parameter C on Wikipedia may well be discussed only for interpolation, but I am sure the same issue causes the difference here. C is a dubious name of course; a proper name might be something like range='min-max' or range='extrapolated', or probably something completely different. As I said, redo the plots with many, many data points (possibly with tiny noise) and I think you will see them converge, since the choice of range matters less and less.
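
As a rough numeric check of that convergence claim (my own sketch, comparing the C=1 and C=0 placements of the k-th of n points):

import numpy as np

for n in (5, 50, 5000):
    k = np.arange(1, n + 1)
    # percentile assigned to the k-th point: (k-1)/(n-1) for C=1 vs k/(n+1) for C=0
    gap = np.abs((k - 1) / (n - 1) - k / (n + 1))
    print(n, 100 * gap.max())   # the maximum gap is 100/(n+1) percentile points, so it vanishes for large n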

@seberg I am fine with method="H&F" or maybe method="classic". Interpolation="none" could also make sense.

I'm not sure what the mechanism for including images in the docs is, or if there's any precedent for doing it.

I know you can run matplotlib code within the docs, which is how we do it elsewhere - which also ensures it remains synced to reality.

Okay, I will think of the best code-image in that case.

The most problematic part is the open/closed markers for the discontinuities, since matplotlib does not have a built-in function for those (afaik). Hard-coding them would make little sense in that case.

Maybe skip those for now then. It would be nice if matplotlib had some automatic support for those.

Hopefully someone will have a better suggestion, that is still elegant regarding the discontinuity.

import numpy as np
import matplotlib.pyplot as plt

a = [0,1,2,3]
p = np.arange(101)

plt.step(p, np.percentile(a, p, interpolation='linear'), label='linear')
plt.step(p, np.percentile(a, p, interpolation='higher'), label='higher', linestyle='--')
plt.step(p, np.percentile(a, p, interpolation='lower'), label='lower', linestyle='--')
plt.step(p, np.percentile(a, p, interpolation='nearest'), label='nearest', linestyle='-.',)
plt.step(p, np.percentile(a, p, interpolation='midpoint'), label='midpoint', linestyle='-.',)

plt.title('Interpolation methods for list: ' + str(a))
plt.xlabel('Percentile')
plt.ylabel('List item returned')
plt.yticks(a)
plt.legend()

[Image: resulting step plot of the five interpolation methods for the list [0, 1, 2, 3]]

I think interpolation='linear' should be a regular line, not a stepped one, but otherwise it looks good. Can you make a PR adding that to the docs?

In fact, step is causing misleading artefacts generally, so I'd be inclined to avoid it. linspace(0, 100, 60) would produce more accurate intermediate coordinates too
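
Something along these lines, for example (just a sketch, keeping step only for the discrete methods):

import numpy as np
import matplotlib.pyplot as plt

a = [0, 1, 2, 3]
p = np.linspace(0, 100, 60)

plt.plot(p, np.percentile(a, p, interpolation='linear'), label='linear')
plt.step(p, np.percentile(a, p, interpolation='higher'), label='higher', linestyle='--')
plt.step(p, np.percentile(a, p, interpolation='lower'), label='lower', linestyle='--')
plt.step(p, np.percentile(a, p, interpolation='nearest'), label='nearest', linestyle='-.')
plt.step(p, np.percentile(a, p, interpolation='midpoint'), label='midpoint', linestyle='-.')

plt.title('Interpolation methods for list: ' + str(a))
plt.xlabel('Percentile')
plt.ylabel('List item returned')
plt.yticks(a)
plt.legend()
plt.show()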

I have no idea how to make a PR.

Feel free to do it with your account, adding or discussing the suggested changes.

I think you can change C with something like this (test it out on something). Call the function on your percentiles, then plug it into the numpy version (which uses C=1, which is a no-op except correcting out of bound percentiles right now):

import numpy as np

def scale_percentiles(p, num, C=0):
     """
     p : float
          percentiles to be used (within 0 and 100 inclusive)
     num : int
         number of data points.
     C : float
          parameter C, should be 0, 0.5 or 1. Numpy uses 1, matlab 0.5, H&F is 0.
     """
     p = np.asarray(p)
     fact = (num-1.+2*C)/(num-1)
     p *= fact
     p -= 0.5 * (fact-1) * 100
     p[p < 0] = 0
     p[p > 100] = 100
     return p

And voila, with "nearest" you will get your "H&F" and with linear you will get the plot from Wikipedia (unless I got something wrong, but I am pretty sure I got it right).

As I said, the difference is where you place the data points from 0-100 (evenly) with respect to the last point. For C=1 you put min(data) at the 0th percentile, etc. I have no clue about "what makes more sense"; it probably depends a bit on the general view. The name inclusive for C=1 and exclusive for C=0 makes a bit of sense I guess (when you think about the total range of percentiles, since with exclusive the possible range lies outside the data range). C=1/2 is also exclusive in that sense, though.
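
For illustration, a small sketch using the parameterisation p_k = (k - C) / (n + 1 - 2C), which as far as I can tell covers the C=0, C=1/2 and C=1 variants from the Wikipedia page:

import numpy as np

def plotting_positions(n, C):
    # Percentile assigned to the k-th of n ordered data points, scaled to 0-100
    k = np.arange(1, n + 1)
    return 100 * (k - C) / (n + 1 - 2 * C)

for C in (1.0, 0.5, 0.0):
    print(C, np.round(plotting_positions(4, C), 1))
# C=1.0 -> [  0.   33.3  66.7 100. ]   i.e. p_k = (k-1)/(n-1), numpy's current 'linear'
# C=0.5 -> [12.5 37.5 62.5 87.5]       i.e. p_k = (k-0.5)/n
# C=0.0 -> [20. 40. 60. 80.]           i.e. p_k = k/(n+1)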

I would be for adding the C parameter, but I would want someone to come up with a descriptive name if possible. I would also not mind something like a "method" argument or so to make the best defaults obvious (a combination of interpolation + C). Or we basically decide that most combinations are never used and not useful, fine then....

In the end my problem is: I want a statistician to tell me which methods have consensus (R has some stuff, but the last time someone came around here it was just a copy-paste of the R doc or similar without setting it into a numpy context at all; needless to say, it was useless for a general audience, citing papers would have been more helpful).

I don't want to read that H&F paper (honestly it also does not look very slick to read), but I think you could look at it from a support point of view too. The numpy "nearest" (or any other) version does not have identical support (in the percentiles) for each data point; H&F has equal support for "nearest", and maybe for midpoint it would be C=1/2, not sure.
I keep repeating myself, but I don't know if such a support argument (against C=1 as numpy uses it) is actually a real reason.

EDIT: midpoint has equal support (for the area in between data points, not for the point itself) in numpy, so with "C=1"

@seberg It does not seem to work for me. Can you post your code showing it working?

Well, I got the sign wrong in that code up there, so it was the opposite (C=0 is the no-op, not C=1):

import numpy as np
import matplotlib.pyplot as plt

def scale_percentiles(p, num, C=0):
     """
     p : float
          percentiles to be used (within 0 and 100 inclusive)
     num : int
         number of data points.
     C : float
          parameter C, should be 0, 0.5 or 1. Numpy uses 1, matlab 0.5, H&F is 0.
     """
     p = np.asarray(p)
     fact = (num+1.-2*C)/(num-1)
     p *= fact
     p -= 0.5 * (fact-1) * 100
     p[p < 0] = 0
     p[p > 100] = 100
     return p

plt.figure()
plt.plot(np.percentile([0, 1, 2, 3], scale_percentiles(np.linspace(0, 100, 101), 4, C=0), interpolation='nearest'))
plt.plot(np.percentile([0, 1, 2, 3], scale_percentiles(np.linspace(0, 100, 101), 4, C=1), interpolation='nearest'))
plt.figure()
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=1), interpolation='linear'))
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=0.5), interpolation='linear'))
plt.plot(np.percentile([15, 20, 35, 40, 50], scale_percentiles(np.linspace(0, 100, 101), 5, C=0), interpolation='linear'))

@seberg Close but not there yet. For a = [0, 1, 2, 3] and percentiles = [25, 50, 75, 100], np.percentile(a, scale_percentiles(percentiles, len(a), C=0), interpolation='nearest') returns [0, 2, 3, 3], when it should return [0, 1, 2, 3].

I had to make the percentiles list dtype=np.float or your function would give an error, but I don't think that is the issue.

The rule for the classical method is simple:
Percentile / 100 * N --> if it is a whole number, that is the (1-based) rank; if not, use the ceiling as the rank.

Despite that, the C argument seems to be working as expected, so it could be implemented if people want to use it for the interpolation. I still would like a method='classic' or interpolation='none' that works like the Wikipedia one.

For debugging, this is my ugly non-numpy implementation of the classical method:

def percentile(arr, p):
    arr = sorted(arr)

    index = p / 100 * len(arr)

    # If index is a whole number and larger than zero, subtract one unit (due to 0-based indexing)
    if index % 1 < 0.0001 and index // 1 > 0:
        index -= 1

    return arr[int(index)]

and a more numpythonic one:

def indexes_classic(percentiles, set_size):
    percentiles = np.asarray(percentiles)

    # 0-based index: ceil(p/100 * N) - 1, i.e. floor for fractional values
    # and "subtract one" for whole numbers
    indexes = percentiles / 100 * set_size
    indexes[np.isclose(indexes % 1, 0)] -= 1
    indexes = np.asarray(indexes, dtype=int)
    indexes[indexes < 0] = 0
    indexes[indexes >= set_size] = set_size - 1

    return indexes
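
A quick check of that function against the second Wikipedia example:

data = np.sort([3, 6, 7, 8, 8, 10, 13, 15, 16, 20])
print(data[indexes_classic([25, 50, 75, 100], data.size)])
# -> [ 7  8 15 20], the nearest-rank values from Wikipedia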

Those differences sound like floating point/rounding issues (which you seem aware of), and maybe my guess of C=0 was wrong and you want C=0.5. My point was to say where the difference comes from (the "C parameter" IMO, though there are probably good reasons to dislike many combinations). It was not to give you/implement a workaround.

As to the "classical" method, I frankly do not care much what classical is supposed to be. For all I know, classical just means "quite a few people use it".

Solution-wise, my first impression is that "classical" or whatever name just adds another confusing option with an unclear name. I hope that this discussion can go in the direction of actually making all good (common) options available to users in a clean and transparent way. Best in a way that people actually might understand.

We can add one more method, but frankly I only half like it. When we last added more methods (I don't remember what changed exactly) I already delayed and hoped that someone would jump up and figure out what we should have. Needless to say, it never really happened. And now I am trying to point to the differences and see how they might fit with what we currently have.

So, my impression is (with possible problems with rounding and exact percentile matches) that we have (probably too) many "interpolation" options and would require the "C parameter", or whatever you want to call it, to be able to do almost anything. And I would be really happy if someone could tell me how all the (common) methods out there fall into those categories; it seems that more than C=0, 0.5, 1 exist, and maybe some even outside those options....

Maybe I am going down the wrong lane, but adding "Method1" with an unclear name that does not really tell anyone how it differs from the other methods does not seem helpful to me (except for someone who happens to already know the name "Method1" and is looking for it). And please don't say that "classic" is the one obvious one, there is way too much variance in implementations out there.

Another way might be to deprecate "interpolation", but having a list of methods is also much less nice than hinting "linear interpolation" to say that it is not a step behaviour, etc.... And if we go that way, I still want a reasonable overview.

You do not have to do it, but if we want to add a new method, we need a way to add it that does not confuse everyone even more and is clear!

Let me summarize it then:

1) Right now numpy offers only one useful method, interpolation='linear'; the others are just small variations around it that don't seem to really be used by anyone. Other packages have many more relevant options.

2) Adding the other values C=0 or C=0.5 makes sense to me. All the interpolation methods can work in combination with them, although again they are probably never going to be used.

3) If one of the combos of interpolation method and C argument manages to replicate the classical method (the reference, Wikipedia, and my personal experience agree that it is the most commonly taught method), then I am happy with it. It can be stated in the docs that such a combo produces the classical non-interpolating method. I am not sure whether the remaining differences are just float precision issues, but I appreciate your effort to tackle this in a more integrated way!

4) If none of the combos achieves the same result, then I think a different method would make sense. Calling it interpolation='none' would possibly be the least confusing.

In sum: the current options of numpy.percentile seem both rather confusing and limited. The paper mentioned above offers a good overview of other useful methods. Together with the Wikipedia page, they could work as a starting point for the design of a more exhaustive and useful set of options for numpy.percentile. Hopefully someone will want to work on this task.

Does the current "nearest" make sense in some/any cases? If the spacing method ("C") or whatever makes such a big difference for linear interpolation/fractional stuff, I am maybe just surprised nobody ever did it for non-fractional approximations?! Is constant support all that important, and is there a reason to drop the inverse-CDF argument for the interpolation methods?

Combos are useless unless they are understandable and the commonly used ones are easy to find, so I doubt it. For interpolation many options seem to exist (e.g. http://mathworld.wolfram.com/Quantile.html Q4 to Q9; I think the R documentation is practically identical, but it is likely not complete, e.g. matlab...), though I have no clue whether they actually all make sense ;).

The thing is, "interpolation" points to what to do between exactly defined points, but there are many (oddly many) ways to place those points, at least when using "linear interpolation", so it seems like a bad approach to add to it. You wanted a "nearest-rank" method, which sounds a lot like (and is in spirit) interpolation='nearest', but the choice of the exact "plotting position" seems non-standard, so it will be impossible to guess and thus a poor choice.

Then I would even prefer to aggressively deprecate everything (except probably linear).

But if we deprecate, I want to get it 100% right, and that might need a bit more clarity as to what exists, what should exist and what should definitely not exist.

I totally agree with you

@ricardoV94: do you have any opinions on the definitions of linear for the weighted quantile case proposed at #9211? There are some graphs there in the same style.

Maybe @ricardoV94 can comment on it (that would be cool), but I think the issue is pretty orthogonal. Weights are probably simply frequency-type weights; assuming there are no other reasonably defined weights for percentile (I don't see how there could be), there should not be any ambiguity when implementing them, but I do not know for sure.

You could also try to ping josef-pkt on that PR and hope he has a quick comment whether he thinks it is a good idea/right.

If anyone wants to take it from here, I wrote a non-optimized function that computes the 9 percentile/quantile estimation methods described by Hyndman and Fan (1996) and also used in R.

Method 1 corresponds to the 'classical nearest-rank method' as discussed on Wikipedia. Method 7 is equivalent to the current Numpy implementation (interpolation='linear'). The remaining Numpy interpolation options are not included (they don't seem to be useful anyway).

def percentile(x, p, method=7):
    '''
    Compute the qth percentile of the data.

    Returns the qth percentile(s) of the array elements.

    Parameters
    ----------
    x : array_like
        Input array or object that can be converted to an array.
    p : float in range of [0,100] (or sequence of floats)
        Percentile to compute, which must be between 0 and 100 inclusive.
    method : integer in range of [1,9]
        This optional parameter specifies one of the nine sampling methods 
        discussed in Hyndman and Fan (1996). 

        Methods 1 to 3 are discontinuous:
        * Method 1: Inverse of empirical distribution function (oldest
        and most studied method).
        * Method 2: Similar to type 1 but with averaging at discontinuities.
        * Method 3: SAS definition: nearest even order statistic.

        Methods 4 to 9 are continuous and equivalent to a linear interpolation 
        between the points (pk,xk) where xk is the kth order statistic. 
        Specific expressions for pk are given below:
        * Method 4: pk=k/n. Linear interpolation of the empirical cdf.
        * Method 5: pk=(k−0.5)/n. Piecewise linear function where the knots 
        are the values midway through the steps of the empirical cdf 
        (Popular amongst hydrologists, used by Mathematica?).
        * Method 6: pk=k/(n+1), thus pk=E[F(xk)]. The sample space is divided
        in n+1 regions, each with probability of 1/(n+1) on average
        (Used by Minitab and SPSS).
        * Method 7: pk=(k−1)/(n−1), thus pk=mode[F(xk)]. The sample space
        is divided into n-1 regions (This is the default method of 
        Numpy, R, S, and MS Excel).
        * Method 8: pk=(k−1/3)/(n+1/3), thus pk≈median[F(xk)]. The resulting
        estimates are approximately median-unbiased regardless of the
        distribution of x (Recommended by Hyndman and Fan (1996)).
        * Method 9: pk=(k−3/8)/(n+1/4), thus pk≈F[E(xk)] if x is normal.
        The resulting estimates are approximately unbiased for the expected 
        order statistics if x is normally distributed (Used for normal QQ plots).

        References:
        Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, 
        American Statistician 50, 361--365.
        Schoonjans, F., De Bacquer, D., & Schmid, P. (2011). Estimation of population
        percentiles. Epidemiology (Cambridge, Mass.), 22(5), 750.

        '''

    method = method-1    
    x = np.asarray(x)
    x.sort()
    p = np.array(p)/100

    n = x.size  
    m = [0, 0, -0.5, 0, 0.5, p, 1-p, (p+1)/3, p/4+3/8][method]

    npm = n*p+m
    j = np.floor(npm).astype(int)
    g = npm-j

    # Discontinuous functions
    if method < 3:
        yg0 = [0, 0.5, 0][method]
        y = np.ones(p.size)
        if method < 2:
            y[g==0] = yg0
        else:
            y[(g==0) & (j%2 == 0)] = yg0      
    # Continuous functions
    else:
        y = g

    # Adjust indexes to work with Python
    j_ = j.copy()
    j[j<=0] = 1
    j[j > n] = n
    j_[j_ < 0] = 0
    j_[j_ >= n] = n-1 

    return (1-y)* x[j-1] + y*x[j_]
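
A quick sanity check of the function above against the earlier examples:

# H&F method 1 reproduces the Wikipedia nearest-rank results
print(percentile([3, 6, 7, 8, 8, 10, 13, 15, 16, 20], [25, 50, 75, 100], method=1))
# -> [ 7.  8. 15. 20.]

# H&F method 7 matches numpy's current default (interpolation='linear')
print(percentile([15, 20, 35, 40, 50], [5, 30, 40, 50, 100], method=7))
# -> [16. 23. 29. 35. 50.]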

The continuous methods can also be implemented more efficiently like this.

def percentile_continuous(x, p, method=7):
    '''
    Compute the qth percentile of the data.

    Returns the qth percentile(s) of the array elements.

    Parameters
    ----------
    x : array_like
        Input array or object that can be converted to an array.
    p : float in range of [0,100] (or sequence of floats)
        Percentile to compute, which must be between 0 and 100 inclusive.
    method : integer in range of [4,9]
        This optional parameter specifies one of the 5 continuous sampling
        methods discussed in Hyndman and Fan (1996). 
        '''

    x = np.asarray(x)
    x.sort()
    p = np.asarray(p)/100
    n = x.size

    if method == 4:
        r = p * n
    elif method == 5:
        r = p * n + .5
    elif method == 6:
        r = p * (n+1)
    elif method == 7:
        r = p * (n-1) + 1
    elif method == 8:
        r = p * (n+1/3) + 1/3
    elif method == 9:
        r = p * (n+1/4) + 3/8

    index = np.floor(r).astype(int)

    # Adjust indexes to work with Python
    index_ = index.copy()
    index[index_ <= 0] = 1
    index[index_  > n] = n
    index_[index_ < 0] = 0
    index_[index_ >= n] = n-1

    i = x[index - 1]
    j = x[index_]

    return i + r % 1 * (j - i)
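
And the same check for the continuous version, against the current np.percentile default:

print(percentile_continuous([15, 20, 35, 40, 50], [5, 30, 40, 50, 100], method=7))
# -> [16. 23. 29. 35. 50.]
print(np.percentile([15, 20, 35, 40, 50], [5, 30, 40, 50, 100]))
# -> [16. 23. 29. 35. 50.]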

Does anyone want to take it from here? I am not qualified to do so.

As mentioned in the previous post, it seems that numpy's current default implementation of quantile matches that in R.

In R:

> quantile(c(15, 20, 35, 40, 50), probs=c(0.05, 0.3, 0.4, 0.5, 1))
  5%  30%  40%  50% 100% 
  16   23   29   35   50 
> quantile(c(3, 6, 7, 8, 8, 10, 13, 15, 16, 20), probs=c(0.25, 0.5, 0.75, 1))
  25%   50%   75%  100% 
 7.25  9.00 14.50 20.00
> quantile(c(3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20), probs=c(0.25, 0.5, 0.75, 1))
 25%  50%  75% 100% 
 7.5  9.0 14.0 20.0 

In np.quantile:

>>> np.quantile([15, 20, 35, 40, 50], q=[0.05, 0.3, 0.4, 0.5, 1])
array([16., 23., 29., 35., 50.])
>>> np.quantile([3, 6, 7, 8, 8, 10, 13, 15, 16, 20], q=[0.25, 0.5, 0.75, 1])
array([ 7.25,  9.  , 14.5 , 20.  ])
>>> np.quantile([3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20], q=[0.25, 0.5, 0.75, 1])
array([ 7.5,  9. , 14. , 20. ])

which of course do not reproduce the examples given in Wikipedia:
https://en.wikipedia.org/wiki/Percentile

In fact, if you go to the R help page for quantile https://www.rdocumentation.org/packages/stats/versions/3.5.0/topics/quantile
you'd see that the R default method (Type 7) sets the boundary conditions identically to how np.quantile sets them: p_k = (k-1) / (n-1), where n is the sample size, and k=1 denotes the smallest value while k=n the largest. That means the smallest value in the sorted array is pinned at quantile=0, and the largest is pinned at quantile=1.
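
Working one of those values out by hand under that convention (q = 0.3 on the first example):

import numpy as np

x = np.sort([15, 20, 35, 40, 50])
q, n = 0.3, x.size

# Type 7: target position h = q*(n-1) among the 0-based order statistics,
# then interpolate linearly between the two neighbours
h = q * (n - 1)                           # 1.2
lo, g = int(np.floor(h)), h % 1
print((1 - g) * x[lo] + g * x[lo + 1])    # 0.8*20 + 0.2*35 = 23 (up to rounding), matching R and np.quantile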

Also as mentioned in the previous post, you could reproduce the 3 examples in Wikipedia with Type 1:

> quantile(c(15, 20, 35, 40, 50), probs=c(0.05, 0.3, 0.4, 0.5, 1), type=1)
  5%  30%  40%  50% 100% 
  15   20   20   35   50 
> quantile(c(3, 6, 7, 8, 8, 10, 13, 15, 16, 20), probs=c(0.25, 0.5, 0.75, 1), type=1)
 25%  50%  75% 100% 
   7    8   15   20 
> quantile(c(3, 6, 7, 8, 8, 9, 10, 13, 15, 16, 20), probs=c(0.25, 0.5, 0.75, 1), type=1)
 25%  50%  75% 100% 
   7    9   15   20 

That raises some interesting questions:

1.) should np.quantile's default track the default of R.quantile?
2.) should np.quantile switch to the Type 1 algorithm?

Since even Wikipedia itself agrees that there's no standard definition of percentile, I think as long as the algorithm is sound and the user is aware of how it works, neither (1) nor (2) matters that much. I'm more in favor of (1) because Python and R are two of the most popular data analytics platforms out there, and it'd be nice if they can vet each other. Given that, I think (2) is unnecessary.

Yes, both R and Numpy default to method 7 and it should be kept like that. The question is about adding the other methods or not.

If anyone is interested, I put up an independent module with the 9 percentile methods here. Feel free to use it or adapt it for Numpy if you know how.

Thank you @ricardoV94 .

So, just for kicks, I did a poll at work among the R users. Of the 20 people who responded, all 20 use only the default method in quantile. They range from Master's students in public health to PhD researchers in statistics.

Personally, I'm not sure if it's worth the effort for numpy to support 9 different ways to compute quantile. I think most users will just use the default.

For what it's worth, there is the scipy.stats.mstats.mquantiles function, which supports 6 of the 9 methods (the continuous ones), and its docs state very explicitly the links with the R implementation.

@albertcthomas ah, that is good to know. Although I think ideally we would hide this complexity in numpy a bit. And we mostly need to fix the discontinuous versions IIRC, because those basically do not give the most common methods.

Yes indeed, numpy may not necessarily have to support these methods if they are implemented in the scipy stats module.

Personally I would be in favor of having a method computing the quantile from the generalized inverse of the cumulative distribution function. The fact that such a method is not available led me to this issue :).

@albertcthomas if you have any hints/knowledge about this, please say so! We are a bit stuck because of a lack of clarity what is actually good default. And I do think it is a pretty annoying issue.

Most importantly we need a few good defaults. And that probably means implementing 2-3 methods (completely revamping the discontinuous ones). I am OK with supporting more or more complex stuff, but I would love it if we could decide on a few "typical/good" ones.

I would say that the linear method (current default) and the inverse of the cumulative distribution function (which is what I was looking for when I created this issue, as was @albertcthomas if I understand correctly) would suffice. Basically it lets one choose whether they want interpolation or not.

And the other alternatives currently implemented should definitely be removed.

The inverse of the cumulative distribution function should definitely be added. It is one of the most popular estimators of a quantile from a given sample of observations in statistics.
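
For reference, a minimal sketch of what that estimator looks like (the name quantile_inverted_cdf is just illustrative):

import numpy as np

def quantile_inverted_cdf(x, q):
    # Q(q) = inf{t : F_n(t) >= q}, i.e. the ceil(q*n)-th order statistic (1-based)
    xs = np.sort(np.asarray(x))
    n = xs.size
    ranks = np.ceil(np.asarray(q, dtype=float) * n).astype(int)
    return xs[np.clip(ranks, 1, n) - 1]

print(quantile_inverted_cdf([3, 6, 7, 8, 8, 10, 13, 15, 16, 20], [0.25, 0.5, 0.75, 1.0]))
# -> [ 7  8 15 20], the nearest-rank values from the Wikipedia example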

And the other alternatives currently implemented should definitely be removed.

@ricardoV94 are you saying this because none of the alternatives are referenced in Wikipedia nor the Hyndman and Fan's paper?

Yes, afaik they are not implemented in any other package.

I don't see why anyone would want to use those methods, and their names are also potentially misleading.


Thanks! Why not open a PR to add the inverse of the cumulative distribution function as a method available in np.percentile, while keeping this issue open if we want to keep discussing alternatives (except the current default, which should stay the default)? How is deprecation handled in numpy?

Some more information here - Python 3.8 added statistics.quantiles - we should look into adding an equivalent mode to np.quantile
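
For comparison, a small illustration (my reading of the statistics docs is that 'inclusive' lines up with the current np.quantile default, while 'exclusive' corresponds to H&F type 6):

import statistics
import numpy as np

data = [15, 20, 35, 40, 50]

print(statistics.quantiles(data, n=4, method='inclusive'))   # [20.0, 35.0, 40.0]
print(np.quantile(data, [0.25, 0.5, 0.75]))                  # [20. 35. 40.]
print(statistics.quantiles(data, n=4, method='exclusive'))   # [17.5, 35.0, 45.0]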

The way forward here is probably to add a method kwarg mirroring the statistics one, and possibly adding 0-2 more (in which case it would be good to ping the original authors over at python).

I am not sure if the defaults match up between ours and theirs, which would be a shame if they do not, but it still seems like the best idea (and pretty much what we had in mind anyway). 0-2 new "methods" would be OK to add as well. In which case it would be good to ping the python statistics people on the actual names...

PRs very welcome, I would like this to move forward, but I will not do it in the immediate future.

@eric-wieser I note that you have a couple of related PRs outstanding, do any of them deal with this?

I'm going to push this off to 1.19 so it isn't a blocker. But that doesn't mean it cannot be fixed for 1.18 :)

@charris: Which PRs do you have in mind?

I do not think there is any in this direction yet, unfortunately.
