Numpy: reduceat cornercase (Trac #236)

Created on 19 Oct 2012  ·  49 Comments  ·  Source: numpy/numpy

_Original ticket http://projects.scipy.org/numpy/ticket/236 on 2006-08-07 by trac user martin_wiechert, assigned to unknown._

.reduceat does not handle repeated indices correctly. When an index is repeated the neutral element of the operation should be returned. In the example below [0, 10], not [1, 10], is expected.

In [1]:import numpy

In [2]:numpy.version.version
Out[2]:'1.0b1'

In [3]:a = numpy.arange (5)

In [4]:numpy.add.reduceat (a, (1,1))
Out[4]:array([ 1, 10])
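
For comparison, reducing the corresponding slices directly gives the expected values, since an empty add-reduction returns the additive identity:

import numpy
a = numpy.arange(5)
numpy.add.reduce(a[1:1])   # 0 - the additive identity, what reduceat is expected to return
numpy.add.reduce(a[1:])    # 10
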
Labels: 01 - Enhancement, 23 - Wish List, numpy.core

Most helpful comment

The main motivation for reduceat is to avoid a loop over reduce for maximum speed. So I am not entirely sure a wrapper of a for loop over reduce would be a very useful addition to Numpy. It would go against reduceat's main purpose.

Moreover the logic for reduceat's existence and API, as a fast vectorized replacement for a loop over reduce, is clean and useful. I would not deprecate it, but rather fix it.

Regarding reduceat speed, let's consider a simple example, but similar to some real-world cases I have in my own code, where I use reduceat:

n = 10000
arr = np.random.random(n)
inds = np.random.randint(0, n, n//10)
inds.sort()

%timeit out = np.add.reduceat(arr, inds)
10000 loops, best of 3: 42.1 µs per loop

%timeit out = piecewise_reduce(np.add, arr, inds)
100 loops, best of 3: 6.03 ms per loop

This is a time difference of more than 100x and illustrates the importance of preserving reduceat efficiency.

In summary, I would prioritize fixing reduceat over introducing new functions.

Having start_indices and end_indices, although useful in some cases, is often redundant and I would see it as a possible addition, but not as a fix for the current reduceat inconsistent behaviour.

All 49 comments

_@teoliphant wrote on 2006-08-08_

Unfortunately, perhaps, the reduceat method of NumPy follows the behavior of the reduceat method of Numeric for this corner case.

There is no facility for returning the "identity" element of the operation in cases of index-equality. The defined behavior is to return the element given by the first index if the slice returns an empty sequence. Therefore, the documented and actual behavior of reduceat in this case is to construct

[a[1], add.reduce(a[1:])]

This is a feature request.

_trac user martin_wiechert wrote on 2006-08-08_

also see ticket #835

Milestone changed to 1.1 by @alberts on 2007-05-12

Milestone changed to Unscheduled by @cournape on 2009-03-02

I think this is closely connected to #835: If one of the indices is len(a), reduceat cannot output the element at that index, which is needed if the index len(a) appears or is repeated at the end of the indices.

Some solutions:

  • an option to reduceat to not set any value in the output where end - start == 0
  • an option to set the output to a given fixed value where end - start == 0
  • a where parameter, like in ufunc(), which masks which outputs should be calculated at all.

Has there been any more thought on this issue? I would be interested in having the option to set the output to the identity value (if it exists) where end - start == 0.

I strongly support the change of the reduceat behaviour as suggested in this long-standing open issue. It looks like a clear bug or obvious design mistake which hinders the usefulness of this great Numpy construct.

reduceat should behave consistently for all indices. Namely, for every index i, ufunc.reduceat(a, indices) should return, as its i-th element, ufunc.reduce(a[indices[i]:indices[i+1]]).

This should also be true for the case indices[i] == indices[i+1]. I cannot see any sensible reason why, in this case, reduceat should return a[indices[i]] instead of ufunc.reduce(a[indices[i]:indices[i+1]]).
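
As a sketch of those desired semantics, a pure-Python reference would look like this (the helper name reduceat_reference is hypothetical, for illustration only):

import numpy as np

def reduceat_reference(ufunc, a, indices):
    # out[i] = ufunc.reduce(a[indices[i]:indices[i+1]]), with the last bin
    # running to the end of the array
    ends = list(indices[1:]) + [len(a)]
    return np.array([ufunc.reduce(a[i:j]) for i, j in zip(indices, ends)])

reduceat_reference(np.add, np.arange(5), [1, 1])  # array([ 0, 10]), unlike np.add.reduceat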

See also HERE a similar comment by Pandas creator Wes McKinney.

Wow, this is indeed terrible and broken.

We'd need some discussion on the mailing list, but I at least would be totally in favor of making that issue a FutureWarning in the next release and fixing the behavior a few releases later. We'd need someone to take the lead on starting that discussion and writing the patch. Perhaps that's you?

Thanks for the supportive response. I can start a discussion if this helps, but unfortunately am not up to patching the C code.

What do you intend for ufuncs without an identity, such as np.maximum?

For such functions, an empty reduction should be an error, as it already is when you use .reduce() instead of .reduceat().

Indeed, the behaviour should be driven by the consistency with ufunc.reduce(a[indices[i]:indices[i+1]]), which is what every user would expect. So this does not require new design decisions. It really looks just like a long standing bug fix to me. Unless anybody can justify the current inconsistent behaviour.

@njsmith I am unable to sign up to the Numpy list. I sent my address here https://mail.scipy.org/mailman/listinfo/numpy-discussion but I never get any "email requesting confirmation". Not sure whether one needs special requirements to subscribe...

@divenex: did you check your spam folder? (I always forget to do that...) Otherwise I'm not sure what could be going wrong. There definitely shouldn't be any special requirements to subscribe beyond "has an email address". If you still can't get it to work then speak up and we'll try to track down the relevant sysadmin... We definitely want to know if it's broken.

A version of reduceat that is consistent with ufunc.reduce(a[indices[i]:indices[i+1]]) would be really, really nice. It would be so much more useful! Either an argument to select the behavior or a new function (reduce_intervals? reduce_segments? ...?) would avoid breaking backwards incompatibility.

I'd perhaps be tempted to deprecate np.ufunc.reduceat altogether - it seems more useful to be able to specify a set of start and end indices, to avoid cases where indices[i] > indices[i+1]. Also, the at in the name suggests a much greater similarity to at than actually exists.

What I'd propose as a replacement is np.reducebins (originally named np.piecewise_reduce), possibly pure-python, which basically does:

def reducebins(func, arr, start=None, stop=None, axis=-1, out=None):
    """
    Compute (in the 1d case) `out[i] = func.reduce(arr[start[i]:stop[i]])`

    If only `start` is specified, this computes the same result that `reduceat` did:

        `out[i]  = func.reduce(arr[start[i]:start[i+1]])`
        `out[-1] = func.reduce(arr[start[-1]:])`

    If only `stop` is specified, this computes:

        `out[0] = func.reduce(arr[:stop[0]])`
        `out[i] = func.reduce(arr[stop[i-1]:stop[i]])`

    """
    # convert to 1d arrays
    if start is not None:
        start = np.array(start, copy=False, ndmin=1, dtype=np.intp)
        assert start.ndim == 1
    if stop is not None:
        stop = np.array(stop, copy=False, ndmin=1, dtype=np.intp)
        assert stop.ndim == 1

    # default arguments that do useful things
    if start is None and stop is None:
        raise ValueError('At least one of start and stop must be specified')
    elif stop is None:
        # start only means reduce from one index to the next, and the last to the end
        stop = np.empty_like(start)
        stop[:-1] = start[1:]
        stop[-1] = arr.shape[axis]
    elif start is None:
        # stop only means reduce from the start to the first index, and one index to the next
        start = np.empty_like(stop)
        start[1:] = stop[:-1]
        start[0] = 0
    else:
        # TODO: possibly confusing?
        start, stop = np.broadcast_arrays(start, stop)

    # allocate output - not clear how to do this safely for subclasses
    if out is None:
        sh = list(arr.shape)
        sh[axis] = len(stop)
        sh = tuple(sh)
        out = np.empty(shape=sh)

    # below assumes axis=0 for brevity here
    for i, (si, ei) in enumerate(zip(start, stop)):
        func.reduce(arr[si:ei,...], out=out[i, ...], axis=axis)
    return out

Which has the nice properties that:

  • np.add.reduce(arr) is the same as reducebins(np.add, arr, 0, len(arr))
  • np.add.reduceat(arr, inds) is the same as reducebins(np.add, arr, inds)
  • np.add.accumulate(arr) is the same as reducebins(np.add, arr, 0, np.arange(1, len(arr) + 1))
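
As a quick usage check of the prototype above against the corner case from this issue (1-D, np.add; the float outputs come from the prototype's simple np.empty allocation):

arr = np.arange(5)

np.add.reduceat(arr, [1, 1])           # array([ 1, 10])  - current corner-case behaviour
reducebins(np.add, arr, start=[1, 1])  # array([ 0., 10.]) - the empty bin reduces to the identity
reducebins(np.add, arr, stop=[1, 3])   # array([ 0.,  3.]) - reduce(arr[:1]), reduce(arr[1:3])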

Now, does this want to go through the __array_ufunc__ machinery? Most of what needs to be handled should be already covered by func.reduce - the only issue is the np.empty line, which is a problem that np.concatenate shares.

That sounds like a nice solution to me from an API perspective. Even just being able to specify two sets of indices to reduceat would suffice. From an implementation perspective? Well it's not very hard to change the current PyUFunc_Reduceat to support having two sets of inds, if that provides benefit. If we really see the advantage in supporting the accumulate-like use-case efficiently, it would not be hard to do that either.

Marten proposed something similar to this in a similar discussion from ~1 year ago, but he also mentioned the possibility of adding a 'step' option:

http://numpy-discussion.10968.n7.nabble.com/Behavior-of-reduceat-td42667.html

Things I like (so +1 if anyone is counting) from your proposal:

  • Creating a new function, rather than trying to salvage the existing
    one.
  • Making the start and end indices arguments specific, rather than
    magically figuring them out from a multidimensional array.
  • The defaults for the None indices are very neat.

Things I think are important to think hard about for this new function:

  • Should we make 'step' an option? (I'd say yes)
  • Does it make sense for the indices arrays to broadcast, or must they
    be 1D?
  • Should this be a np function, or a ufunc method? (I think I prefer it
    as a method)

And from the bike-shedding department, things I would like better:

  • Give it a more memorable name, but I have no proposal.
  • Use 'start' and 'stop' (and 'step' if we decide to go for it) for
    consistency with np.arange and Python's slice.
  • Dropping the _indices from the kwarg names.

Jaime


Use 'start' and 'stop'

Done

Should we make 'step' an option

Seems like a pretty narrow use case

Does it make sense for the indices arrays to broadcast, or must they be 1D

Updated. More than 1d is obviously bad, but I think we should allow 0d and broadcasting, for cases like accumulate.

Should this be a np function, or a ufunc method? (I think I prefer it
as a method)

Every ufunc method is one more thing for __array_ufunc__ to handle.

The main motivation for reduceat is to avoid a loop over reduce for maximum speed. So I am not entirely sure a wrapper of a for loop over reduce would be a very useful addition to Numpy. It would go against reduceat's main purpose.

Moreover the logic for reduceat's existence and API, as a fast vectorized replacement for a loop over reduce, is clean and useful. I would not deprecate it, but rather fix it.

Regarding reduceat speed, let's consider a simple example, but similar to some real-world cases I have in my own code, where I use reduceat:

n = 10000
arr = np.random.random(n)
inds = np.random.randint(0, n, n//10)
inds.sort()

%timeit out = np.add.reduceat(arr, inds)
10000 loops, best of 3: 42.1 µs per loop

%timeit out = piecewise_reduce(np.add, arr, inds)
100 loops, best of 3: 6.03 ms per loop

This is a time difference of more than 100x and illustrates the importance of preserving reduceat efficiency.

In summary, I would prioritize fixing reduceat over introducing new functions.

Having start_indices and end_indices, although useful in some cases, is often redundant and I would see it as a possible addition, but not as a fix for the current reduceat inconsistent behaviour.

I don't think allowing start and stop indices to come from different arrays would make a big difference to efficiency if implemented in C.


This is a time difference of more than 100x and illustrates the importance of preserving reduceat efficiency.

Thanks for that - I guess I underestimated the overhead associated with the first stage of a reduce call (that only happens once for reduceat).

Not an argument against a free function, but certainly an argument against implementing it in pure python.

but not as a fix for the current reduceat inconsistent behaviour.

The problem is, that it's tricky to change the behaviour of code that's been around for so long.


Another possible extension: when start[i] > stop[i], compute the inverse:

    for i, (si, ei) in enumerate(zip(start, stop)):
        if si <= ei:
            func.reduce(arr[si:ei,...], out=out[i, ...], axis=axis)
        else:
            func.reduce(arr[ei:si,...], out=out[i, ...], axis=axis)
            func.inverse(func.identity, out[i, ...], out=out[i, ...])

Where np.add.inverse = np.subtract, np.multiply.inverse = np.true_divide. This results in the nice property that

func.reduce(func.reduceat(x, inds_from_0)) == func.reduce(x)

For example

a = [1, 2, 3, 4]
inds = [0, 3, 1]
result = np.add.reduceat(a, inds) # [6, -5, 9] == [(1 + 2 + 3), -(3 + 2), (2 + 3 + 4)]

The problem is, that it's tricky to change the behaviour of code that's been around for so long.

This is partially why in the e-mail thread I suggested to give special meaning to a 2-D array of indices in which the extra dimension is 2 or 3: it then is (effectively) interpreted as a stack of slices. But I realise this is also somewhat messy and of course one might as well have a reduce_by_slice, slicereduce, or reduceslice method.

p.s. I do think anything that works on many ufuncs should be a method, so that it can be passed through __array_ufunc__ and be overridden.

Actually, a different suggestion that I think is much better: rather than salvaging reduceat, why not add a slice argument (or start, stop, step) to ufunc.reduce!? As @eric-wieser noted, any such implementation means we can just deprecate reduceat altogether, as it would just be

add.reduce(array, slice=slice(indices[:-1], indices[1:]))

(where now we are free to make the behaviour match what is expected for an empty slice)

Here, one would broadcast the slice if it were 0-d, and might even consider passing in tuples of slices if a tuple of axes was used.

EDIT: made the above slice(indices[:-1], indices[1:]) to allow for extension to a tuple of slices (slice can hold arbitrary data, so this would work fine).

I would still find a fix to reduceat, to make it a proper 100% vectorized version of reduce, the most logical design solution. Alternatively, to avoid breaking code (but see below), an equivalent method named like reducebins could be created, which is simply a corrected version of reduceat. In fact, I agree with @eric-wieser that the naming of reduceat conveys more connection to the at function than there is.

I do understand the need not to break code. But I must say that I find it hard to imagine that much code depended on the old behavior, given that it simply did not make logical sense, and I would simply call it a long-standing bug. I would expect that code using reduceat just made sure indices were not duplicated, to avoid a nonsense result from reduceat, or fixed the output as I did using out[:-1] *= np.diff(indices) > 0. Of course I would be interested in a use case where the old behavior/bug was used as intended.
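
For reference, on the original ticket's example that workaround looks as follows (it zeroes bins whose index is immediately repeated, so it only stands in for the identity when the identity is 0, i.e. for np.add):

import numpy as np

a = np.arange(5)
indices = np.array([1, 1, 3])

out = np.add.reduceat(a, indices)   # array([1, 3, 7]) - first entry is the bogus a[1]
out[:-1] *= np.diff(indices) > 0    # array([0, 3, 7]) - empty bins forced to zero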

I am not fully convinced about @mhvk's slice solution because it introduces a non-standard usage for the slice construct. Moreover it would be inconsistent with the current design idea of reduce, which is to _"reduce a's dimension by one, by applying ufunc along one axis."_

I also do not see a compelling use case for both start and end indices. In fact, I see the nice design logic of the current reduceat method as conceptually similar to np.histogram, where bins, which _"defines the bin edges,"_ are replaced by indices, which also represent the bin edges, but in index space rather than value. And reduceat applies a function to the elements contained inside each pair of bin edges. The histogram is an extremely popular construct, but it does not need, and in Numpy does not include, an option to pass two vectors of left and right edges. For the same reason I doubt there is a strong need for both edges in reduceat or its replacement.

The main motivation for reduceat is to avoid a loop over reduce for maximum speed. So I am not entirely sure a wrapper of a for loop over reduce would be a very useful addition to Numpy. It would go against reduceat main purpose.

I agree with @divenex here. The fact that reduceat requires indices to be sorted and non-overlapping is a reasonable constraint to ensure that the loop can be computed in a cache-efficient manner with a single pass over the data. If you want overlapping bins, there are almost certainly better ways to compute the desired operation (e.g., rolling window aggregations).
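
For instance, with np.add, sums over arbitrary (possibly overlapping) intervals can be computed in a single pass over the data using a cumulative sum, with no reduceat-style loop at all (a minimal sketch; the interval values are made up for illustration):

import numpy as np

arr = np.random.random(10000)
starts = np.array([0, 5, 5, 100])        # overlapping, unsorted intervals are fine
stops = np.array([50, 25, 5, 10000])

csum = np.concatenate(([0.0], np.cumsum(arr)))   # csum[j] == arr[:j].sum()
interval_sums = csum[stops] - csum[starts]       # == [arr[s:e].sum() for s, e in zip(starts, stops)]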

I also agree that the cleanest solution is to define a new method such as reducebins with a fixed API (and deprecate reduceat), and not to try to squeeze it into reduce, which already does something different.

Hi everyone,

I want to nip in the bud the discussion that this is a bug. This is documented behaviour from the docstring:

For i in ``range(len(indices))``, `reduceat` computes
``ufunc.reduce(a[indices[i]:indices[i+1]])``, which becomes the i-th
generalized "row" parallel to `axis` in the final result (i.e., in a
2-D array, for example, if `axis = 0`, it becomes the i-th row, but if
`axis = 1`, it becomes the i-th column).  There are three exceptions to this:

* when ``i = len(indices) - 1`` (so for the last index),
  ``indices[i+1] = a.shape[axis]``.
* if ``indices[i] >= indices[i + 1]``, the i-th generalized "row" is
  simply ``a[indices[i]]``.
* if ``indices[i] >= len(a)`` or ``indices[i] < 0``, an error is raised.

As such, I oppose any attempt to change the behaviour of reduceat.

A quick github search shows many, many uses of the function. Is everyone here certain that they all use only strictly increasing indices?

Regarding the behaviour of a new function, I would argue that without separate start/stop arrays, the functionality is severely hampered. There are many situations where one would want to measure values in overlapping windows that are not regularly arrayed (so rolling windows would not work). For example, regions of interest determined by some independent method. And @divenex has shown that the performance difference over Python iteration can be massive.

There are many situations where one would want to measure values in overlapping windows that are not regularly arrayed (so rolling windows would not work).

Yes, but you wouldn't want to use a naive loop such as the one implemented by reduceat. You'd want to implement your own rolling window calculation storing intermediate results in some way so it can be done in a single linear pass over the data. But now we're talking about an algorithm that is much more complicated than reduceat.

@shoyer I can envision cases where only some of the ROIs are overlapping. In such cases, writing a customised algorithm would be huge overkill. Let's not forget that our main userbase is scientists, who are typically time poor and need a "good enough" solution, not the absolute optimum. The low constant factors associated with np.reduceat's complexity mean it would be hard or impossible to get a better solution with pure Python code — most often the only code users are willing to write.

@jni Sure, reducing into groups with arbitrary starts and stops could be useful. But it feels like a significant increase in scope to me, and something better suited to another method rather than a replacement for reduceat (which we certainly want to deprecate, even if we never remove it).

reducing into groups with arbitrary starts and stops could be useful. But it feels like a significant increase in scope to me

This seems very trivial to me. Right now, we have code that does essentially ind1 = indices[i], ind2 = indices[i + 1]. Changing that to use two different arrays instead of the same one should be very little effort.

And the single-pass behaviour when passed contiguous ranges should be almost exactly as fast as it is right now - the only overhead is one more argument to the nditer

This seems very trivial to me.

Exactly. Moreover, it's a functionality that users have with reduceat (by using every other index), but would lose with a new function that doesn't support overlap.

Furthermore, a two-index form could emulate the old (bizarre) behaviour:

def reduceat(func, arr, inds):
    deprecation_warning()  # placeholder for emitting a DeprecationWarning pointing at reducebins
    starts = inds
    stops = np.zeros_like(inds)
    stops[:-1] = starts[1:]
    stops[-1] = len(arr)
    np.add(stops, 1, where=(stops == starts), out=stops)  # reintroduce the "bug" that we would have to keep
    return reducebins(func, arr, starts, stops)

Meaning we don't need to maintain two very similar implementations

I am not strongly against starts and stops indices for the new reducebins, although I still cannot see an obvious example where they are both needed. It feels like generalizing np.histogram by adding starting and ending bin edges...

Ultimately, this is fine as long as the main usage is not affected and one can still also call reducebins(arr, indices) with a single array of indices and without speed penalty.

Of course there are many situations where one needs to operate on non-overlapping bins, but in this case I would generally expect the bins not to be defined by pairs of edges alone. An available function for this kind of scenario is Scipy's ndimage.labeled_comprehension, along with related functions like ndimage.sum and so on.

But this seems quite different from the scope of reducebins.

So, what would be a natural usage case for starts and stops in reducebins?

So, what would be a natural usage case for starts and stops in reducebins?

Achievable by other means, but a moving average of length k would be reducebins(np.add, arr, np.arange(n-k), k + np.arange(n-k)). I suspect that, ignoring the cost of allocating the indices, performance would be comparable to an as_strided approach.

Uniquely, reducebins would allow a moving average of varying duration, which is not possible with as_strided
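
For comparison, the fixed-length case with as_strided would look like the sketch below; per-window lengths cannot vary there, which is the point above:

import numpy as np
from numpy.lib.stride_tricks import as_strided

n, k = 1000, 5
arr = np.random.random(n)

# view of shape (n - k + 1, k): row i is arr[i:i + k]
windows = as_strided(arr, shape=(n - k + 1, k),
                     strides=(arr.strides[0], arr.strides[0]))
moving_avg = windows.sum(axis=1) / k

# reducebins(np.add, arr, np.arange(n - k + 1), k + np.arange(n - k + 1)) / k
# would compute the same thing, and also admits windows of varying length.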

Another use case - disambiguating between including the end or the start in the one-argument form.

For instance:

a = np.arange(10)
reducebins(np.add, a, start=[2, 4, 6]) == [2 + 3, 4 + 5, 6 + 7 + 8 + 9]  # what `reduceat` does
reducebins(np.add, a, stop=[2, 4, 6])  == [0 + 1, 2 + 3, 4 + 5]          # also useful

Another use case - disambiguating between including the end or the start in the one-argument form.

I don't quite understand this one. Can you include the input tensor here? Also: what would be the default values for start/stop?

Anyways, I'm not strongly against the separate arguments, but it's not as clean of a replacement. I would love to able to say "Don't use reduceat, use reducebins instead" but that's (slightly) harder when the interface looks different.

Actually, I just realised that even a start/stop option does not cover the use-case of empty slices, which is one that has been useful to me in the past: when my properties/labels correspond to rows in a CSR sparse matrix, and I use the values of indptr to do the reduction. With reduceat, I can ignore the empty rows. Any replacement will require additional bookkeeping. So, whatever replacement you come up with, please leave reduceat around.

In [2]: A = np.random.random((4000, 4000))
In [3]: B = sparse.csr_matrix((A > 0.8) * A)
In [9]: %timeit np.add.reduceat(B.data, B.indptr[:-1]) * (np.diff(B.indptr) > 1)
1000 loops, best of 3: 1.81 ms per loop
In [12]: %timeit B.sum(axis=1).A
100 loops, best of 3: 1.95 ms per loop
In [16]: %timeit np.maximum.reduceat(B.data, B.indptr[:-1]) * (np.diff(B.indptr) > 0)
1000 loops, best of 3: 1.8 ms per loop
In [20]: %timeit B.max(axis=1).A
100 loops, best of 3: 2.12 ms per loop
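
To make the pattern being timed above concrete, on a tiny CSR matrix (illustrative values only; B.data holds the nonzeros row by row and B.indptr marks where each row's data starts):

import numpy as np
from scipy import sparse

B = sparse.csr_matrix(np.array([[1., 0., 2.],
                                [0., 0., 0.],    # an empty row
                                [0., 3., 4.]]))

# B.data   -> [1., 2., 3., 4.]
# B.indptr -> [0, 2, 2, 4]     (row 1 is empty: indptr[1] == indptr[2])

row_sums = np.add.reduceat(B.data, B.indptr[:-1]) * (np.diff(B.indptr) > 0)
# -> array([3., 0., 7.]); the mask corrects the empty row, which reduceat
#    would otherwise fill with its corner-case value B.data[2]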

Incidentally, the empty sequence conundrum can be solved the same way that Python does it: by providing an initial value. This could be a scalar or an array of the same shape as indices.
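
That is how Python's own functools.reduce handles an empty sequence, for comparison:

import functools
import operator

functools.reduce(operator.add, [1, 2, 3], 0)   # 6
functools.reduce(operator.add, [], 0)          # 0 - the initial value
# functools.reduce(operator.add, [])  raises TypeError for an empty sequence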

Yes, I agree that the first focus needs to be on solving the empty slices case. In the case of start == end we can either have a way to set the output element to the identity, or to not modify the output element when an out array is specified. The problem with the current behaviour is that it is overwritten with irrelevant data.

I am fully with @shoyer about his last comment.

Let's simply define out=ufunc.reducebins(a, inds) as out[i]=ufunc.reduce(a[inds[i]:inds[i+1]]) for all i but the last, and deprecate reduceat.

Current use cases for starts and stops indices seem more naturally, and likely more efficiently, implemented with alternative functions like as_strided or convolutions.

@shoyer:

I don't quite understand this one. Can you include the input tensor here? Also: what would be the default values for start/stop?

Updated with the input. See the implementation of reducebins in the comment that started this for the default values. I've added a docstring there too. That implementation is feature-complete but slow (due to being python).

but that's (slightly) harder when the interface looks different.

When only the start argument is passed, the interface is identical (ignoring the identity case that we set out to fix in the first place). These three lines mean the same thing:

np.add.reduceat(arr, inds)
reducebins(np.add, arr, inds)
reducebins(np.add, arr, start=inds)

(the method/function distinction is not something I care too much about, and I can't define a new ufunc method as a prototype in python!)


@jni:

Actually, I just realised that even a start/stop option does not cover the use-case of empty slices, which is one that has been useful to me in the past

You're wrong, it does - in the exact same way as ufunc.reduceat already does. It's also possible simply by passing start[i] == end[i].

the empty sequence conundrum can be solved ... by providing an initial value.

Yes, we've already covered this, and ufunc.reduce already does that by filling with ufunc.identity. This is not hard to add to the existing ufunc.reduceat, especially if #8952 is merged. But as you said yourself, the current behaviour is _documented_, so we should probably not change it.


@divenex

Let's simply define out=ufunc.reducebins(a, inds) as out[i]=ufunc.reduce(a[inds[i]:inds[i+1]]) for all i but the last

So len(out) == len(inds) - 1? This is different to the current behaviour of reduceat, so @shoyer's argument about switching is stronger here


All: I've gone through earlier comments and removed quoted email replies, as they were making this discussion hard to read

@eric-wieser good point. In my above sentence I meant that for the last index the behaviour of reducebins would be different from the current reduceat. However, in that case, I am not sure what the value should be, as the last value formally does not make sense.

Ignoring compatibility concerns, the output of reducebins (in 1D) should have size inds.size-1, for the very same reason that np.diff(a) has size a.size-1 and np.histogram(a, bins) has size bins.size-1 . However this would go against the desire to have a drop-in replacement for reduceat.

I don't think there's a convincing argument that inds.size - 1 is the right answer - including index 0 and/or index n seems like pretty reasonable behaviour as well. All of them seem handy in some circumstances, but I think it is very important to have a drop-in replacement.

There's also another argument for stop/start hiding here - it allows you to build the diff-like behaviour if you want it, with very little cost, while still keeping the reduceat behaviour:

a = np.arange(10)
inds = [2, 4, 6]
reducebins(np.add, a, start=inds[:-1], stop=inds[1:])  # [2 + 3, 4 + 5]

# or less efficiently:
np.add.reduceat(a, inds)[:-1]
reducebins(np.add, a, start=inds)[:-1]
reducebins(np.add, a, stop=inds)[1:]

@eric-wieser I would be OK with required start and stop arguments, but I do not like making one of them optional. It is not obvious that providing only start means out[i] = func.reduce(arr[start[i]:start[i+1]]) rather than out[i] = func.reduce(arr[start[i]:]), which is what I would have guessed.

My preferred API for reducebins is like reduceat but without the confusing "exceptions" noted in the docstring. Namely, just:

For i in range(len(indices)), reduceat computes ufunc.reduce(a[indices[i]:indices[i+1]]), which becomes the i-th generalized “row” parallel to axis in the final result (i.e., in a 2-D array, for example, if axis = 0, it becomes the i-th row, but if axis = 1, it becomes the i-th column).

I could go either way on the third "exception" which requires non-negative indices (0 <= indices[i] <= a.shape[axis]), which I view as more of a sanity check rather than an exception. But possibly that one could go, too -- I can see how negative indices might be useful to someone, and it's not hard to do the math to normalize such indices.

Not automatically adding an index at the end does imply that the result should have length len(indices)-1, like the result of np.histogram.

@jni Can you please give an example of what you actually want to calculate from arrays found in sparse matrices? Preferably with a concrete example with non-random numbers, and self contained (without depending on scipy.sparse).

It is not obvious that providing only start means out[i] = func.reduce(arr[start[i]:start[i+1]]) rather than out[i] = func.reduce(arr[start[i]:]), which is what I would have guessed.

The reading I was going for is that "Each bin starts at these positions", with the implication that all bins are contiguous unless explicitly specified otherwise. Perhaps I should try and draft a more complete docstring. I think I can see a strong argument for forbidding passing neither argument, so I'll remove that from my proposed function.

which requires non-negative indices (0 <= indices[i] < a.shape[axis])

Note that there's also a bug here (#835) - the upper bound should be inclusive, since these are slices.

Note that there's also a bug here - the upper bound should be inclusive, since these are slices.

Fixed, thanks.

Not in the reduceat function itself, you haven't ;)

Turns out that doc/neps/groupby_additions.rst contains an (IMO inferior) proposal for a reduceby function.
