Numpy: first nonzero element (Trac #1673)

Created on 20 Oct 2012  ·  26 Comments  ·  Source: numpy/numpy

_Original ticket http://projects.scipy.org/numpy/ticket/1673 on 2010-11-13 by trac user tom3118, assigned to unknown._

The "numpy for matlab users" suggests using
nonzero(A)[0][0]
to find the index of the first nonzero element of array A.

The problem with this is that A might be a million elements long and the first element might be zero.

This is an extremely common operation. An efficient, built-in method for this would be very useful. It would also ease people's transition from Matlab, in which find is so common.
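For illustration (a sketch only; neither spelling is new numpy API), the difference between the documented idiom and a short-circuiting scan:

import numpy as np

A = np.zeros(1_000_000)
A[3] = 7.0

# Idiom from "numpy for matlab users": nonzero() scans the whole array and
# stores every nonzero index, even though only the first one is wanted.
idx_full_scan = np.nonzero(A)[0][0]

# Short-circuiting alternative in pure Python: stops at the first hit,
# but pays Python-level overhead per element.
idx_lazy = next(i for i, v in enumerate(A) if v != 0)

assert idx_full_scan == idx_lazy == 3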

01 - Enhancement Other

All 26 comments

_trac user tom3118 wrote on 2010-11-13_

A related use case is:
filter(test,A)[0]
In which either A is long or test is expensive.
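A lazy spelling of that use case (a sketch; on Python 3 filter already returns an iterator, so only as many elements as needed get tested):

import numpy as np

A = np.arange(10**6)
test = lambda v: v > 10          # stand-in for an expensive predicate

# Stops at the first element of A passing `test`; None if nothing matches.
first_match = next(filter(test, A), None)   # -> 11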

_@rgommers wrote on 2011-03-24_

It doesn't have to be only the first nonzero; the first occurrence of any value would be useful.

_@rgommers wrote on 2011-03-24_

As noted in #2333, the meaning is unambiguous for 1-D. For >1-D semantics are up for discussion.

Perhaps a keyword that determines the order of iteration over axes works. Or it can simply be undefined for >1-D.

_trac user lcampagn wrote on 2011-07-09_

I have seen many requests for a find_first in numpy, but most of these requests have subtly different (and incompatible) requirements such as "find first value less than x" or "find first nonzero value". I suggest the following function specification:

  ind = array.find(x, testOp='eq', arrayOp='all', axis=0, test=None)
  arguments:
    x       -> value to search for
    testOp  -> condition to test for ('eq', 'ne', 'gt', 'lt', 'ge', 'le')
    arrayOp -> method for joining multiple comparisons ('any' or 'all')
    axis    -> the axis over which to search
    test    -> for convenience, this may specify a function to call to perform
               the test. This is not expected to be efficient.
  returns: 
    first index where condition is true (or test returns true, if given)
    or None if the condition was never met

If the array has ndim > 1, then tests are performed using normal broadcasting rules.
So for example, if I have an array with shape (2,3), the following would be valid:

  ## find first row with all values=0
  array.find(0, testOp='eq', arrayOp='all', axis=0)
  ## equivalent to:
  for i in range(array.shape[axis]):
    if (array[i] == 0).all():
      return i

  ## find first column with any element greater than its corresponding element in col
  col = array([1,2])
  array.find(col, testOp='gt', arrayOp='any', axis=1)
  ## equivalent to:
  for i in range(array.shape[axis]):
    if (array[:,i] > col).any():
      return i
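For concreteness, a rough pure-Python reference of that proposed interface (a sketch only; find_first, its behaviour and the example values here are just one reading of the spec above, not an existing numpy API):

import operator
import numpy as np

# Mapping from the proposed testOp strings to comparison functions.
_TEST_OPS = {'eq': operator.eq, 'ne': operator.ne, 'gt': operator.gt,
             'lt': operator.lt, 'ge': operator.ge, 'le': operator.le}

def find_first(a, x, testOp='eq', arrayOp='all', axis=0, test=None):
    """First index along `axis` whose slice satisfies the condition,
    or None if the condition is never met (one reading of the spec above)."""
    a = np.asarray(a)
    compare = test if test is not None else (lambda s: _TEST_OPS[testOp](s, x))
    combine = np.any if arrayOp == 'any' else np.all
    for i in range(a.shape[axis]):
        s = np.take(a, i, axis=axis)      # the i-th slice along `axis`
        if combine(compare(s)):
            return i
    return None

# The two examples above, with illustrative data:
arr = np.array([[1, 0, 2], [0, 0, 0]])
col = np.array([1, 2])
find_first(arr, 0, testOp='eq', arrayOp='all', axis=0)    # -> 1 (second row is all zeros)
find_first(arr, col, testOp='gt', arrayOp='any', axis=1)  # -> 2 (third column exceeds col somewhere)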

As I needed this functionality the other day, I had a good look into this and was convinced that a C solution was needed to get a suitably quick result. However, a chunking approach written in Python has proved suitably quick, and much more flexible to boot, for my case.

import numpy as np
from itertools import chain


def find(a, predicate, chunk_size=1024):
    """
    Find the indices of array elements that match the predicate.

    Parameters
    ----------
    a : array_like
        Input data, must be 1D.

    predicate : function
        A function which operates on sections of the given array, returning
        element-wise True or False for each data value.

    chunk_size : integer
        The length of the chunks to use when searching for matching indices.
        For high probability predicates, a smaller number will make this
        function quicker, similarly choose a larger number for low
        probabilities.

    Returns
    -------
    index_generator : generator
        A generator of (indices, data value) tuples which make the predicate
        True.

    See Also
    --------
    where, nonzero

    Notes
    -----
    This function is best used for finding the first, or first few, data values
    which match the predicate.

    Examples
    --------
    >>> a = np.sin(np.linspace(0, np.pi, 200))
    >>> result = find(a, lambda arr: arr > 0.9)
    >>> next(result)
    ((71, ), 0.900479032457)
    >>> np.where(a > 0.9)[0][0]
    71


    """
    if a.ndim != 1:
        raise ValueError('The array must be 1D, not {}.'.format(a.ndim))

    i0 = 0
    chunk_inds = chain(range(chunk_size, a.size, chunk_size),
                       [None])

    for i1 in chunk_inds:
        chunk = a[i0:i1]
        for inds in zip(*predicate(chunk).nonzero()):
            yield (inds[0] + i0, ), chunk[inds]
        i0 = i1

In [1]: from np_utils import find

In [2]: import numpy as np

In [3]: import numpy.random    

In [4]: np.random.seed(1)

In [5]: a = np.random.randn(1e8)

In [6]: a.min(), a.max()
Out[6]: (-6.1194900990552776, 5.9632246301166321)

In [7]: next(find(a, lambda a: np.abs(a) > 6))
Out[7]: ((33105441,), -6.1194900990552776)

In [8]: (np.abs(a) > 6).nonzero()
Out[8]: (array([33105441]),)

In [9]: %timeit (np.abs(a) > 6).nonzero()
1 loops, best of 3: 1.51 s per loop

In [10]: %timeit next(find(a, lambda a: np.abs(a) > 6))
1 loops, best of 3: 912 ms per loop

In [11]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=100000))
1 loops, best of 3: 470 ms per loop

In [12]: %timeit next(find(a, lambda a: np.abs(a) > 6, chunk_size=1000000))
1 loops, best of 3: 483 ms per loop

I'll pop this on the dev-mailing list, but if there is sufficient interest I'd be happy enough to turn it into a PR.

Cheers,

I know this is 3 years late, but is this included in numpy now? Coming from a Matlab background, this function seems really important to me. A PR would be highly appreciated (not that I'm one of the developers).

I would also be interested in this.

Maybe it's obvious, but since it was not mentioned: np.all() and np.any() would probably be even easier (and unambiguous for dimension > 1) to make lazy. Currently...

In [2]: zz = np.zeros(shape=10000000)

In [3]: zz[0] = 1

In [4]: %timeit -r 1 -n 1 any(zz)
1 loop, best of 1: 3.52 µs per loop

In [5]: %timeit -r 1 -n 1 np.any(zz)
1 loop, best of 1: 16.7 ms per loop

(sorry, I had missed the ref to #3446 )

As I have been searching for an efficient solution to this problem for quite a while, and as there seem to be no concrete plans for supporting this feature, I have tried to come up with a solution that is not quite as complete and versatile as the API suggested above (notably, it supports only 1D arrays for the moment), but that has the advantage of being completely written in C and therefore seems rather efficient.

You find the source and the details here:

https://pypi.python.org/pypi?name=py_find_1st&:action=display

I would be grateful for any comments on the implementation, notably the question of the somewhat astonishing performance issue when passing in boolean arrays and searching for the first true value, which is described on that PyPI page.

I found this from a stackexchange post that's looking for this feature, which has been viewed over 70k times. @roebel did you ever get any feedback on this? Can you just put in a PR for the feature, which might get more attention?

No, I never got any feedback, but a few people apparently used the package without problems.
BTW, for Anaconda on Linux and macOS I have made an Anaconda installer:

https://anaconda.org/roebel/py_find_1st

With respect to a PR, I will have to look into the effort it would require to adapt this so that it can be merged easily into numpy. I won't have the time to fight through discussions about API changes and extensions.

Does the removal of "priority:normal" mean that this important feature will somehow be given less attention?

The priority is still "normal", just without a label. The issue needs a champion to actually make a PR and push it through the approval process including documentation and hopefully a benchmark.

Probably useful here to point to #8528, which is nominally about all_equal but can be seen as implementing parts of this. Indeed, in https://github.com/numpy/numpy/pull/8528#issuecomment-365358119, @ahaldane explicitly suggests implementing a first reduction method on all comparison operators instead of a new gufunc all_equal.

This also means there is quite a bit of an implementation waiting to be adapted (though it is not a trivial change from a gufunc to a new reduction method, and there is the question of whether we want a new method on all ufuncs, even those for which first makes little sense).

This issue has been known since (at least) 2012. Any update on a way to prevent nonzero(A)[0][0] from searching all of A?

Is it the so-called Pythonic way to always scan for all the elements?

@yunyoulu: It's the ufunc way. Let's take a step back and look at the general process of a multi-step computation in numpy, and the number of passes it takes:

  1. np.argwhere(x)[0] - performs 1 pass of the data
  2. np.argwhere(f(x))[0] - performs 2 passes of the data
  3. np.argwhere(f(g(x)))[0] - performs 3 passes of the data

One option would be to introduce a np.first function or similar - that would then look like the following, where k <= 1 varies depending on where the first element is:

  1. np.first(x)[0] - performs 0+k passes of the data
  2. np.first(f(x))[0] - performs 1+k passes of the data
  3. np.first(f(g(x)))[0] - performs 2+k passes of the data

The question to ask here is - is this saving really worth that much? Numpy is fundamentally not a lazy computing platform, and making the last step of a computation lazy isn't particularly valuable if all the previous steps were not.
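For illustration, a 1-D sketch of what such an np.first could look like (a hypothetical helper, not an existing numpy function). Note that first_true(f(x)) still pays one full pass to evaluate f(x); only the final scan short-circuits, which is the 1+k figure above:

import numpy as np

def first_true(mask, chunk=4096):
    """Index of the first True in a 1-D boolean array, reading only a
    fraction of it when the match is early; None if no True is found."""
    mask = np.asarray(mask).ravel()
    for start in range(0, mask.size, chunk):
        hits = np.flatnonzero(mask[start:start + chunk])
        if hits.size:
            return start + int(hits[0])
    return None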


Outdated

@eric-wieser

I don't think that's worded quite right. If k = 10 for some problem, it's not 1+10=11 passes of the data for np.first(f(x))[0]

(edited out by @eric-wieser for brevity, this conversation is already too long)

The use case where I most see the need for this functionality is when A is a large tensor with A.shape = (n_1, n_2, ..., n_m). In such a case, np.first(A) would require looking at only k elements of A instead of n_1*n_2*...*n_m (a potentially significant savings).

I most see the need for this functionality is when A is a large tensor

Presumably in this case you've already done at least one full pass of the data - so at best you're getting code that runs twice as fast.

The question to ask here is - is this saving really worth that much? Numpy is fundamentally not a lazy computing platform, and making the last step of a computation lazy isn't particularly valuable if all the previous steps were not.

That's an interesting point of view that, if established, could be used to justify trashing almost all efforts to improve computational performance "because we are computing something else as well and that is still slow". (It's the same argument used by deniers of climate change action - well, until that other country does something, doing something in our country won't help anyone.) I am not convinced at all. If there is any chance of speeding up some part of a computation by 1/k, with k potentially very very small, that is worth it in my opinion.

Also, when working interactively (Jupyter etc), very often you do the "passes" of the data in separate cells, so you could end up speeding up a whole cell as well.

np.first(f(x))[0] - performs 1+k passes of the data

@eric-wieser indeed when I looked at this issue in 2017 I was really hoping it would be the first step towards a sort of np.firstwhere(x, array_or_value_to_compare), which is indeed a specific case - but important in my experience - of f(x).

@toobaz: I assume you have f = lambda x: x == value_to_compare in that example.

This is exactly the reason that I am wary of going down this path at all (cc @bersbersbers). If you're not careful, we end up with (spellings speculative):

  1. np.first(x) - save a pass vs nonzero
  2. np.first_equal(x, v) - save a pass vs first(np.equal(x, v))
  3. np.first_square_equal(x*x, v) - save a pass vs first_equal(np.square(x), v)

It should be pretty obvious that this doesn't scale at all, and we have to draw the line somewhere. I am slightly in favor of 1 being allowed, but 2 being allowed is already an explosion of API surface area, and 3 seems very unwise to me.

One argument in favor of np.first - if we implement it, numba could special-case it such that np.first(x*x == v) _within a numba context_ actually _does_ do a single pass.
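For example, something along these lines already gives that fused, single-pass behaviour under numba today (a sketch; first_square_equal is an illustrative helper, not an existing numpy or numba API):

import numba
import numpy as np

@numba.njit
def first_square_equal(x, v):
    # The square, the comparison and the search are fused into one loop,
    # so x is read at most once and the loop exits at the first match.
    for i in range(x.size):
        if x[i] * x[i] == v:
            return i
    return -1

first_square_equal(np.arange(10**7), 25)   # -> 5, after touching only 6 elements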

Anyway, it's good to know that it's impossible to do lazy things in numpy, which clarifies the current status of the issue.

However, I don't feel comfortable when performance tweaks are considered only in scalability.

Let's ask a simple question: are personal computers scaling today? The answer is definitely NO. Three years ago, when you purchased a standard laptop, it came with 8 GB of memory; today you'll still find 8 GB in the market. However, every piece of software is using 2x or 4x more memory than it used to. At least workstations are not scaling in the same way clusters are.

Making a function 10x slower without changing its complexity at all is enough to drive a data scientist crazy. What's worse, there is nothing elegant they can do even if the bottleneck is identified through profiling.

What I'm trying to say is that having the ability to do lazy processing is always desirable and can be crucial to the responsiveness of the system as well as the productivity of the people using the language. Difficulty or workload in library development constitutes a very good excuse for not implementing these features, and is certainly understandable, but don't say they are not useful.

@toobaz: I assume you have f = lambda x: x == value_to_compare in that example.

Correct

This is exactly the reason that I am wary of going down this path at all (cc @bersbersbers). If you're not careful, we end up with (spellings speculative):

1. `np.first(x)` - save a pass vs nonzero

2. `np.first_equal(x, v)` - save a pass vs `first(np.equal(x, v))`

3. `np.first_square_equal(x*x, v)` - save a pass vs `first_equal(np.square(x), v)`

I understand your concern, but I would never ask for np.first_square_equal, just as I would never ask (and nobody, I hope, has asked) for np.square_where. And yes, I do see it means making a full pass of the data if you do 3. But v is created once, and I might need to look for many different values of x over it. For instance (coming back for simplicity to example 2.), I want to verify whether all my 30 possible categories appear in my 10^9 items array - and I strongly suspect they all appear among the first 10^3 elements.

So first let me clarify my previous comment: I would like np.firstwhere(x, array_or_value_to_compare) as a function which meets my intuition, but the computational issues I had back in 2017 would have been solved even just with np.first.

Second, the point is - I think - not just of running time of the single call. It is true that I need to make a full pass of the data anyway to do 2. and 3... but maybe I already did this pass when I initialized the data, and now I'm really looking for a way to speed up a frequent operation.

I see your point that np.first really deviates from the standard numpy approach, and I see it could be nontrivial to implement well... what I don't see is how it would "infect" the rest of the API, or grow a large API of its own.

This said, if instead you think it really is beyond numpy's scope, maybe there is scope for a small independent package instead.
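As a concrete sketch of the category check described above (only a chunked early-exit scan, with illustrative names and data; nothing here is numpy API):

import numpy as np

def all_categories_present(items, categories, chunk=1024):
    """True as soon as every category has been seen in `items`; scans
    chunk by chunk and stops early instead of reading all 10**9 items."""
    remaining = set(categories)
    for start in range(0, items.size, chunk):
        remaining.difference_update(np.unique(items[start:start + chunk]).tolist())
        if not remaining:
            return True
    return False

items = np.random.randint(0, 30, size=10**6)
all_categories_present(items, range(30))   # -> True, usually after a few chunks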

Hello Paul,

I made a small benchmark comparing your solution to np.flatnonzero and my py_find_1st extension.

You find the benchmark attached.

Here the results

(base) m3088.roebel: (test) (g:master)514> ./benchmark.py
utf1st.find_1st(rr, limit, utf1st.cmp_equal)::
runtime 0.131s
np.flatnonzero(rr==limit)[0]::
runtime 2.121s
next((ii for ii, vv in enumerate(rr) if vv == limit))::
runtime 1.612s

so while your proposed solution is about 25% faster than flatnonzero, as it does not require creating the results array, it is still ~12x slower than py_find_1st.find_1st.

Best
Axel

EDIT:
It appears the message I replied to by mail has disappeared, and the benchmark attached to my mail as well. The benchmark is here

https://github.com/roebel/py_find_1st/blob/master/test/benchmark.py

sorry for the noise.

On 15/05/2020 17:33, PK wrote:

How about `next(i for i, v in enumerate(x) if v)`?


