Numpy: unique and NaN entries (Trac #1514)

Created on 19 Oct 2012  ·  14 Comments  ·  Source: numpy/numpy

_Original ticket http://projects.scipy.org/numpy/ticket/1514 on 2010-06-18 by trac user rspringuel, assigned to unknown._

When unique operates on an array with multiple NaN entries, its output includes one NaN for each entry that was NaN in the original array.

Examples:
    a = random.randint(5, size=100).astype(float)

    a[12] = nan   # add a single nan entry
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN])

    a[20] = nan   # add a second
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN, NaN])

    a[13] = nan   # and a third
    unique(a)
    array([ 0., 1., 2., 3., 4., NaN, NaN, NaN])

This is probably due to the fact that x == y evaluates to False if both x and y are NaN. Unique needs to have "or (isnan(x) and isnan(y))" added to the conditional that checks for the presence of a value in the already identified values. I don't know where unique lives in numpy and couldn't find it when I went looking, so I can't make the change myself (or even be sure what the exact syntax of the conditional should be).
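As a quick illustration of the cause (a small sketch added here, not part of the original ticket): NaN never compares equal to anything, itself included, so the equality-based duplicate check lets every NaN through on the NumPy versions discussed in this thread:

    import numpy as np
    np.nan == np.nan                  # False: NaN is never equal to itself
    np.unique([1.0, np.nan, np.nan])
    # array([ 1., nan, nan]) on the affected versions (each NaN treated as distinct)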

Also, the following function can be used to patch over the behavior.

    def nanunique(x):
        a = numpy.unique(x)
        r = []
        for i in a:
            if i in r or (numpy.isnan(i) and numpy.any(numpy.isnan(r))):
                continue
            else:
                r.append(i)
        return numpy.array(r)

00 - Bug Other


All 14 comments

_trac user rspringuel wrote on 2010-06-18_

Shoot, forgot to use code blocks above. This only really affects the patch-over code so I'll just repost that:

    import numpy

    def nanunique(x):
        # like numpy.unique, but keeps only the first of the NaN entries
        a = numpy.unique(x)
        r = []
        for i in a:
            if i in r or (numpy.isnan(i) and numpy.any(numpy.isnan(r))):
                continue
            else:
                r.append(i)
        return numpy.array(r)
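For example (assuming the array a from the original report, which contains the values 0-4 plus several NaN entries), the workaround collapses the NaNs into a single entry:

    nanunique(a)
    # array([ 0.,  1.,  2.,  3.,  4.,  nan])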

Fixed.

I'm still seeing this issue with latest master. Which commit should have fixed it? Unless I'm missing something I'd suggest re-opening this issue.

This is easy to fix for floats, but I don't see an easy way out for complex or structured dtypes. Will put a quick PR together and we can discuss the options there.

@jaimefrio I have it fixed for unique using

    if issubclass(aux.dtype.type, np.inexact):
        # nans always compare unequal, so encode as integers
        tmp = aux.searchsorted(aux)
    else:
        tmp = aux
    flag = np.concatenate(([True], tmp[1:] != tmp[:-1]))

but it looks like all the other operations also have problems. Maybe we need nan_equal, nan_not_equal ufuncs, or maybe something in nanfunctions.
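For illustration (example values added here, not from the comment): searchsorted uses the same ordering as sort, which puts NaNs last and treats them all alike, so every NaN is encoded as the index of the first NaN, and the != test then flags only one of them:

    aux = np.array([0., 1., 1., np.nan, np.nan])     # already sorted
    tmp = aux.searchsorted(aux)                      # array([0, 1, 1, 3, 3]): both NaNs map to index 3
    flag = np.concatenate(([True], tmp[1:] != tmp[:-1]))
    aux[flag]                                        # array([ 0.,  1., nan])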

Searchsorting aux for itself is a smart trick! Although searchsorting _all_ of it is a little wasteful; ideally we would want to spot the first entry with a NaN, perhaps doing something along the following lines after creating aux and flag as right now:

if not aux[-1] == aux[-1]:
    nanidx = np.argmin(aux == aux)
    nanaux = aux[nanidx:].searchsorted(aux[nanidx:])
    flag[nanidx+1:] = nanaux[1:] != nanaux[:-1]

or something similar after correcting all of the off by one errors that I have likely introduced there.

This last approach of mine would work for float and complex types, but fail for structured dtypes with floating point fields. But I still think that the searchsorting trick, even though it would work for all types, is too wasteful. Some timings:

In [10]: a = np.random.randn(1000)

In [11]: %timeit np.unique(a)
10000 loops, best of 3: 69.5 us per loop

In [12]: b = np.sort(a)

In [13]: %timeit b.searchsorted(b)
10000 loops, best of 3: 28.1 us per loop

That's going to be a 40% performance hit, which may be OK for a nanunique function, but probably not for the general case.

2019 called, the OP's problem is still valid and the code is reproducible.

@jaimefrio why can't we have it as an option that is False by default?

I mean, this behaviour is confusing at best, and performance is not an excuse.

@Demetrio92 while I appreciate your attempt to get this issue moving, irony/sarcasm on the internet can be interpreted differently by different people, please keep it kind. For some of us, performance is very important and we don't casually add code that slows things down.

PR #5487 may be a better place to comment or make suggestions how to move forward.

Edit: fix PR number

This issue seems to have been open for 8 years, but I just want to chime in with a +1 for making the default behavior of numpy.unique correct rather than fast. This broke my code and I'm sure others have suffered or will suffer from it. We could add an optional "fast=False" argument and document the NaN behavior when fast mode is used. I'd be surprised if np.unique is very often the performance bottleneck in time-critical applications.

I ran into the same issue today. The core of the np.unique routine is computing a mask on an unravelled sorted array in numpy/lib/arraysetops.py to find when the values change in that sorted array:

    mask = np.empty(aux.shape, dtype=np.bool_)
    mask[:1] = True
    mask[1:] = aux[1:] != aux[:-1]

This could be replaced by something like the following, which is pretty much along the lines of jaimefrio's comment from about 5 years ago, but avoids the argmin call:

    mask = np.empty(aux.shape, dtype=np.bool_)
    mask[:1] = True
    if (aux.shape[0] > 0 and isinstance(aux[-1], (float, np.float16,
                                                  np.float32, np.float64))
        and np.isnan(aux[-1])):
        aux_firstnan = np.searchsorted(aux, np.nan, side='left')
        mask[1:aux_firstnan] = (aux[1:aux_firstnan] != aux[:aux_firstnan-1])
        mask[aux_firstnan] = True
        mask[aux_firstnan+1:] = False
    else:
        mask[1:] = aux[1:] != aux[:-1]
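As a small worked example of what the patched mask computes (values chosen here for illustration): for a sorted aux of [0., 1., 1., nan, nan], aux_firstnan is 3, so only the first NaN keeps a True in the mask:

    aux = np.array([0., 1., 1., np.nan, np.nan])   # sorted, NaNs at the end
    # aux_firstnan = np.searchsorted(aux, np.nan, side='left')  -> 3
    # mask      -> [True, True, False, True, False]
    # aux[mask] -> array([ 0.,  1., nan])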

Running a few %timeit experiments, I observed at most a <10% runtime penalty if the array is large and there are very few NaNs (say 10 NaN out of 1 million); for such large arrays it actually runs faster if there are lots of NaNs.

On the other hand, if the arrays are small (for example, 10 entries) there is a significant performance hit, because the check for float and NaN is relatively expensive and runtime can go up by a multiple. This applies even if there is no NaN at all, since it is the check itself that is slow.

If the array does have NaNs then it produces a different result, combining the NaNs, which is the point of it all. So for that case it's really a question of getting a desired result (all NaN combined into a single value group) slightly slower vs getting an undesired result (each NaN in its own value group) slightly faster.

Finally, note that this patch wouldn't fix finding unique values involving compound objects containing NaNs, such as in this example:

    a = np.array([[0, 1], [np.nan, 1], [np.nan, 1]])
    np.unique(a, axis=0)

which would still return

    array([[ 0.,  1.],
           [nan,  1.],
           [nan,  1.]])

"If the array does have NaNs then it produces a different result, combining the NaNs, which is the point of it all."

+1

A function that returns a list containing repeated elements, _e.g._ a list with more than one NaN, should not be called "unique". If repeated NaN elements are desired, that should be a special case that is disabled by default, for example numpy.unique(..., keep_NaN=False).

@ufmayer submit a PR!

+1
I would also support returning NaN only once
