_Original ticket http://projects.scipy.org/numpy/ticket/1514 on 2010-06-18 by trac user rspringuel, assigned to unknown._
When unique operates on an array with multiple NaN entries its return includes a NaN for each entry that was NaN in the original array.
Examples:
```python
a = random.randint(5, size=100).astype(float)
a[12] = nan  # add a single nan entry
unique(a)
array([ 0.,  1.,  2.,  3.,  4., NaN])
a[20] = nan  # add a second
unique(a)
array([ 0.,  1.,  2.,  3.,  4., NaN, NaN])
a[13] = nan  # and a third
unique(a)
array([ 0.,  1.,  2.,  3.,  4., NaN, NaN, NaN])
```
This is probably due to the fact that `x == y` evaluates to False when both x and y are NaN. `unique` needs "or (isnan(x) and isnan(y))" added to the conditional that checks whether a value is already among the identified values. I don't know where `unique` lives in NumPy and couldn't find it when I went looking, so I can't make the change myself (or even be sure of the exact syntax the conditional should have).
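The root cause is easy to see in isolation; per IEEE 754, NaN compares unequal to everything, including itself:

```python
import numpy as np

# IEEE 754: NaN is not equal to anything, not even itself,
# so an equality-based duplicate check never matches two NaNs.
x = float("nan")
print(x == x)             # False
print(np.nan == np.nan)   # False
print(np.isnan(x) and np.isnan(np.nan))  # True -- the kind of check unique would need
```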
Also, the following function can be used to patch over the behavior.
```python
import numpy

def nanunique(x):
    a = numpy.unique(x)
    r = []
    for i in a:
        if i in r or (numpy.isnan(i) and numpy.any(numpy.isnan(r))):
            continue
        else:
            r.append(i)
    return numpy.array(r)
```
_trac user rspringuel wrote on 2010-06-18_
Shoot, forgot to use code blocks above. This only really affects the patch-over code, so I'll just repost that:
```python
import numpy

def nanunique(x):
    a = numpy.unique(x)
    r = []
    for i in a:
        if i in r or (numpy.isnan(i) and numpy.any(numpy.isnan(r))):
            continue
        else:
            r.append(i)
    return numpy.array(r)
```
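A quick sanity check of the workaround (the sample array here is mine, not from the report):

```python
import numpy as np

def nanunique(x):
    # collapse the (possibly many) NaNs returned by np.unique into one
    a = np.unique(x)
    r = []
    for i in a:
        if i in r or (np.isnan(i) and np.any(np.isnan(r))):
            continue
        r.append(i)
    return np.array(r)

u = nanunique(np.array([3.0, np.nan, 1.0, np.nan, 3.0]))
print(u)  # [ 1.  3. nan] -- a single NaN regardless of how many went in
```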
Fixed.
I'm still seeing this issue with latest master. Which commit should have fixed it? Unless I'm missing something I'd suggest re-opening this issue.
This is easy to fix for floats, but I don't see an easy way out for complex or structured dtypes. Will put a quick PR together and we can discuss the options there.
@jaimefrio I have it fixed for `unique` using

```python
if issubclass(aux.dtype.type, np.inexact):
    # nans always compare unequal, so encode as integers
    tmp = aux.searchsorted(aux)
else:
    tmp = aux
flag = np.concatenate(([True], tmp[1:] != tmp[:-1]))
```
but it looks like all the other operations also have problems. Maybe we need `nan_equal` and `nan_not_equal` ufuncs, or maybe something in `nanfunctions`.
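The reason the trick works: `searchsorted` on a sorted array maps every element to the index of its first occurrence, and NumPy's sort order places all NaNs in one block at the end, so every NaN maps to the same integer:

```python
import numpy as np

aux = np.sort(np.array([2.0, np.nan, 1.0, np.nan, 2.0]))
# aux is [ 1.  2.  2. nan nan]; sorting moves every NaN to the end

tmp = aux.searchsorted(aux)  # index of the first occurrence of each value
print(tmp)                   # [0 1 1 3 3] -- both NaNs map to index 3

flag = np.concatenate(([True], tmp[1:] != tmp[:-1]))
print(aux[flag])             # [ 1.  2. nan] -- a single NaN survives
```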
Searchsorting `aux` for itself is a smart trick! Although searchsorting _all_ of it is a little wasteful; ideally we would want to spot the first entry with a NaN, perhaps something along the lines of the following, after creating `aux` and `flag` as right now:
```python
if not aux[-1] == aux[-1]:
    nanidx = np.argmin(aux == aux)
    nanaux = aux[nanidx:].searchsorted(aux[nanidx:])
    flag[nanidx+1:] = nanaux[1:] != nanaux[:-1]
```
or something similar after correcting all of the off by one errors that I have likely introduced there.
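For floats, everything from the first NaN onward is NaN, so the tail searchsort reduces to simply clearing the remaining flags. A worked version of the sketch (my own reconstruction, checked on one example; not the original author's code):

```python
import numpy as np

a = np.array([2.0, np.nan, 1.0, np.nan, 2.0, np.nan])
aux = np.sort(a)                                      # NaNs sort to the end
flag = np.concatenate(([True], aux[1:] != aux[:-1]))  # NaN != NaN keeps every NaN so far

if not aux[-1] == aux[-1]:          # last element is NaN
    nanidx = np.argmin(aux == aux)  # first position where aux == aux fails, i.e. first NaN
    flag[nanidx + 1:] = False       # keep the first NaN, drop the rest

print(aux[flag])  # [ 1.  2. nan]
```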
This last approach of mine would work for float and complex types, but fail for structured dtypes with floating point fields. But I still think that the searchsorting trick, even though it would work for all types, is too wasteful. Some timings:
```python
In [10]: a = np.random.randn(1000)

In [11]: %timeit np.unique(a)
10000 loops, best of 3: 69.5 us per loop

In [12]: b = np.sort(a)

In [13]: %timeit b.searchsorted(b)
10000 loops, best of 3: 28.1 us per loop
```
That's going to be a 40% performance hit, which may be OK for a `nanunique` function, but probably not for the general case.
2019 called, the OP problem is still valid and the code is reproducible.
@jaimefrio why can't we have it as an option that is False by default? I mean, this behaviour is confusing at best, and performance is not an excuse.
@Demetrio92 while I appreciate your attempt to get this issue moving, irony/sarcasm on the internet can be interpreted differently by different people, please keep it kind. For some of us, performance is very important and we don't casually add code that slows things down.
PR #5487 may be a better place to comment or make suggestions how to move forward.
Edit: fix PR number
This issue seems to have been open for 8 years, but I just want to chime in with a +1 for making the default behavior of `numpy.unique` correct rather than fast. This broke my code, and I'm sure others have suffered or will suffer from it. We could have an optional `fast=False` argument and document the NaN behavior in fast mode. I'd be surprised if `np.unique` is often the performance bottleneck in time-critical applications.
I ran into the same issue today. The core of the `np.unique` routine computes a mask on an unravelled sorted array in `numpy/lib/arraysetops.py` to find where the values change in that sorted array:
```python
mask = np.empty(aux.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = aux[1:] != aux[:-1]
```
This could be replaced by something like the following, which is pretty much along the lines of jaimefrio's comment from about 5 years ago, but avoids the argmin call:
```python
mask = np.empty(aux.shape, dtype=np.bool_)
mask[:1] = True
if (aux.shape[0] > 0 and isinstance(aux[-1], (float, np.float16,
                                              np.float32, np.float64))
        and np.isnan(aux[-1])):
    aux_firstnan = np.searchsorted(aux, np.nan, side='left')
    mask[1:aux_firstnan] = (aux[1:aux_firstnan] != aux[:aux_firstnan-1])
    mask[aux_firstnan] = True
    mask[aux_firstnan+1:] = False
else:
    mask[1:] = aux[1:] != aux[:-1]
```
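Applied to a small sorted array, the patched mask collapses the NaN block into a single entry. A standalone sketch of the snippet above (`np.issubdtype` stands in for the `isinstance` check, which is equivalent for float arrays; the sample data is mine):

```python
import numpy as np

aux = np.sort(np.array([1.0, np.nan, 0.0, np.nan, 1.0]))
# aux: [ 0.  1.  1. nan nan]

mask = np.empty(aux.shape, dtype=np.bool_)
mask[:1] = True
if aux.shape[0] > 0 and np.issubdtype(aux.dtype, np.floating) and np.isnan(aux[-1]):
    # NaNs sort to the end; one binary search finds where they start
    aux_firstnan = np.searchsorted(aux, np.nan, side='left')
    mask[1:aux_firstnan] = aux[1:aux_firstnan] != aux[:aux_firstnan - 1]
    mask[aux_firstnan] = True       # keep exactly one NaN
    mask[aux_firstnan + 1:] = False
else:
    mask[1:] = aux[1:] != aux[:-1]

print(aux[mask])  # [ 0.  1. nan]
```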
Running a few %timeit experiments I observed at most a <10% runtime penalty when the array is large and there are very few NaNs (say 10 NaN out of 1 million); for such large arrays it actually runs faster when there are lots of NaNs.
On the other hand, if the array is small (for example, 10 entries) there is a significant performance hit, because the check for float and NaN is relatively expensive, and runtime can increase severalfold. This applies even when there are no NaNs, as it's the check itself that is slow.
If the array does have NaNs then it produces a different result, combining the NaNs, which is the point of it all. So for that case it's really a question of getting a desired result (all NaN combined into a single value group) slightly slower vs getting an undesired result (each NaN in its own value group) slightly faster.
Finally, note that this patch wouldn't fix finding unique values involving compound objects containing NaNs, such as in this example:
```python
a = np.array([[0, 1], [np.nan, 1], [np.nan, 1]])
np.unique(a, axis=0)
```
which still would return
```python
array([[ 0.,  1.],
       [nan,  1.],
       [nan,  1.]])
```
> If the array does have NaNs then it produces a different result, combining the NaNs, which is the point of it all.
+1
A function that returns a list containing repeated elements, _e.g._ a list with more than one NaN, should not be called "unique". If repeated NaN elements are desired, that should be a special case that's disabled by default, for example `numpy.unique(..., keep_NaN=False)`.
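A minimal sketch of what such an opt-in flag could look like as a wrapper (`keep_NaN` is hypothetical; it is not an existing NumPy parameter):

```python
import numpy as np

def unique(x, keep_NaN=True):
    # keep_NaN=True defers to np.unique unchanged;
    # keep_NaN=False collapses any NaNs into a single entry.
    u = np.unique(x)
    if keep_NaN:
        return u
    nan_mask = np.isnan(u)
    if nan_mask.any():
        # NaNs sort to the end, so appending one NaN preserves sorted order
        u = np.concatenate((u[~nan_mask], [np.nan]))
    return u

print(unique(np.array([2.0, np.nan, 1.0, np.nan]), keep_NaN=False))
# [ 1.  2. nan]
```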
@ufmayer submit a PR!
+1
I would also support returning NaN only once