_Original ticket http://projects.scipy.org/numpy/ticket/2188 on 2012-07-23 by **@yarikoptic**, assigned to unknown._

tested against current master (present in 1.6.2 as well):

With the python2.x series it works ok, without puking:

```
$> python2.7 -c 'import numpy as np; print repr(repr(np.unique(np.array([1,2, None, "str"]))))'
'array([None, 1, 2, str], dtype=object)'
```

NB I will report a bug on **repr** here separately if not yet filed

it fails with python3.x altogether:

```
$> python3.2 -c 'import numpy as np; print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.2/dist-packages/numpy/lib/arraysetops.py", line 194, in unique
    ar.sort()
TypeError: unorderable types: int() > NoneType()
```

whereas IMHO it must operate correctly -- the semantics of unique() should not imply the ability to sort the elements

any fresh ideas on this issue?

The only options for implementing `unique` are:

- sorting the array
- putting everything in a hash table
- doing a brute-force `==` comparison of all objects against all objects

Only the sorting and hashing strategies have reasonable speed, and only the sorting and all-against-all strategies have reasonable memory overhead for large arrays. So I guess we could add fallback options to unique where if sorting doesn't work then it tries one of the other strategies? But OTOH it's not nice to have a function that sometimes suddenly takes massively more CPU or memory depending on what input you give it.

I guess I'd be +1 on a patch that adds a strategy={"sort", "hash", "bruteforce"} option to np.unique, so users with weird data can decide on what makes sense for their situation. If you want to write such a thing :-)
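To make the trade-offs concrete, here's a minimal sketch of what such an option could look like (the `unique_with_strategy` wrapper and its parameter names are made up for illustration, not NumPy API):

```
import numpy as np

def unique_with_strategy(a, strategy="sort"):
    # Hypothetical sketch of the proposed strategy= option -- not NumPy API.
    a = np.asarray(a).ravel()
    if strategy == "sort":
        return np.unique(a)            # current behaviour; needs orderable elements
    if strategy == "hash":
        seen, out = set(), []          # O(n), but needs hashable elements
        for x in a:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return np.array(out, dtype=a.dtype)
    if strategy == "bruteforce":
        out = []                       # O(n**2) == comparisons; no extra requirements
        for x in a:
            if not any(x == y for y in out):
                out.append(x)
        return np.array(out, dtype=a.dtype)
    raise ValueError("unknown strategy: %r" % strategy)

a = np.array([1, 2, None, "str", 1], dtype=object)
print(unique_with_strategy(a, "hash"))   # works where "sort" raises TypeError
```

The "hash" and "bruteforce" paths preserve first-occurrence order rather than returning sorted output, which is part of why they don't fit the current `np.unique` contract.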

at first I wondered if it could be sorting + a hash table for unsortable items (didn't check if `__cmp__` of the elements is used when sorting elements of a dtype=object array), so that sorting via `__cmp__` could position them in 'first-come-first-in-line' order?

but then realized that it doesn't provide relief in general for incomparable types, e.g. when it is a mix of int and str... so I wondered if for dtype=object it could be feasible to first deduce the participating dtypes and 'unique' (possibly via sort) within each dtype, possibly relying on hash tables for dtypes without `__cmp__`?

just for those who might need a workaround, here is how I do it via 'hashing' through the built in sets for my case:

```
$> python3.3 -c 'import numpy as np; print(np.array(list(set([1,2,"str", None])), dtype=object))'
[None 1 2 'str']
```

Not sure what you just said :-)

But on further thought sorting is really not reliable for dtype=object anyway. I've probably written dozens of classes that override `__eq__` but keep the default `__lt__`, which means that sort-based unique will just silently return the wrong answer. This is a pretty nasty bug I guess.

If the objects are hashable then you can just do set(arr) to get the unique elements, but there's no guarantee that they're hashable in general. (But at least everyone agrees that for hashable objects this should _work_, which is not true for sorting.) Maybe this would be a better default implementation of np.unique for object arrays.
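To make that failure mode concrete, here's a small made-up class (`Item` is hypothetical) that overrides `__eq__` but has a `__lt__` that ignores the value -- sort-based unique then silently keeps duplicates that hashing would collapse:

```
import numpy as np

class Item:
    # Hypothetical class: equality looks at .value, ordering looks at .tag only
    def __init__(self, value, tag):
        self.value, self.tag = value, tag
    def __eq__(self, other):
        return self.value == other.value
    def __lt__(self, other):
        return self.tag < other.tag
    def __hash__(self):
        return hash(self.value)

arr = np.array([Item(1, "a"), Item(2, "b"), Item(1, "c")], dtype=object)
# sorting by tag leaves the two equal Item(1, ...) objects non-adjacent,
# so sort-then-compare-neighbours unique reports 3 "unique" elements
print(len(np.unique(arr)))       # 3 -- silently wrong
print(len(set(arr.tolist())))    # 2 -- hash-based answer
```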


ok -- a crude description in Python:

```
def bucketed_unique(a):
    buckets = {}
    for x in a:
        t = type(x)
        if t not in buckets:
            buckets[t] = bucket = []
        else:
            bucket = buckets[t]
        bucket.append(x)
    out = []
    for bucket in buckets.itervalues():
        # here could actually be a set of conditions instead of a blind try/except
        try:
            out.append(np.unique(bucket))
        except:
            out.append(np.array(list(set(bucket)), dtype=object))
    return np.hstack(out)
```

```
print bucketed_unique([1, 2, 'str', None, np.nan, None, np.inf, int])
[1 2 'str' None <type 'int'> nan inf]
```

sure thing -- no 'bucketing' should be done for non-object ndarrays

That algorithm doesn't use == as its definition of uniqueness. Objects of different types can be ==. (Easy example: 1, 1.0). Its definition doesn't correspond to any standard python concept.
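A quick check of that point: `1`, `1.0`, and `1+0j` all compare (and hash) equal despite having different types, so a per-type bucketing pass counts three "unique" values where `set()` finds one:

```
# 1, 1.0 and (1+0j) are all == and share the same hash, yet differ in type
assert 1 == 1.0 == 1 + 0j
assert hash(1) == hash(1.0) == hash(1 + 0j)
assert len({1, 1.0, 1 + 0j}) == 1      # hashing collapses across types

# a per-type bucketing pass instead sees three distinct values
buckets = {}
for x in [1, 1.0, 1 + 0j]:
    buckets.setdefault(type(x), []).append(x)
assert len(buckets) == 3
```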


indeed! not sure but may be post-hoc analysis across buckets would make sense... btw atm problem reveals itself also for comparison with complex numbers:

```
$> python3.3 -c 'import numpy as np; print(np.unique(np.array([1, 1.0, 1+0j], dtype=object)))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 194, in unique
    ar.sort()
TypeError: unorderable types: float() > complex()
$> python -c 'import numpy as np; print(np.unique(np.array([1, 1.0, 1+0j], dtype=object)))'
[1]
```

although on second thought -- what should the dtype of the 'unique' values then be among all the available choices (int/float/complex)? with a non-object array it is clear... with a heterogeneous object array, not so -- maybe the different dtypes should even be maintained as such...

Here's the way I solved `argsort` blowing up on mixed int/str in py3: https://github.com/pydata/pandas/pull/6222/files

- order the ints before the strings in object dtypes
- use a hashtable to map the locations to get the indexer

reasonably fast I think

uses the pandas hashtable implementation but could easily be swapped out / adapted to c-code I think
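A toy version of that ordering rule, just to show the idea (this is not the pandas code; `mixed_argsort` and its type ranking are made up here):

```
import numpy as np

def mixed_argsort(a):
    # Toy "order the ints before the strings" argsort for object arrays
    # (made-up sketch, not the pandas implementation)
    def key(x):
        if x is None:
            return (2, 0)        # None sorts last
        if isinstance(x, str):
            return (1, x)        # then strings
        return (0, x)            # numbers first
    return np.array(sorted(range(len(a)), key=lambda i: key(a[i])))

a = np.array(["b", 2, None, 1, "a"], dtype=object)
idx = mixed_argsort(a)
print(a[idx])    # [1 2 'a' 'b' None]
```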

Anyone want to take a swing at this? Not sure what to do about record dtypes.

This one is really old. Is it still relevant?

seems to be the case at least with 1.15.4 in debian:

```
$> python3 --version
Python 3.6.5
$> PYTHONPATH=.. python3 -c 'import numpy as np; print(np.__version__); print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'
1.15.4
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 233, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 281, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'int'
```

Definitely still relevant. Just came across this, trying to call `np.unique(x, return_inverse=True)` on an object array.

Regarding the question of _how_ to make this work, when sorting is undefined: I'd much prefer a slow algorithm over the status quo of raising an error. (In my experience, oftentimes, if you need performant algorithms, you shouldn't be using an object array to begin with.)

I think this is a feature request, not a bug. The docs clearly state:

> Returns the _sorted_ unique elements of an array.

For the case of an array like `[1, None]`, no such sorted array exists in python 3, since sorting is no longer well-defined.

It would be nice to have an option to _not_ return a sorted array, it would allow some optimizations.
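For what it's worth, a first-occurrence-order unique is already expressible today without sorting, e.g. via `dict.fromkeys` (dicts are insertion-ordered as of Python 3.7), at the cost of requiring hashable elements:

```
import numpy as np

a = np.array([2, None, "str", 2, None], dtype=object)
# dict.fromkeys deduplicates while preserving first-occurrence order
u = np.array(list(dict.fromkeys(a.tolist())), dtype=object)
print(u)   # [2 None 'str']
```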


## Most helpful comment

Any updates on this? I encountered this bug when trying to use scikit-learn's LabelEncoder on pandas DataFrame columns with dtype "object" and missing values

_sobayed on 24 Oct 2016_