Numpy: python3: regression for unique on dtype=object arrays with varying items types (Trac #2188)

Created on 19 Oct 2012  ·  18 Comments  ·  Source: numpy/numpy

_Original ticket http://projects.scipy.org/numpy/ticket/2188 on 2012-07-23 by @yarikoptic, assigned to unknown._

tested against current master (present in 1.6.2 as well):

With the python2.x series it works ok, without puking:

$> python2.7 -c 'import numpy as np; print repr(repr(np.unique(np.array([1,2, None, "str"]))))' 
'array([None, 1, 2, str], dtype=object)'

NB I will report a bug on repr here separately if not yet filed

it fails with python3.x altogether:

$> python3.2 -c 'import numpy as np; print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.2/dist-packages/numpy/lib/arraysetops.py", line 194, in unique
    ar.sort()
TypeError: unorderable types: int() > NoneType()

whereas IMHO it must operate correctly -- the semantics of unique() should not imply the ability to sort the elements

00 - Bug numpy.core


All 18 comments

any fresh ideas on this issue?

The only options for implementing unique are:

  • sorting the array
  • putting everything in a hash table
  • doing a brute-force == comparison of every object against every other object

Only the sorting and hashing strategies have reasonable speed, and only the sorting and all-against-all strategies have reasonable memory overhead for large arrays. So I guess we could add fallback options to unique where if sorting doesn't work then it tries one of the other strategies? But OTOH it's not nice to have a function that sometimes suddenly takes massively more CPU or memory depending on what input you give it.

I guess I'd be +1 on a patch that adds a strategy={"sort", "hash", "bruteforce"} option to np.unique, so users with weird data can decide on what makes sense for their situation. If you want to write such a thing :-)
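A minimal sketch of what such a strategy option might look like (the `strategy` keyword and the function name are hypothetical, not an actual NumPy API):

```python
import numpy as np

def unique_with_strategy(a, strategy="sort"):
    # Hypothetical sketch: dispatch on the chosen uniqueness strategy.
    a = np.asarray(a, dtype=object).ravel()
    if strategy == "sort":
        return np.unique(a)                  # fast, but requires orderable elements
    if strategy == "hash":
        seen = set()                         # requires hashable elements
        out = [x for x in a if not (x in seen or seen.add(x))]
        return np.array(out, dtype=object)
    if strategy == "bruteforce":
        out = []
        for x in a:                          # O(n**2), only requires __eq__
            if not any(x == y for y in out):
                out.append(x)
        return np.array(out, dtype=object)
    raise ValueError("unknown strategy: %r" % (strategy,))
```

The "hash" and "bruteforce" paths return elements in first-seen order rather than sorted order, which is exactly the trade-off being discussed here.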

at first I wondered if it could be sorting + a hash table for unsortable items (I didn't check whether __cmp__ of the elements is used when sorting elements of a dtype=object array), so that sorting via __cmp__ could position them in 'first-come-first-in-line' order?
but then I realized that this doesn't provide relief in general for uncomparable types, e.g. when it is a mix of int and str... so I wondered whether, for dtype=object, it could be feasible to first deduce the participating dtypes and 'unique' (possibly via sort) within each dtype, possibly relying on hash tables for dtypes without __cmp__?

just for those who might need a workaround, here is how I do it via 'hashing' through the built-in set for my case:

$> python3.3 -c 'import numpy as np; print(np.array(list(set([1,2,"str", None])), dtype=object))' 
[None 1 2 'str']

Not sure what you just said :-)

But on further thought, sorting is really not reliable for dtype=object
anyway. I've probably written dozens of classes that override __eq__ but
keep the default __lt__, which means that sort-based unique will just
silently return the wrong answer. This is a pretty nasty bug I guess.

If the objects are hashable then you can just do set(arr) to get the unique
elements, but there's no guarantee that they're hashable in general. (But at
least everyone agrees that for hashable objects this should _work_, which is
not true for sorting.) Maybe this would be a better default implementation of
np.unique for object arrays.


ok -- a crude description in Python:

import numpy as np

def bucketed_unique(a):
    # group elements into per-type buckets so each bucket is internally comparable
    buckets = {}
    for x in a:
        bucket = buckets.setdefault(type(x), [])
        bucket.append(x)
    out = []
    for bucket in buckets.values():
        # here there could actually be a set of conditions instead of a blind try/except
        try:
            out.append(np.unique(bucket))
        except TypeError:
            out.append(np.array(list(set(bucket)), dtype=object))
    return np.hstack(out)

print(bucketed_unique([1, 2, 'str', None, np.nan, None, np.inf, int]))
# e.g. [1 2 'str' None inf nan <class 'int'>]  (bucket order follows dict insertion order)

sure thing -- no 'bucketing' should be done for non-object ndarrays

That algorithm doesn't use == as its definition of uniqueness. Objects of
different types can be == (easy example: 1 and 1.0). Its definition doesn't
correspond to any standard Python concept.

indeed! not sure, but maybe a post-hoc analysis across buckets would make sense... btw, atm the problem also reveals itself for comparisons with complex numbers:

$> python3.3 -c 'import numpy as np; print(np.unique(np.array([1, 1.0, 1+0j], dtype=object)))'  
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 194, in unique
    ar.sort()
TypeError: unorderable types: float() > complex()

$> python -c 'import numpy as np; print(np.unique(np.array([1, 1.0, 1+0j], dtype=object)))' 
[1]

although on second thought -- what should the dtype of the 'unique' value then be among all available choices (int/float/complex)? With a non-object array it is clear... with a heterogeneous object array, not so -- maybe the different dtypes should even be maintained as such...
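For illustration, Python's own equality and hashing already treat these three values as one: equal numbers hash equal across int/float/complex, so any hash-based unique would collapse them to whichever representative was inserted first:

```python
# Equal numbers compare and hash equal across numeric types,
# so a set keeps only the first-inserted representative.
assert 1 == 1.0 == 1 + 0j
assert hash(1) == hash(1.0) == hash(1 + 0j)

values = [1, 1.0, 1 + 0j]
print(set(values))   # a single element; its type is whichever was added first (here int)
```

This is exactly why the choice of "which dtype survives" is order-dependent for a hash-based strategy.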

Here's the way I solved argsort blowing up on mixed int/str in py3: https://github.com/pydata/pandas/pull/6222/files

  • order the ints before the strings in object dtypes
  • use a hashtable to map the locations to get the indexer
  • reasonably fast I think

uses pandas hashtable implementation but could easily be swapped out / adapted to c-code I think
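In pure Python, the hashtable part of that idea -- mapping each first-seen value to an integer code and building the indexer from it -- can be sketched like this (a simplified stand-in for the pandas hashtable code, not the actual implementation):

```python
import numpy as np

def factorize_object(a):
    # Map each first-seen value to an integer code (a toy factorize);
    # requires hashable elements.
    codes = {}
    indexer = np.empty(len(a), dtype=np.intp)
    uniques = []
    for i, x in enumerate(a):
        if x not in codes:
            codes[x] = len(uniques)
            uniques.append(x)
        indexer[i] = codes[x]
    return np.array(uniques, dtype=object), indexer

uniques, indexer = factorize_object([1, "b", None, 1, "b"])
# uniques -> [1 'b' None], indexer -> [0 1 2 0 1]
```

Since `uniques[indexer]` reconstructs the input, this also covers the `return_inverse=True` use case without ever sorting.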

Anyone want to take a swing at this? Not sure what to do about record dtypes.

Any updates on this? I encountered this bug when trying to use scikit-learn's LabelEncoder on pandas DataFrame columns with dtype "object" and missing values

This one is really old. Is it still relevant?

seems to be the case, at least with 1.15.4 in Debian:

$> python3 --version
Python 3.6.5

$> PYTHONPATH=.. python3 -c 'import numpy as np; print(np.__version__); print(repr(repr(np.unique(np.array([1,2,None, "str"])))))'                                                                                   
1.15.4
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 233, in unique
    ret = _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/lib/python3/dist-packages/numpy/lib/arraysetops.py", line 281, in _unique1d
    ar.sort()
TypeError: '<' not supported between instances of 'NoneType' and 'int'

Definitely still relevant. Just came across this, trying to call np.unique(x, return_inverse=True) on an object array.

Regarding the question of _how_ to make this work, when sorting is undefined: I'd much prefer a slow algorithm over the status quo of raising an error. (In my experience, oftentimes, if you need performant algorithms, you shouldn't be using an object array to begin with.)

I think this is a feature request, not a bug. The docs clearly state:

Returns the _sorted_ unique elements of an array.

For the case of an array like [1, None], no such sorted array exists in python 3 since sorting is no longer well-defined.

It would be nice to have an option to _not_ return a sorted array, it would allow some optimizations.
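One way such an unsorted unique could be implemented for hashable objects -- a sketch, not an existing NumPy option -- is via a dict, which preserves first-occurrence order on Python 3.7+:

```python
import numpy as np

def unique_unsorted(a):
    # dict.fromkeys keeps the first occurrence of each hashable element,
    # in insertion order (guaranteed on Python 3.7+); no comparisons needed.
    return np.array(list(dict.fromkeys(a)), dtype=object)

print(unique_unsorted([1, 2, None, "str", 2, None]))  # [1 2 None 'str']
```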
