Numpy: ndarray should derive from collections.abc.Sequence?

Created on 1 Dec 2012  ·  49 Comments  ·  Source: numpy/numpy

@juliantaylor raised this in a pandas issue.

The example from the ticket:

import numpy as np
import random
random.sample(np.array([1,2,3]),1)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user1/py33/lib/python3.3/random.py", line 298, in sample
    raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
TypeError: Population must be a sequence or set.  For dicts, use list(d).

This occurs on 3.3 with 1.7.0rc1.dev-3a52aa0, and on 3.2 with 1.6.2.
2.7 is unaffected of course.

The relevant code from cpython/Lib/random.py:297:

from collections.abc import Set as _Set, Sequence as _Sequence

def sample(self, population, k):
    # ...
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")

I couldn't find another location in the stdlib with a similar test, but lib2to3
does show an assumed equivalence:

lib2to3/fixes/fix_operator.py
5:operator.isSequenceType(obj)   -> isinstance(obj, collections.Sequence)

in 2.7

In [6]: operator.isSequenceType(np.array([1]))
Out[6]: True

but in 3.3/3.2

>>> isinstance(np.array([1]), collections.Sequence)
False


All 49 comments

Python 3.x simply performs a stricter check in random.sample than Python 2.x. On 2.x numpy is also not a Sequence subclass (ndarray has no index, count, or __reversed__ methods). So I think you can regard this as either misuse of random.sample or a backwards-compatibility break by Python 3.x.

Fair enough, but that said, we could easily _make_ ndarray a Sequence, since all of those methods make sense and would be straightforward to implement. In fact, just inheriting from it would be enough, since Sequence provides all the missing methods as mixins.
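
For illustration, a minimal sketch of the mixin behaviour being described, using a hypothetical Wrapped class (not numpy code): a type that defines only __getitem__ and __len__ and inherits from collections.abc.Sequence gets __contains__, __iter__, __reversed__, index, and count for free.

from collections.abc import Sequence

class Wrapped(Sequence):
    def __init__(self, data):
        self._data = list(data)
    def __getitem__(self, i):
        return self._data[i]
    def __len__(self):
        return len(self._data)

w = Wrapped([1, 2, 3, 2])
w.index(2)         # 1  -- provided by the Sequence mixin
w.count(2)         # 2
list(reversed(w))  # [2, 3, 2, 1]
2 in w             # True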

I would have said the same thing, if it weren't for the 2to3 example which makes this more subtle.
index and count are not needed, I believe: all of the abstract methods for Sequence are already implemented, and the others have default implementations.

The problem is that MutableSequence is actually more correct, and insert is not implemented.

Don't forget 0-d arrays, though. I am not certain what happens with them right now, but they are not really a Sequence, are they?

Hmm, Sequence is meant for immutable objects; some of the methods of MutableSequence (extend, pop) don't make sense. So yes, we could add those three methods, but it doesn't feel quite right.

It looks to me like there's really no reason for random.sample to require a Sequence instance.

__setitem__ is implemented, but not __delitem__, so maybe SomewhatMutableSequence?

It makes sense if you interpret the interface as "a sequence should have all of these, but some may
have terrible O()", which is why they offer naive implementations of some methods.
Surely it's not the semantics of insert or pop that are unclear in the context of an ndarray.
I'm not sure what the right thing is concerning iteration over a 0-d array, however.
One could argue that offering an __iter__ method that just raises a TypeError rather than
StopIteration is a violation of duck typing anyway.

edit: Though I'm sure that was not a careless decision.

ndarray could just be a Sequence, that does not mean it must be immutable.

Btw, the CPython list implementation does not support efficient insert, pop, and extend either, but it's still a MutableSequence.

ndarray cannot support in-place insert/pop/extend (ndarrays have a fixed size), so while it is a mutable sequence, it simply is not a Python MutableSequence (and never will be). It can and should support the Sequence interface though.

It would be nice if random.sample didn't check this, but on a closer look, it does have a plausible reason -- it has a number of different implementations for different types of input arguments, so it needs some way to distinguish between them. It can't just start indexing and hope for the best. Maybe we could file a bug and try and convince them to fall back on the sequence implementation by default for unrecognized types, but the earliest that could help is 3.4...

The point about 0-d arrays is a good one, though; 0-d arrays don't support the Sequence interface (they aren't even Iterable). But for Python purposes, it isn't too terrible if they lie about being Sequences and then wait for the actual access to raise an error -- duck typing means you can always fail if you want :-). It's probably possible somehow to make isinstance(a, Sequence) succeed for multidimensional arrays and fail for 0-d arrays; if we can make that happen then cool. But even if we can't the best thing to do is still probably to make ndarray's into Sequences.

Note that MaskedArray has a count method already, so adding one in ndarray which does something different will break that.

0-D arrays are better thought of as scalars with a few handy methods (at least that's how I think of them); besides not being iterable, they're also not indexable, which is even weirder if you think of them as arrays. So making 0-D arrays inconsistent in yet another way is not a big issue imho.
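
For reference, a quick illustration of the 0-d behaviour discussed above (outputs shown as comments; the exact error messages vary between numpy versions):

import numpy as np

a0 = np.array(5)   # a 0-d array
a0.ndim            # 0
len(a0)            # TypeError: len() of unsized object
iter(a0)           # TypeError: iteration over a 0-d array
a0[0]              # IndexError: too many indices for array
a0[()]             # 5 -- indexing with an empty tuple does work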

@njsmith where do you see multiple implementations? After the isinstance(Sequence) check I only see len(population) and then a conversion to a list. http://hg.python.org/cpython/file/22d891a2d533/Lib/random.py

Pandas Series and DataFrame types also have incompatible count methods, and an index attribute.

@rgommers: Hmm, you're right, I got misled by the error message and by thinking that it also accepted integers as a shorthand for range(), which it doesn't. Even so, they do want to define different behaviour for sets, sequences, and mappings. Maybe we can convince them that they should switch it to

if isinstance(population, _Set):
    population = tuple(population)
if isinstance(population, _Mapping):
    raise Blarrrgh()
# Otherwise assume that we have a sequence and hope

That's also a good point about the existing ndarray subclasses. It doesn't look like there's any way to say that ndarray is a Sequence but its subclasses aren't :-(. So which option is least bad, given that some of these subclasses cannot satisfy the Sequence interface without breaking compatibility?

  • Deprecate the incompatible usages in ndarray subclasses, and eventually remove them and replace them with Sequence-compatible versions. This seems doable for the count methods, but changing the name of Series.index would be hugely disruptive for the pandas folks. (DataFrame isn't a subclass of ndarray so technically it isn't relevant, except I guess the Series and DataFrame should be kept in sync.) I guess we can ask @wesm what he thinks but...
  • Go ahead and declare ndarray and its subclasses to satisfy the Sequence definition, and accept that for some ndarray subclasses this will be a lie. Only on the rarely used parts of the Sequence interface, though, and Python types are usually lies anyway...
  • Re-jigger our inheritance hierarchy to separate functionality and abstract-base-class nonsense. Do
class multidim_ndarray(ndarray, Sequence):
  pass

and make multi-dim arrays instances of this class instead. Subclasses aren't affected because they continue to inherit from ndarray, not multidim_ndarray. Of course, a single ndarray object can transition between 0-d and multidimensional via .resize()...

  • Accept that ndarray will never be a Sequence.

[The isSequenceType thing is a bit of a distraction. That's an old function that predates the existence of abstract base classes (which were added in 2.6), and doesn't even try to nail down the detailed interface required of sequences -- it just checks that your type (1) defines a __getitem__, (2) is not a built-in dict. Obviously this will give the wrong answer in many situations (e.g. anything that acts like a dict but isn't one!). So if one actually does want a sequence type, then isinstance(obj, Sequence) is going to do a better job, and 2to3 is doing the right thing. But it creates a problem for numpy...]

Maybe it would be possible to convince the Python folks to create a new class like SequenceBase that sits even below Sequence and does not guarantee .index or .count, but only .__len__ and .__getitem__ or such? It's all nice that Sequence has something like index, but it seems a bit weird to force it onto things like numpy by seemingly making it how sequence-like things should be duck typed. Are the Python folks aware that this is somewhat problematic?

I like the proposal of @seberg; if the Python devs disagree I'd go for the second bullet of @njsmith. A missing option is to just say that ndarrays don't satisfy the Sequence interface. Not optimal, but better than bullets 1 and 3 imho.

[Whoops, the "missing option" was there as option 4, just somehow the markdown parser decided to fold it into the previous bullet in a confusing and unreadable way. I've edited the comment to fix the formatting.]

Half of the types that are registered with Sequence, i.e. buffer and xrange, don't have these methods either. It's not clear to me that these are required methods of the interface so much as convenience methods for those who are using collections.Sequence as a base class/mixin.

@rkern: Good catch. So maybe the solution is just to add a call to Sequence.register(np.ndarray) somewhere. (This would also be a workaround for the original reporter.)
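
For anyone hitting the original random.sample failure today, a sketch of what that workaround looks like from user code (registration done at the application level; nothing in numpy itself changes):

import collections.abc
import random
import numpy as np

collections.abc.Sequence.register(np.ndarray)

isinstance(np.array([1, 2, 3]), collections.abc.Sequence)  # now True
random.sample(np.array([1, 2, 3]), 1)                      # passes the isinstance check

(As noted further down in this thread, global registration can confuse libraries that branch on isinstance(..., Sequence), so it is best treated as a local workaround.)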

We should probably implement __reversed__ at some point as well...

@rkern you're right, this is mentioned as an open issue in the PEP: http://www.python.org/dev/peps/pep-3119/#sequences. Odd that PEPs with status Final can even have open issues.

I think the title of this bug is a bit misleading, as the bug does not exist only under Python 3. Sure, random.sample(numpy_array) works in Python 2, but isinstance(np.array([1]), collections.Sequence) should return True in any Python >= 2.6.

I've just encountered this bug in Python 2.7 using the autopep8 module. By default, it converted some of the operator.isSequenceType() calls into isinstance(x, collections.Sequence). The test then returns False when I pass in a numpy.ndarray. This can be a very sneaky bug.

Just encountered it as well with Python 2.7, using the python-pillow module: Image.point(lut, mode) calls isinstance(lut, collections.Sequence), where the previous version used operator.isSequenceType().

Now might be a good time to revisit this, since the numpy numeric scalar types were registered (#4547).

So maybe the solution is just to add a call to Sequence.register(np.ndarray) somewhere.

Yep, that's a good compromise.

Yes, please simply add Sequence.register(np.ndarray) somewhere.

@mitar any interest in submitting a PR?

Sure. Where should this go? In the same file where np.ndarray is created?

Just to be sure we actually think this is a good idea: we are just now adding a deprecation for an empty array being False (#9718), i.e., we are removing one of the things that works for sequences. Though reading the comments, I think the conclusion already was that array scalars won't work, so I guess an empty array can be part of that broken promise...

For future reference, the proper place to do this would probably be in numpy.core.multiarray:
https://github.com/numpy/numpy/blob/4f1541e1cb68beb3049a21cbdec6e3d30c2afbbb/numpy/core/multiarray.py

OK, I want this. How? This is how I'd implement them as methods:

import collections.abc
import numpy as np

def __reversed__(self):
    return iter(self[::-1])

def index(self, value) -> int:
    # Like list.index: return the first matching position, ValueError if absent
    hits = np.in1d(self, value).nonzero()[0]
    if hits.size == 0:
        raise ValueError("value not in array")
    return int(hits[0])

def count(self, value) -> int:
    return int((self == value).sum())

# Necessary due to lack of __subclasshook__
collections.abc.Sequence.register(np.ndarray)

We discovered that with the latest version of TensorFlow (2.0), having Sequence.register(np.ndarray) makes TensorFlow misbehave. It seems it is checking somewhere whether a value is a sequence and then uses it differently than if it is an ndarray.

See: https://gitlab.com/datadrivendiscovery/d3m/issues/426

Hilarious. I’m pretty sure testing if something is an array is the better idea, because it’s almost always going to be the specially handled case.

Probably the order of type checks is wrong: it should first check for ndarray, then for sequence. But if you first check for sequence, that code path now runs instead.

@mitar We're considering closing this because __contains__ (the in operator) behaves differently (it matches elementwise across all dimensions, which it doesn't for sequences), so it breaks the API contract. Do you have a use-case for this?

Can you elaborate on the API contract you have in mind here. I do not get it exactly.

The use case is writing generic code which knows how to convert between things: if you can iterate over a sequence, you get it back dimension by dimension, and then recurse. Then I can convert a list of lists in the same way as a 2-d ndarray, but it generalizes to multiple dimensions and so on. And I do not have to check for more than just that it is a sequence.

As mentioned, there are a couple of issues with seeing arrays as nested Python sequences. __contains__ is the most obvious one; the other is that 0-D arrays are definitely not nested sequences. There are also subtleties, such as length-0 dimensions, and generally arr[0] = 0 does not mean that arr[0] == 0, since arr[0] can be an arbitrary array itself (which would be better spelled as arr[0, ...]). Personally, I think the "nested sequence" interpretation is nice, but less useful than we tend to think. (I.e. I rarely iterate an array as for col in array, and even if I do, I would not mind writing for col in array.iter(axis=0).)

So I tend to see "array is a sequence" as a slightly problematic analogy (which does not mean that it cannot be useful, I admit).
However, whatever the use case is, I am curious whether it would not be better to explore a new ABC, such as a new "ElementwiseContainer": one that also tells the user that + and ==, etc. will work on each element, and that, unlike for Python sequences, they should not expect + to concatenate (yes, + is not part of the Sequence ABC, but it feels natural in Python).
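
To make the __contains__ mismatch mentioned above concrete, a small illustration (results are from a typical numpy build):

import numpy as np

lst = [[1, 2], [3, 4]]
arr = np.array(lst)

[1, 5] in lst   # False -- list membership compares whole rows
[1, 5] in arr   # True  -- ndarray broadcasts == against every row and reduces with any()
3 in lst        # False -- 3 is not one of the rows
3 in arr        # True  -- membership is elementwise, across all dimensions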

Just passing by -
I wrote to Python-ideas last week because I noted that Python's collections.abc.Sequence does not implement __eq__ and other comparisons, even though it has all the other methods needed to implement those and make Sequence behave like lists and tuples. (That mail thread led me to this issue.)

I was proposing adding __eq__ there, but it would obviously make those sequences diverge from the behavior numpy arrays have.

What about formalizing more, in Python, what "Sequences" are, and then delegating the things that would diverge to specialized cases, to the point of adding a collections.abc.ComparableSequence there? (And since + for concatenation was mentioned above, maybe some other name that would imply "sequences whose comparisons result in a single bool, and which use + for concatenation and * by an integer for repetition", i.e. the Python behavior of + and * on list and tuple.) The spec for Sequence could then be formalized in a way that at least 1-D numpy arrays would match it exactly.

This formalization on what is a Python Sequence could also help with other divergences, like the one mentioned in https://github.com/numpy/numpy/issues/2776#issuecomment-330865166 above.

I am not feeling motivated enough to go down that road alone, though; but if this makes sense, I'd happily help write a PEP and push it through. (I just wanted to check why Sequence did not define an __eq__, and possibly have a PR for that, when I brought this up.)

@jsbueno my problem is that I don't really see what additional, or in between definition would actually be helpful for users of ndarray. The best I can think of is a Collection which has count() and index(), but is that useful? Anything else would be an ABC for things that Python itself has little or no concept of.

I think SymPy actually got it more right. It iterates through all elements of its matrices, which at least makes it a Collection.
Now, I doubt we can do much about that, and I am not even sure that the SymPy iteration of all elements is super useful (and intuitive), but at least iteration of all elements is consistent with __contains__. Note that this also means that len(Matrix) is the number of elements, and _not_ Matrix.shape[0]!

At the risk of repeating a lot from above: aside from 1-D arrays, what are numpy arrays?

  • Container: of elements ✓
  • Sized + Iterable of subarrays (if not 1-D) ?
  • Reversible: we could just implement that, no worries there ?
  • count() and index(): can be implemented for elements (✓)
  • Sequence: mismatch between subarray iterable and element container ✗

So even some of the most fundamental properties clash. NumPy could be a Container which knows how to do .index() and .count(), i.e. a Sequence but without the Iterable part, while independently being an Iterable, but of subarrays.
And if that seems like a confusing mess, then I agree, but I think that it is by design. The only true solution would be to either go the SymPy path or not be an Iterable to begin with. (We cannot go the SymPy path, and I doubt deprecating __iter__ has a chance.)

Personally, my expectation is that 1-D arrays aside, array-likes are simply very different beasts compared to Python Collections. When considering the iteration behaviour, you would need a MultidimensionalCollection to specifically signal the mismatch between __contains__ and __iter__ (but is that useful?).

When looking beyond what is currently defined by Sequence, I will restate that I think that the ElementwiseCollection (operators are elementwise operators rather than container operators, e.g. +) is the most defining characteristic of numpy arrays and all array-likes in general (see array programming). It is also a concept completely alien to – and sometimes at odds with – Python itself, though.

The only thing we could do would be to mark one-dimensional arrays, and only one-dimensional arrays, as sequences, since they do not have the subarray-vs-element mismatch. At which point, yes, __eq__ is of course still not defined for them, and __nonzero__ does not behave as it does for typical Python sequences.
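
A very rough sketch of the kind of ABC being floated here; the name ElementwiseCollection and its contents are invented for illustration only, nothing like it exists in Python or numpy:

from abc import abstractmethod
from collections.abc import Collection

class ElementwiseCollection(Collection):
    """Containers whose operators (==, +, ...) act on each element,
    rather than comparing or concatenating the container as a whole."""
    @abstractmethod
    def __eq__(self, other):
        """Return an elementwise result, not a single bool."""
    @abstractmethod
    def __add__(self, other):
        """Elementwise addition; + does not concatenate."""

# A class claiming this ABC would promise elementwise semantics,
# which is exactly what Sequence does not promise.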

Thank you for the response, and I apologise again for jumping onto the 8-year-long wagon here. With your comment a couple of hours after the last e-mail exchange, and chatting with another friend in the middle, I concur that most of these things are better left as they are. Sometime in the future Python can opt to have a more formal definition of Sequence than "whatever collections.abc.Sequence implements now".

I will just add, after reading your comments above, that I think the characteristics you listed as "what makes a Python Sequence" are missing the most important feature that makes ndarrays resemble sequences like lists and tuples for me: having a contiguous index space that can address all individual elements. But I don't think formalizing an ABC for that would be of any practical value, either in coding or in static type hinting.

@seberg That's a great synopsis.

This issue seems to be about using ndarray in contexts that expect Sequence or Iterable or Container. A simple approach would be to have members on ndarray that expose cheap views that promise and provide the appropriate interface and respond to isinstance checks. For example:

class ndarray(Generic[T]):
    def as_container(self) -> Container[T]:
        if self.ndim == 0:
            raise ValueError
        return ContainerView(self)  # correctly answers __len__, __iter__ etc.
    def as_subarray_iterable(self) -> Iterable[np.ndarray[T]]:
        if self.ndim <= 1:
            raise ValueError
        return SubarrayIterableView(self)
    def as_scalar_sequence(self) -> Sequence[T]:
        if self.ndim != 1:
            raise ValueError
        return ScalarView(self)
    def as_subarray_sequence(self) -> Sequence[np.ndarray[T]]:
        if self.ndim <= 1:
            raise ValueError
        return SubarraySequenceView(self)  # this view has to reinterpret __contains__ to do the expected thing.

Instead of ndarray promising to be everything to everyone, the user asks for what she needs, and if ndarray can provide it, it does so in the cheapest way possible. If it can't, it raises an exception. This simplifies user code by moving the ndim check the user should be doing (especially when using type annotations) into ndarray.
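
For concreteness, a minimal stand-in for one of the proposed accessors; ScalarView and as_scalar_sequence are invented names for illustration, and a real implementation would live on ndarray and avoid any wrapping cost:

from collections.abc import Sequence
import random
import numpy as np

class ScalarView(Sequence):
    """A thin Sequence view over a 1-D array (illustrative stand-in)."""
    def __init__(self, arr):
        if arr.ndim != 1:
            raise ValueError("expected a 1-D array")
        self._arr = arr
    def __getitem__(self, i):
        return self._arr[i]
    def __len__(self):
        return len(self._arr)

view = ScalarView(np.array([1, 2, 3]))
isinstance(view, Sequence)   # True -- the view answers isinstance checks
random.sample(view, 1)       # the original example from this issue now works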

Should that last annotation be Sequence and not Iterable?

@eric-wieser Yup! Thanks. What are your thoughts?

Well, as_subarray_sequence is practically list(arr) :)

@eric-wieser Yeah, I thought it would be cheaper to provide a view, but I have no idea.

Well, list(arr) just produces len(arr) views, which you'd end up producing anyway if you iterated.

I still think we are focusing too much on what can be done and not enough on what the problems are at this time. In particular, all of the methods you give above are very easy to implement if you know that you have an ndarray-like (I disagree that 0-D arrays are not containers). So they would only be useful if there was a standardized ABC for them, and in that case it would also be sufficient to define that basic indexing is numpy compatible and maybe include the .flat property.

The original issue (random.sample stopped working) seems fairly irrelevant given the time that has passed. Yes, it's mildly annoying, but possibly it is even for the better, since the user may expect either the subarrays or the elements to be chosen.

I am sure we do break some duck-typing code. Some problems probably occur with serialization (I don't have examples at hand). And much of such code will have no issue with using isinstance checks on ABCs, but hates to check for np.ndarray specifically. I do not see how adding methods to ndarray would help with that; we would need a new ABC, likely with little more than the .ndim property and possibly enshrining the nested-sequence style iteration.

Methods such as the above may be reasonable as a consumer protocol for working with any array-like, but is that the problem we are trying to solve :)? They do not seem like things typical Python sequences would want to expose.

@eric-wieser

You're right of course, but you may not iterate over the whole sequence. You might only pick out a few elements.

@seberg

I still think we are focusing too much about what can be done and not enough on what the problems are at this time

I agree with you. What kind of problems are you imagining? I'm imagining that when numpy 1.20 comes out with type annotations, I will sometimes want to use a one-dimensional numpy array as a sequence. If I want to do that currently, I need to do the following (sketched below):

  • check that it's one-dimensional, and
  • call cast to tell mypy that it's actually a sequence.
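
Roughly what those two steps look like today (a sketch; Sequence[int] is just an illustrative target type):

from typing import Sequence, cast
import numpy as np

def pick_first(seq: Sequence[int]) -> int:
    return seq[0]

arr = np.array([1, 2, 3])
assert arr.ndim == 1                    # step 1: runtime check that it is one-dimensional
pick_first(cast(Sequence[int], arr))    # step 2: cast so mypy accepts the array as a Sequence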

That's why I want to provide a method to automatically do that. I hate big interfaces too, but it seems to me like these kinds of methods or bare functions are going to be more and more prevalent as type annotations catch on. What do you think?

(I disagree that 0-D arrays are not containers).

I have no idea, but currently you are raising on __len__ for these, so it seems that they don't work like containers. I think it would be helpful for mypy to report an error if you pass a 0-D array to a function that accepts a container. It won't catch that if you make 0-D arrays containers.

we would need a new ABC, likely with little more than the .ndim property and possibly enshrining the nested-sequence style iteration.

I didn't want to add that to my suggestion, but I think that's where you're headed anyway. I'm an avid user of the wonderfully-designed JAX library. I imagine that in the future, numpy.ndarray and jax.numpy.ndarray (which has subclasses) will both inherit from some abstract NDArray of some sort. You could have a lot more than ndim. Ideally, it would be at least NDArray(Generic[T]), and maybe even have the shape or the number of dimensions too. It could have __eq__ returning NDArray[np.bool_]. You probably know better than me :)
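
A hedged sketch of what such an abstract interface might look like, written as a typing Protocol; NDArrayLike and its members are invented here and are not part of numpy or JAX:

from typing import Protocol, Tuple

class NDArrayLike(Protocol):
    """Structural type capturing what numpy arrays and other array-likes share."""
    @property
    def ndim(self) -> int: ...
    @property
    def shape(self) -> Tuple[int, ...]: ...
    @property
    def dtype(self): ...

def n_rows(a: NDArrayLike) -> int:
    # Works for any object exposing the attributes above
    return a.shape[0] if a.ndim > 0 else 1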

A few years ago, I searched for this issue in order to suggest that numpy.array should inherit from collections.Sequence, but now I find the arguments (especially yours!!) in this thread very convincing. Numpy arrays aren't really sequences, and shoehorning them seems like it will cause more harm than good. Why not just let them be their own thing, and force users to explicitly request the interface they want?

And many of such code will have no issue with using isinstance checks on ABCs,

Now that you mention that, maybe all of my proposed methods should have returned views. That way, they can correctly answer isinstance checks.

Methods such as the above may be reasonable as a consumer protocol to work with any array-like, but is that the problem we are trying to solve :)? They seem not like things typical Python sequences would want to expose.

I definitely agree that the answer to this depends on the problems we are trying to solve. Having drunk the type annotation kool aid, I'm interested in writing succinct numpy code that passes mypy without littering code with # type: ignore. Which problems do you have in mind?

Well, type hints and interop with other array-like objects are a good motivation probably. I might suggest opening a new issue or mailing list thread. Right now, I am not sure what best to think about here, typing is forming, so maybe that will end up clarifying some things.
