@juliantaylor raised this in a pandas issue.
The example from the ticket:
    import numpy as np
    import random
    random.sample(np.array([1,2,3]),1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/user1/py33/lib/python3.3/random.py", line 298, in sample
        raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
    TypeError: Population must be a sequence or set. For dicts, use list(d).
This occurs on Python 3.3 with NumPy 1.7.0rc1.dev-3a52aa0, and on Python 3.2 with NumPy 1.6.2.
Python 2.7 is unaffected, of course.
The relevant code from cpython/Lib/random.py:297

    from collections.abc import Set as _Set, Sequence as _Sequence

    def sample(self, population, k):
        # ...
        if not isinstance(population, _Sequence):
            raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
I couldn't grep another location in the stdlib with a similar test, but lib2to3
did show an assumed equivalence:
    lib2to3/fixes/fix_operator.py
    5:operator.isSequenceType(obj) -> isinstance(obj, collections.Sequence)

    In : operator.isSequenceType(np.array())
    Out: True

but in 3.3/3.2

    >>> isinstance(np.array(), collections.Sequence)
    False
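The failure does not depend on NumPy itself. A minimal sketch, using a hypothetical stand-in class (ArrayLike is an assumption for illustration, not part of any library) that supports len() and indexing but is not registered as a Sequence, reproduces the TypeError on Python 3 and shows the list() workaround:

```python
import random
from collections.abc import Sequence


class ArrayLike:
    """Hypothetical stand-in for ndarray: sized and indexable, but not a Sequence."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]


arr = ArrayLike([1, 2, 3])

# Sequence has no __subclasshook__, so structural similarity is not enough.
is_sequence = isinstance(arr, Sequence)

try:
    random.sample(arr, 1)
    sample_error = None
except TypeError as exc:
    sample_error = exc

# Workaround: copy into a real list before sampling.
picked = random.sample(list(arr), 1)
```

The same list(...) copy should work for a real array, at the cost of materialising every element as a Python object.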
Python 3.x simply performs a more strict check in random.sample than Python 2.x. On 2.x numpy is also not a Sequence subclass (ndarray has no index, count or __reversed__ methods). So I think you can regard this as either mis-use of random.sample or a backwards compatibility break by Python 3.x.
Fair enough, but that said, we could easily _make_ ndarray a Sequence,
since all of those methods make sense and would be straightforward to
implement. In fact just inheriting from it would be enough, since Sequence
provides all the missing methods as mixins.
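To illustrate the mixin point: a class that inherits from Sequence only has to supply __getitem__ and __len__, and index, count, __contains__, __iter__ and __reversed__ come for free. The Vector class below is a hypothetical example, not NumPy code, and whether plain inheritance is practical for a C-implemented type like ndarray is a separate question.

```python
from collections.abc import Sequence


class Vector(Sequence):
    """Hypothetical 1-D container; only the two abstract methods are hand-written."""

    def __init__(self, data):
        self._data = tuple(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]


v = Vector([10, 20, 20, 30])

# Everything below is provided by Sequence as a mixin method.
first_twenty = v.index(20)
twenties = v.count(20)
backwards = list(reversed(v))
has_thirty = 30 in v
```

The inherited methods are generic Python loops over __getitem__, so they are correct for a 1-D container but not fast.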
On 2 Dec 2012 13:30, "Ralf Gommers" [email protected] wrote:
> Python 3.x simply performs a more strict check in random.sample than
> Python 2.x. On 2.x numpy is also not a Sequence subclass (ndarray has no
> index, count or reversed methods). So I think you can regard this as
> either mis-use of random.sample or a backwards compatibility break by
> Python 3.x.
I would have said the same thing, if it weren't for the 2to3 example which makes this more subtle.
index and count are not needed, I believe. All the abstract methods for Sequence are implemented already; the others have default implementations.
The problem is that MutableSequence is actually more correct, and insert is not implemented.
Don't forget 0-d arrays though. I am not certain what happens with them right now, but they are not really a Sequence, are they?
Sequence is meant for immutable objects, some of the methods of MutableSequence (extend, pop) don't make sense. So yes, we could add those 3 methods but it doesn't feel quite right.
It looks to me like there's really no reason for random.sample to require a
__setitem__ is implemented, but not __delitem__, so maybe
It makes sense if you interpret the interface as "a sequence should have all these, but some may have terrible O()", which is why they offer naive implementations of some methods.
Surely it's not the semantics of pop that are unclear in the context of an ndarray.
I'm not sure what the right thing is concerning iteration over a 0-d array, however.
One could argue that offering an __iter__ method that just raises a TypeError rather than StopIteration is a violation of duck typing anyway.
edit: Though I'm sure that was not a careless decision.
ndarray could just be a Sequence, that does not mean it must be immutable.
Btw. the CPython list implementation does not support efficient insert, pop and extend either, but it's still a MutableSequence.
ndarray cannot support in-place insert/pop/extend (ndarrays have a fixed size), so while it is a mutable sequence, it simply is not a Python MutableSequence (and never will be). It can and should support the Sequence interface though.
It would be nice if random.sample didn't check this, but on a closer look, it does have a plausible reason -- it has a number of different implementations for different types of input arguments, so it needs some way to distinguish between them. It can't just start indexing and hope for the best. Maybe we could file a bug and try and convince them to fall back on the sequence implementation by default for unrecognized types, but the earliest that could help is 3.4...
The point about 0-d arrays is a good one, though; 0-d arrays don't support the Sequence interface (they aren't even Iterable). But for Python purposes, it isn't too terrible if they lie about being Sequences and then wait for the actual access to raise an error -- duck typing means you can always fail if you want :-). It's probably possible somehow to make isinstance(a, Sequence) succeed for multidimensional arrays and fail for 0-d arrays; if we can make that happen then cool. But even if we can't the best thing to do is still probably to make ndarrays into
Note that MaskedArray has a count method already, so adding one in ndarray which does something different will break that.
0-D arrays are better thought of as scalars with a few handy methods (at least that's how I think of them); besides not being iterable they're also not indexable which is even more weird if you think of them as arrays. So making 0-D arrays inconsistent in yet another way is not a big issue imho.
@njsmith where do you see multiple implementations? After the isinstance(Sequence) check I only see len(population) and then a conversion to a list. http://hg.python.org/cpython/file/22d891a2d533/Lib/random.py
Pandas Series and DataFrame types also have incompatible count methods, and an index attribute.
@rgommers: Hmm, you're right, I got misled by the error message, and thinking that it also accepted integers as a shorthand for range(), which it doesn't. Even so, they do want to define different behaviour for sets, sequences, and mappings. Maybe we can convince them that they should switch it to

    if isinstance(population, _Set):
        population = tuple(population)
    if isinstance(population, _Mapping):
        raise Blarrrgh()
    # Otherwise assume that we have a sequence and hope
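A runnable sketch of that proposed dispatch, written as a wrapper rather than a patch to the standard library. The names lenient_sample and ArrayLike are hypothetical, and the choice of TypeError for mappings is an assumption:

```python
import random
from collections.abc import Mapping, Set


def lenient_sample(population, k):
    """Hypothetical wrapper: copy sets, reject mappings, otherwise assume a sequence."""
    if isinstance(population, Set):
        population = tuple(population)
    if isinstance(population, Mapping):
        raise TypeError("Population must not be a mapping; use list(d).")
    # Otherwise assume len() and integer indexing work, and hope.
    indices = random.sample(range(len(population)), k)
    return [population[i] for i in indices]


class ArrayLike:
    """Stand-in for ndarray: sized and indexable, not registered as a Sequence."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]


sampled = lenient_sample(ArrayLike([1, 2, 3]), 2)
```

Sampling indices from range(len(population)) avoids any isinstance check on the population itself, so unrecognised types fail only if indexing actually fails.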
That's also a good point about the existing ndarray subclasses. It doesn't look like there's any way to say that ndarray is a Sequence but its subclasses aren't :-(. So which option is least bad, given that some of these subclasses cannot satisfy the Sequence interface without breaking compatibility?
- Sequence-compatible versions. This seems doable for the count methods, but changing the name of Series.index would be hugely disruptive for the pandas folks. (DataFrame isn't a subclass of ndarray so technically it isn't relevant, except I guess the Series and DataFrame should be kept in sync.) I guess we can ask @wesm what he thinks but...
- Sequence definition, and accept that for some ndarray subclasses this will be a lie. Only on the rarely used parts of the Sequence interface, though, and Python types are usually lies anyway...
- class multidim_ndarray(ndarray, Sequence): pass and make multi-dim arrays instances of this class instead. Subclasses aren't affected because they continue to inherit from multidim_ndarray. Of course, a single ndarray object can transition between 0-d and multidimensional via
- ndarray will never be a
isSequenceType thing is a bit of a distraction. That's an old function that predates the existence of abstract base classes (which were added in 2.6), and doesn't even try to nail down the detailed interface required of sequences -- it just checks that your type (1) defines a __getitem__, (2) is not a built-in dict. Obviously this will give the wrong answer in many situations (e.g. anything that acts like a dict but isn't one!). So if one actually does want a sequence type, then isinstance(obj, Sequence) is going to do a better job, and 2to3 is doing the right thing. But it creates a problem for numpy...]
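The difference between the two checks can be sketched in Python 3, where operator.isSequenceType no longer exists. The looks_like_sequence function below is only an approximation based on the two rules described above, not the original C implementation:

```python
from collections import UserDict
from collections.abc import Sequence


def looks_like_sequence(obj):
    """Rough approximation of the old check: has __getitem__ and is not a dict."""
    return hasattr(type(obj), "__getitem__") and not isinstance(obj, dict)


# UserDict acts like a dict but does not subclass dict.
mapping_like = UserDict({"a": 1})

old_style_list = looks_like_sequence([1, 2, 3])
old_style_mapping = looks_like_sequence(mapping_like)  # wrong answer
abc_list = isinstance([1, 2, 3], Sequence)
abc_mapping = isinstance(mapping_like, Sequence)
```

The loose check accepts the mapping, while the ABC check rejects it. The price of the stricter check is that any type which never registered with Sequence is rejected too.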
Maybe it would be possible to convince the python folks to create a new class like SequenceBase that is even below Sequence and does not guarantee .count, but only __getitem__ or such? It's all nice that Sequence has something like index, but it seems a bit weird to force it onto things like numpy by seemingly making it how sequence like things should be duck typed. Are the python folks aware that this is somewhat problematic?
I like the proposal of @seberg; if Python devs disagree I'd go for the second bullet of @njsmith. Missing option is to just say that ndarrays don't satisfy the Sequence interface. Not optimal, but better than bullets 1 and 3 imho.
[Whoops, the "missing option" was there as option 4, just somehow the markdown parser decided to fold it into the previous bullet in a confusing and unreadable way. I've edited the comment to fix the formatting.]
Half of the types that are registered with xrange, don't have these methods either. It's not clear to me that these are required methods of the interface so much as convenience methods for those who are using collections.Sequence as a base class/mixin.
@rkern: Good catch. So maybe the solution is just to add a call to Sequence.register(np.ndarray) somewhere. (This would also be a workaround for the original reporter.)
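What registration does and does not do can be sketched with a stand-in class (ArrayLike is hypothetical; with NumPy installed the equivalent call would be the Sequence.register(np.ndarray) mentioned above):

```python
import random
from collections.abc import Sequence


class ArrayLike:
    """Stand-in for ndarray: sized and indexable, with no index/count methods."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]


# Declare the class a virtual subclass of Sequence.
Sequence.register(ArrayLike)

arr = ArrayLike([1, 2, 3])
registered = isinstance(arr, Sequence)
picked = random.sample(arr, 2)

# Registration only affects isinstance/issubclass; no mixin methods are added.
has_index = hasattr(arr, "index")
```

So registration is enough to satisfy random.sample, but code that relies on index, count or __reversed__ being present would still fail.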
We should probably implement __reversed__ at some point as well...
@rkern you're right, this is mentioned as an open issue in the PEP: http://www.python.org/dev/peps/pep-3119/#sequences. Odd that PEPs with status Final can even have open issues.
I think the title of this bug is a bit misleading, as the bug does not exist only under python3. Sure, random.sample(numpy_array) works in python2, but isinstance(np.array(), collections.Sequence) should return True in any python >= 2.6.
I've just encountered this bug in Python 2.7 using the autopep8 module. By default, it converted some of the operator.isSequenceType() calls into isinstance(x, collections.Sequence). The test would become False when I pass in a numpy.ndarray. This can be a very sneaky bug.
Just encountered it as well with Python 2.7, using the python-pillow module. Image.point(lut, mode) calls isinstance(lut, collections.Sequence), the previous version used operator.isSequenceType()
Now might be a good time to revisit this since the numpy numeric scalar types were registered (#4547),
> So maybe the solution is just to add a call to Sequence.register(np.ndarray) somewhere.
Yep, that's a good compromise.
Yes, please simply add
@mitar any interest in submitting a PR?
Sure. Where should this go? In the same file where np.ndarray is created?
Just to be sure we actually think this is a good idea: we are just now adding a deprecation for an empty array being False (#9718), i.e., we are removing one of the things that works for sequences. Though reading the comments I think the conclusion already was that array scalars won't work, so I guess an empty array can be part of that broken promise...
For future reference, the proper place to do this would probably be in
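OK I want this. How? This would be how I’d implement them as methods:

    import collections.abc
    import numpy as np

    def __reversed__(self):
        return iter(self[::-1])

    def index(self, value) -> int:
        # Position of the first matching element (1-D case); ValueError if absent
        matches = np.flatnonzero(self == value)
        if matches.size == 0:
            raise ValueError("value is not in array")
        return int(matches[0])

    def count(self, value) -> int:
        return int((self == value).sum())

    # Necessary due to lack of __subclasshook__
    collections.abc.Sequence.register(np.ndarray)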
We discovered that with the latest version of Tensorflow (2.0), having Sequence.register(np.ndarray) makes Tensorflow misbehave. It seems it is checking somewhere if a value is a sequence and then uses it differently than if it is an ndarray.
Hilarious. I’m pretty sure testing if something is an array is the better idea, because it’s almost always going to be the specially handled case.
Probably the order of type checks is wrong, it should first check for ndarray, then for sequence. But if you first check for sequence, then now that code block runs.
@mitar We're considering closing this because the in operator (operator.contains) behaves differently (it recurses, and it doesn't for sequences), so it breaks the API contract. Do you have a use-case for this?
Can you elaborate on the API contract you have in mind here. I do not get it exactly.
The use case is writing generic code which knows how to convert between things, like if you can iterate over a sequence and get back dimension by dimension, and then recurse. Then I can convert list of lists in the same way as a 2d ndarray, but it can generalize to multiple dimensions and so on. And I do not have to check more than just that it is a sequence.
As mentioned there are a couple of issues with seeing arrays as nested python sequences. __contains__ is the most obvious one, the other is that 0-D arrays are definitely not nested sequences. Also subtleties, such as a length 0 dimension, exist, and generally arr[0] = 0 does not mean that arr[0] == 0, since arr[0] can be an arbitrary array itself (which would be better spelled as arr[0, ...]). Personally, I think the "nested sequence" interpretation is nice, but less useful than we tend to think. (I.e. I rarely iterate an array as for col in array and even if I do, I would not mind writing for col in array.iter(axis=0).)
So I tend to see the "array is a sequence" as a slightly problematic analogy (which does not mean that it cannot be useful, I admit).
However, whatever the use-case is, I am curious if it would not be better to explore a new ABC, such as a new "ElementwiseContainer". One that also tells the user that ==, etc. will work on each element, and that, unlike for Python sequences, they should not expect + to concatenate (yes + is not part of the Sequence ABC, but it feels natural in Python).
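The __contains__ mismatch mentioned above can be sketched without NumPy. Grid is a hypothetical stand-in that mimics the behaviour described in this thread: iteration yields rows, while membership looks at scalar elements. For a Sequence, x in s is expected to agree with comparing x against each item that iteration yields.

```python
class Grid:
    """Hypothetical 2-D container: iterates over rows, tests membership on elements."""

    def __init__(self, rows):
        self._rows = [list(row) for row in rows]

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        return self._rows[index]  # indexing yields a whole row

    def __iter__(self):
        return iter(self._rows)  # iteration yields rows as well

    def __contains__(self, value):
        # Membership recurses into the rows and looks at scalar elements.
        return any(value in row for row in self._rows)


grid = Grid([[1, 2], [3, 4]])

element_membership = 1 in grid
# What the Sequence contract implies: compare the value with each iterated item.
sequence_membership = any(item == 1 for item in grid)

row_membership = [1, 2] in grid
row_by_iteration = any(item == [1, 2] for item in grid)
```

The two notions of membership disagree in both directions: the scalar 1 is "in" the grid but is never produced by iteration, and the row [1, 2] is produced by iteration but is not "in" the grid.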
Just passing by -
I wrote to Python-ideas last week because I noted that Python's collections.abc.Sequence does not implement __eq__ and other comparisons, even though it has all the other methods needed to implement those and make Sequence behave like lists and tuples. (That mail thread led me to this issue.)
I was proposing adding __eq__ there, but it would obviously make those sequences diverge from the behavior Numpy.array has.
What about formalizing more, in Python, what "Sequences" are, and then delegating the things that would diverge as specialized cases, to the point of adding a collections.abc.ComparableSequence there? (And since the + for concatenation was mentioned above, maybe some other name that would imply "sequences whose comparison results in a single bool, and which behave as scalars for concatenation and multiplication by a scalar", i.e. the Python behavior for the * in lists and tuples.) Therefore, the specs on Sequence could be formalized in a way that at least 1D numpy arrays would match it exactly.
This formalization on what is a Python Sequence could also help with other divergences, like the one mentioned in https://github.com/numpy/numpy/issues/2776#issuecomment-330865166 above.
I am not feeling motivated enough for going down that road alone, though - but if this makes sense, I'd happily help writing a PEP and help pushing it through. (I just wanted to check why sequence did not create an __eq__, and possibly have a PR for that when I brought this up)
@jsbueno my problem is that I don't really see what additional, or in between definition would actually be helpful for users of ndarray. The best I can think of is a Collection which has index(), but is that useful? Anything else would be an ABC for things that Python itself has little or no concept of.
I think SymPy actually got it more right. It iterates through all elements of its matrices, which at least makes it a
Now, I doubt we can do much about that, and I am not even sure that the SymPy iteration of all elements is super useful (and intuitive), but at least iteration of all elements is consistent with __contains__. Note that this also means that len(Matrix) is the number of elements, and _not_
With the risk of repeating a lot from above, aside from 1-D arrays, what are numpy arrays?
- Container: of elements :heavy_check_mark:
- Iterable of subarrays (if not 1-D) :question:
- Reversible: we could just implement that, no worries there. :question:
- index(): can be implemented for elements (:heavy_check_mark:)
- Sequence: Mismatch between subarray iterable and element container :x:
So even some of the most fundamental properties clash. NumPy could be a Container which knows how to do .count(), i.e. a Sequence but without the Iterable part. While it is independently an Iterable, but of subarrays.
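How Python's own ABCs treat such partial conformance can be checked directly. In the sketch below, ElementBag is a hypothetical class defining only __contains__, __iter__ and __len__; the small ABCs recognise it structurally, while Reversible and Sequence do not:

```python
from collections.abc import Collection, Container, Iterable, Reversible, Sequence, Sized


class ElementBag:
    """Hypothetical container defining only __contains__, __iter__ and __len__."""

    def __init__(self, items):
        self._items = list(items)

    def __contains__(self, value):
        return value in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


bag = ElementBag([1, 2, 3])
abcs = (Container, Iterable, Sized, Collection, Reversible, Sequence)

# Map each ABC name to whether isinstance() accepts the object.
conformance = {abc.__name__: isinstance(bag, abc) for abc in abcs}
```

The structural checks only look for method names. They say nothing about whether __iter__ and __contains__ agree on what an "item" is, which is exactly the clash described above.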
And if that seems like a confusing mess, then I agree, but I think that it is by design. The only true solution would be to either go the SymPy path or just not be an Iterable to begin with. (we cannot go the SymPy path, and I doubt deprecating __iter__ has a chance.)
Personally, my expectation is that 1-D arrays aside, array-likes are simply very different beasts compared to Python Collections. When considering the iteration behaviour, you would need a MultidimensionalCollection to specifically signal the mismatch between __iter__ (but is that useful?).
When looking beyond what is currently defined by Sequence, I will restate that I think that the ElementwiseCollection (operators are elementwise operators rather than container operators, e.g. +) is the most defining characteristic of numpy arrays and all array-likes in general (see array programming). It is also a concept completely alien to – and sometimes at odds with – Python itself, though.
The only thing would be to mark one-dimensional arrays, and only one-dimensional arrays, as sequences, since they do not have the mismatch of subarray vs. element. At which point, yes, __eq__ is of course not defined for them, and __nonzero__ is not defined similar to typical python sequences.
Thank you for the response, and I apologise again for jumping in the 8 year long wagon here. With your comment, couple hours after the last e-mail exchange, and chatting with another friend in the middle, I concur most of these things are better left as they are. Sometime in the future Python can opt to have a more formal definition of Sequence than "whatever collections.abc.Sequence implements now".
I will just add, after reading your comments above, that I think that the characteristics you listed as "what makes a Python Sequence" is lacking the most important feature that makes ndarrays resemble sequences like lists and tuples for me: having a contiguous index-space that can address all individual elements. But I don't think formalizing an abc for that would be of any practical value, either in coding or in static-type hinting.
@seberg That's a great synopsis.
This issue seems to be about using ndarray in contexts that expect Container. A simple approach would be to have members on ndarray that expose cheap views that promise and provide the appropriate interface and respond to isinstance checks. For example:
    class ndarray(Generic[T]):
        def as_container(self) -> Container[T]:
            if self.ndim == 0:
                raise ValueError
            return ContainerView(self)  # correctly answers __len__, __iter__ etc.

        def as_subarray_iterable(self) -> Iterable[np.ndarray[T]]:
            if self.ndim <= 1:
                raise ValueError
            return SubarrayIterableView(self)

        def as_scalar_sequence(self) -> Sequence[T]:
            if self.ndim != 1:
                raise ValueError
            return ScalarView(self)

        def as_subarray_sequence(self) -> Sequence[np.ndarray[T]]:
            if self.ndim <= 1:
                raise ValueError
            return SubarraySequenceView(self)  # this view has to reinterpret __contains__ to do the expected thing.
Instead of ndarray promising to be everything to everyone, the user asks for what she needs, and if ndarray can provide it, it does so in the cheapest way possible. If it can't, it raises an exception. This simplifies user code by moving the ndim check the user should be doing (especially when using type annotations) into
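One of these views can be sketched as a thin wrapper. ScalarSequenceView and FakeArray below are hypothetical names for illustration; nothing like them exists in NumPy, and the sketch only assumes the wrapped object exposes ndim, len() and integer indexing:

```python
from collections.abc import Sequence


class ScalarSequenceView(Sequence):
    """Hypothetical read-only view presenting a 1-D array-like as a Sequence."""

    def __init__(self, array):
        # The ndim check lives here instead of in every caller.
        if getattr(array, "ndim", None) != 1:
            raise ValueError("only one-dimensional arrays can be viewed as a scalar sequence")
        self._array = array

    def __len__(self):
        return len(self._array)

    def __getitem__(self, index):
        return self._array[index]


class FakeArray:
    """Stand-in exposing just enough of the ndarray surface for the example."""

    def __init__(self, data, ndim):
        self._data = list(data)
        self.ndim = ndim

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]


view = ScalarSequenceView(FakeArray([5, 6, 7], ndim=1))
```

Because the view inherits from Sequence, it answers isinstance checks and gets index, count and __reversed__ as mixins, without the array type itself having to claim to be a Sequence.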
Should that last annotation be Sequence and not
@eric-wieser Yup! Thanks. What are your thoughts?
as_subarray_sequence is practically
@eric-wieser Yeah, I thought it would be cheaper to provide a view, but I have no idea.
list(arr) just produces len(arr) views, which you'd end up producing anyway if you iterated.
I still think we are focusing too much about what can be done and not enough on what the problems are at this time. In particular, all of the methods you give above are very easy to implement if you know that you have an ndarray-like (I disagree that 0-D arrays are not containers). So they would only be useful if there was a standardized ABC for them, and in that case it would also be sufficient to define that basic indexing is numpy compatible and maybe include the
The original issue (random.sample stopped working) seems fairly irrelevant due to passed time. Yes, it's mildly annoying, but possibly it is even for the better, since the user may either expect the subarrays or the elements to be chosen.
I am sure we do break some duck-typing code. Some problems probably occur with serialization (I don't have examples at hand). And many of such code will have no issue with using isinstance checks on ABCs, but hate to check for np.ndarray specifically. I do not see how adding methods to ndarray would help with that, we would need a new ABC, likely with little more than the .ndim property and possibly enshrining the nested-sequence style iteration.
Methods such as the above may be reasonable as a consumer protocol to work with any array-like, but is that the problem we are trying to solve :)? They seem not like things typical Python sequences would want to expose.
You're right of course, but you may not iterate over the whole sequence. You might only pick out a few elements.
> I still think we are focusing too much about what can be done and not enough on what the problems are at this time
I agree with you. What kind of problems are you imagining? I'm imagining that when numpy 1.10 comes out with types, I will sometimes want to use a one-dimensional numpy array as a sequence. If I want to do that currently, I need to:
- cast to tell mypy that it's actually a sequence.
That's why I want to provide a method to automatically do that. I hate big interfaces too, but it seems to me like these kinds of methods or bare functions are going to be more and more prevalent as type annotations catch on. What do you think?
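The manual route described above might look like the following sketch. FakeArray1D is a hypothetical stand-in for a one-dimensional array, and the explicit ndim check is an assumption about what the caller would do before the cast:

```python
from typing import Sequence, cast


def mean(values: Sequence[float]) -> float:
    """Example consumer that is annotated to accept any Sequence of floats."""
    return sum(values) / len(values)


class FakeArray1D:
    """Stand-in for a one-dimensional array; not a Sequence as far as typing knows."""

    ndim = 1

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]


arr = FakeArray1D([1.0, 2.0, 3.0])

# Step 1: check the dimensionality by hand.
if arr.ndim != 1:
    raise ValueError("expected a one-dimensional array")

# Step 2: cast() is a no-op at runtime; it only tells the type checker what to assume.
result = mean(cast(Sequence[float], arr))
```

The cast silences the type checker but verifies nothing, which is the motivation for wanting a method that performs the check and returns a properly typed object.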
> (I disagree that 0-D arrays are not containers).
I have no idea, but currently you are raising on __len__ for these, so it seems that they don't work like containers. I think it would be helpful for mypy to report an error if you pass a 0-D array to a function that accepts a container. It won't catch that if you make 0-D arrays containers.
> we would need a new ABC, likely with little more than the .ndim property and possibly enshrining the nested-sequence style iteration.
I didn't want to add that into my suggestion, but I think that's where you're headed anyway. I'm an avid user of the wonderfully-designed JAX library. I imagine that in the future, numpy.ndarray and jax.numpy.ndarray (which has subclasses) will both inherit from some abstract NDArray of some sort. You could have a lot more than ndim. Ideally, it would be NDArray(Generic[T]) at least, and maybe even have the shape or the number of dimensions too. It could have NDArray[np.bool_]. You probably know better than me :)
A few years ago, I searched for this issue in order to suggest that numpy.array should inherit from collections.Sequence, but now I find the arguments (especially yours!!) in this thread very convincing. Numpy arrays aren't really sequences, and shoehorning them seems like it will cause more harm than good. Why not just let them be their own thing, and force users to explicitly request the interface they want?
> And many of such code will have no issue with using isinstance checks on ABCs,
Now that you mention that, maybe all of my proposed methods should have returned views. That way, they can correctly answer isinstance checks.
> Methods such as the above may be reasonable as a consumer protocol to work with any array-like, but is that the problem we are trying to solve :)? They seem not like things typical Python sequences would want to expose.
I definitely agree that the answer to this depends on the problems we are trying to solve. Having drunk the type annotation kool aid, I'm interested in writing succinct numpy code that passes mypy without littering code with # type: ignore. Which problems do you have in mind?
Well, type hints and interop with other array-like objects are a good motivation probably. I might suggest opening a new issue or mailing list thread. Right now, I am not sure what best to think about here, typing is forming, so maybe that will end up clarifying some things.