Opening this issue after some discussion with **@shoyer**, **@pentschev**, and **@mrocklin** in https://github.com/dask/dask/issues/4883. AIUI this was discussed in NEP 22 (so I'm mainly parroting other people's ideas here to renew discussion and correct my own misunderstanding ;).

It would be useful for various downstream array libraries to have a function to ensure we have some duck array (like `ndarray`). This would be somewhat similar to `np.asanyarray`, but without the requirement of subclassing. It would allow libraries to return their own (duck) array type. If no suitable conversion is supported by the object, we could fall back to handling `ndarray` subclasses, `ndarray`s, and coercion of other things (nested lists) to `ndarray`s.

cc **@njsmith** (who coauthored NEP 22)

The proposed implementation would look something like the following:

```
import numpy as np

# hypothetical np.duckarray() function
def duckarray(array_like):
    if hasattr(array_like, '__duckarray__'):
        # return an object that can be substituted for np.ndarray
        return array_like.__duckarray__()
    return np.asarray(array_like)
```

Example usage:

```
class SparseArray:
    def __duckarray__(self):
        return self

    def __array__(self):
        raise TypeError

np.duckarray(SparseArray())  # returns a SparseArray object
np.array(SparseArray())      # raises TypeError
```

Here I've used `np.duckarray` and `__duckarray__` as placeholders, but we can probably do better for these names. See the Terminology section from NEP 22:

“Duck array” works fine as a placeholder for now, but it’s pretty jargony and may confuse new users, so we may want to pick something else for the actual API functions. Unfortunately, “array-like” is already taken for the concept of “anything that can be coerced into an array” (including e.g. list objects), and “anyarray” is already taken for the concept of “something that shares ndarray’s implementation, but has different semantics”, which is the opposite of a duck array (e.g., np.matrix is an “anyarray”, but is not a “duck array”). This is a classic bike-shed so for now we’re just using “duck array”. Some possible options though include: arrayish, pseudoarray, nominalarray, ersatzarray, arraymimic, …

Some other name ideas: `np.array_compatible()`, `np.array_api()`...

`np.array_compatible` could work, although I'm not sure I like it better than `duckarray`. `np.array_api` I don't like; it gives the wrong idea, imho.

Since after a long time we haven't come up with a better name, perhaps we should just bless the "duck array" name.

I like the compatible word, maybe we can think of variations along that line as well, e.g. `as_compatible_array` (somewhat implies that all compatible objects are arrays). The `as` is maybe annoying (partially because all `as` functions have no spaces). "duck" seems nice in libraries, but I think a bit strange for random people seeing it. So I think I dislike "duck" if and only if we want downstream users to use it a lot (i.e. even when I start writing a small tool for myself/a small lab).

Maybe `quack_array` :)

To extend a bit on the topic, there's one other case that isn't covered by `np.duckarray`, which is the creation of new arrays with a type based on an existing type, similar to what functions such as `np.empty_like` do. Currently we can do things like this:

```
>>> import numpy as np, cupy as cp
>>> a = cp.array([1, 2])
>>> b = np.ones_like(a)
>>> type(b)
<class 'cupy.core.core.ndarray'>
```

On the other hand, if we have an `array_like` that we would like to create a CuPy array from via NumPy's API, that's not possible. I think it would be helpful to have something like:

```
import numpy as np, cupy as cp
a = cp.array([1, 2])
b = [1, 2]
c = np.asarray(b, like=a)
```

Any ideas/suggestions on this?
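To make the idea concrete, here's a rough pure-Python sketch of how a `like=`-style dispatch could work. `FakeGpuArray`, `from_array_like`, and `asarray_like` are all made-up names standing in for a third-party array type and the hypothetical NumPy function:

```python
import numpy as np

# FakeGpuArray stands in for a third-party array type (e.g. a CuPy array).
class FakeGpuArray:
    def __init__(self, data):
        self.data = list(data)

    @classmethod
    def from_array_like(cls, array_like):
        # Coerce an arbitrary array_like into this library's array type.
        return cls(array_like)

# Hypothetical asarray(..., like=...): dispatch on the type of `like`.
def asarray_like(array_like, like=None):
    if like is not None and hasattr(type(like), 'from_array_like'):
        # Let the `like` array's library perform the coercion.
        return type(like).from_array_like(array_like)
    # Fall back to NumPy coercion.
    return np.asarray(array_like)

a = FakeGpuArray([1, 2])
print(type(asarray_like([1, 2], like=a)).__name__)  # FakeGpuArray
print(type(asarray_like([1, 2])).__name__)          # ndarray
```

The real mechanism would presumably go through `__array_function__` rather than an ad-hoc classmethod, but the shape of the dispatch is the same: the `like` argument's library gets the first chance to produce the result.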

Maybe `np.copy_like`? We would want to define carefully which properties (e.g., including dtype or not) are copied from the other array.


`np.copy_like` sounds good too. I agree, we most likely should have ways to control things such as `dtype`.

Sorry for the beginner's question, but should something like `np.copy_like` be an amendment to NEP 22, should it be discussed on the mailing list, or what would be the most appropriate approach to that?

We don't really have strict rules about this, but I would lean towards putting `np.copy_like` and `np.duckarray` (or whatever we call it) together into a new NEP on coercing/creating duck arrays, one that is prescriptive like NEP 18 rather than "Informational" like NEP 22. It doesn't need to be long; most of the motivation is already clear from referencing NEP 18/22.

One note about `np.copy_like()`: it should definitely do dispatching with `__array_function__` (or something like it), so operations like `np.copy_like(sparse_array, like=dask_array)` could be defined on either array type.
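A rough sketch of what that double dispatch could look like; the `__copy_like__` protocol name and the `DuckArray` class are invented here for illustration, not a real NumPy protocol:

```python
import numpy as np

class DuckArray:
    def __init__(self, data):
        self.data = list(data)

    def __copy_like__(self, src, like):
        # Handle the call whenever the target type is DuckArray.
        if isinstance(like, DuckArray):
            return DuckArray(getattr(src, 'data', src))
        return NotImplemented

def copy_like(src, like):
    # Give each argument a chance to handle the call, mirroring how
    # __array_function__ consults every overriding argument.
    for arg in (src, like):
        handler = getattr(type(arg), '__copy_like__', None)
        if handler is not None:
            result = handler(arg, src, like)
            if result is not NotImplemented:
                return result
    # Default: coerce to a NumPy array copy.
    return np.array(src)

print(type(copy_like([1, 2], like=DuckArray([0]))).__name__)  # DuckArray
```

Because both `src` and `like` are consulted, either library can implement the conversion, which is exactly what makes `np.copy_like(sparse_array, like=dask_array)` definable on either side.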

Great, thanks for the info, and I agree with your dispatching proposal. I will work on an NEP for the implementation of both `np.duckarray` and `np.copy_like` and submit a draft PR this week for that.

Awesome, thank you Peter!


My pleasure, and thanks a lot for the ideas and support with this work!

The `array_like` and `copy_like` functions would be a little odd to have in the main namespace I think, since we can't have a default implementation (at least not one that would do the right thing for cupy/dask/sparse/etc), right? They're only useful when overridden. Or am I missing a way to create arbitrary non-numpy array objects here?

It's true, these would only really be useful if you want to support duck typing. But certainly `np.duckarray` and `np.copy_like` would work even if the arguments are only NumPy arrays -- they would just be equivalent to `np.array`/`np.copy`.

All array implementations have a `copy` method, right? Using that instead of `copy_like` should work, so why add a new function?

`array_like` I can see the need for, but we may want to discuss where to put it.

`np.duckarray` does make sense to me.

I would lean towards putting np.copy_like and np.duckarray (or whatever we call it) together into a new NEP on coercing/creating duck arrays, one that is prescriptive like NEP 18 rather than "Informational" like NEP 22.

+1

array_like I can see the need for, but we may want to discuss where to put it.

That's actually the case which I would like to have addressed with something like `np.copy_like`. I haven't tested, but probably `np.copy` already dispatches correctly if the array is non-NumPy.

Just to be clear, are you referring also to a function `np.array_like`? I intentionally avoided such a name because I thought it could be confusing to all existing references to `array_like` arrays. However, I do now realize that `np.copy_like` may imply a necessary copy, and I think it would be good to have a behavior similar to `np.asarray`, where the copy only happens if it's not already a NumPy array. In the case discussed here, the best would be to make the copy only if `a` is not the same type as `b` in a call such as `np.copy_like(a, like=b)`.

I haven't tested, but probably `np.copy` already dispatches correctly if the array is non-NumPy.

It should, it's decorated to support `__array_function__`.

Just to be clear, are you referring also to a function `np.array_like`? I intentionally avoided such a name because I thought it could be confusing to all existing references to `array_like` arrays.

Yes. And yes, agree it can be confusing.

However, I do now realize that `np.copy_like` may imply a necessary copy,

Yes, that name implies a data copy.

may imply a necessary copy, and I think it would be good to have a behavior similar to `np.asarray`,

I thought that that was `np.duckarray`.

I think Peter's example above might help clarify this. Copied below and subbed in `np.copy_like` for simplicity.

```
import numpy as np, cupy as cp
a = cp.array([1, 2])
b = [1, 2]
c = np.copy_like(b, like=a)
```

I thought that that was np.duckarray.

Actually, `np.duckarray` will basically do nothing and just return the array itself (if overridden), else return `np.asarray` (leading to a NumPy array). We can't get a CuPy array from a Python list with it, for example. We still need a function that can be dispatched to CuPy (or any other `like=` array) for an `array_like`.

Thanks **@jakirkham** for the updated example.

`c = np.copy_like(b, like=a)`

So that will dispatch to CuPy via `a.__array_function__` and fail if that attribute doesn't exist (e.g. `a=<scipy.sparse matrix>` wouldn't work)? It feels like we need a new namespace or new interoperability utilities package for those kinds of things. Either that or leave it to a more full-featured future dispatching mechanism where one could simply do:

```
with cupy_backend:
    np.array(b)
```

Introducing new functions in the main namespace that don't make sense for NumPy itself to support, just to work around a limitation of `__array_function__`, seems a bit unhealthy...

So that will dispatch to CuPy via `a.__array_function__` and fail if that attribute doesn't exist (e.g. `a=<scipy.sparse matrix>` wouldn't work)?

I wouldn't say it has to fail necessarily. We could default to NumPy and raise a warning (or not raise one at all), for example.

It feels like we need a new namespace or new interoperability utilities package for those kind of things. Either that or leave it to a more full-featured future dispatching mechanism

Certainly it would be nice to have a full-featured dispatching mechanism, but I imagine this wasn't done before due to its complexity and backwards compatibility issues? I wasn't around when discussions happened, so just guessing.

Introducing new functions in the main namespace that don't make sense for NumPy itself to support working around a limitation of __array_function__ seems a bit unhealthy....

I certainly see your point, but I also think that if we move too many things away from main namespace, it could scare users off. Maybe I'm wrong and this is just an impression. Either way, I'm not at all proposing to implement functions that won't work with NumPy, but perhaps only not absolutely necessary when using NumPy by itself.

Introducing new functions in the main namespace that don't make sense for NumPy itself to support working around a limitation of __array_function__ seems a bit unhealthy....

Actually, in this sense, also `np.duckarray` wouldn't belong in the main namespace.

Actually, in this sense, also `np.duckarray` wouldn't belong in the main namespace.

I think that one is more defensible (analogous to `asarray`, and it would basically check "does this meet our definition of an ndarray-like duck type"), but yes. If we also want to expose `array_function_dispatch`, and we have things like `np.lib.mixins.NDArrayOperatorsMixin` and plan on writing more mixins, a sensible new submodule for all things interoperability-related could make sense.

Certainly it would be nice to have a full-featured dispatching mechanism, but I imagine this wasn't done before due to its complexity and backwards compatibility issues? I wasn't around when discussions happened, so just guessing.

I think there's multiple reasons. `__array_function__` is similar to things we already had, so it's easier to reason about. It has low overhead. It could be designed and implemented on a ~6 month timescale, and **@shoyer** made a strong case that we needed that. And we had no concrete alternative.

sensible new submodule for all things interoperability related could make sense.

No real objections from me, I think it's better to have functionality somewhere rather than nowhere. :)

I think there's multiple reasons. `__array_function__` is similar to things we already had, so it's easier to reason about. It has low overhead. It could be designed and implemented on a ~6 month timescale, and **@shoyer** made a strong case that we needed that. And we had no concrete alternative.

But if we want to leverage `__array_function__` more broadly, do we have other alternatives now to implementing things like `np.duckarray` and `np.copy_like` (or whatever else we would decide to call it)? I'm open to all alternatives, but right now I don't see any, other than going the full-featured dispatching way, which is likely going to take a long time and limit the scope of `__array_function__` tremendously (basically rendering it impractical for most of the more complex cases I've seen).

But if we want to leverage `__array_function__` more broadly, do we have other alternatives now to implementing things like `np.duckarray` and `np.copy_like` (or whatever else we would decide to call it)?

I think you indeed need a set of utility features like that, to go from covering some fraction of use cases to >80% of use cases. I don't think there's a way around that. I just don't like cluttering up the main namespace, so propose to find a better place for those.

I'm open to all alternatives, but right now I don't see any, other than going the full-featured dispatching way, which is likely going to take a long time and limit the scope of `__array_function__` tremendously (basically rendering it impractical for most of the more complex cases I've seen).

I mean, we're just plugging a few obvious holes here, right? We're never going to cover all of the "more complex cases". Say you want to override `np.errstate` or `np.dtype`: that's just not going to happen with the protocol-based approach.

As for alternatives, uarray is not yet there and I'm not convinced yet that the overhead will be pushed down low enough to be used by default in NumPy, but it's getting close and we're about to try it to create the `scipy.fft` backend system (WIP PR: https://github.com/scipy/scipy/pull/10383). If that does prove itself there, it should be considered as a complete multiple dispatch solution. And it already has a numpy API with Dask/Sparse/CuPy/PyTorch/XND backends, some of which are complete enough to be usable: https://github.com/Quansight-Labs/uarray/tree/master/unumpy

The dispatch approach with uarray is certainly interesting. Though I'm still concerned about how we handle meta-arrays (like Dask, xarray, etc.). Please see this comment for details. It's unclear this has been addressed (though please correct me if I've missed something). I'd be interested in working with others at SciPy to try and hash out how we solve this problem.

Please see this comment for details. It's unclear this has been addressed (though please correct me if I've missed something).

I think the changes of the last week resolve that, but not sure - let's leave that for another thread.

I'd be interested in working with others at SciPy to try and hash out how we solve this problem.

I'll be there, would be great to meet you in person.

Maybe `np.coerce_like()` or `np.cast_like()` would be better names than `copy_like`, so that it's clear that copies are not necessarily required. The desired functionality is indeed pretty similar to the `.astype()` method, except we want to convert array types as well as dtypes, and it should be a function rather than a method so it can be implemented by either argument.

The dispatch approach with uarray is certainly interesting. Though I'm still concerned about how we handle meta-arrays (like Dask, xarray, etc.).

`uarray` has support for multiple backends, so something like this should work:

```
with ua.set_backend(inner_array_backend), ua.set_backend(outer_array_backend):
    s = unumpy.sum(meta_array)
```

This could be done by having the meta-array call `ua.skip_backend` inside of its implementation, or if the meta-array's backend returns `NotImplemented` on type mismatch.

cc: **@hameerabbasi**

I'll expand on this: As a general rule, for `dask.array`, anything with `da` would be written without a `skip_backend`. Anything with NumPy would need a `skip_backend`.

Or for `da` you can always skip dispatch and call your own implementation directly and have `skip_backend(dask.array)` everywhere.

As for dispatching functions that don't have an array attached, like `ones` and `cast`, you would just set a backend and be done. Same for `np.errstate` and `np.dtype`. There's an example covering `np.ufunc` in `unumpy`.

As for the original issue, `uarray` provides the `__ua_convert__` protocol, which does exactly this. An alternative would be for backends to override `asarray` directly.

Thanks for the heads up on `uarray`, **@rgommers**, **@peterbell10**, **@hameerabbasi**.

But as I see it, you _must_ set the proper backend before launching computation, is that correct? One of the advantages of `__array_function__` is that libraries can be entirely agnostic of other libraries; Dask doesn't need to know of the existence of CuPy, for example.

**@pentschev** This was the case until recently, when we added the ability to “register” a backend, but we recommend only NumPy (or a reference implementation) does this. Then users using Dask would need just a single set_backend.

Got it, I guess this is what **@rgommers** mentioned in https://github.com/numpy/numpy/issues/13831#issuecomment-507432311, pointing to the backends in https://github.com/Quansight-Labs/uarray/tree/master/unumpy.

Sorry for so many questions, but what if some hypothetical application relies on various backends, for example, both NumPy and Sparse, where depending on the user input, maybe everything will be NumPy-only, Sparse-only, or a mix of both. **@peterbell10** mentioned multiple backends are supported https://github.com/numpy/numpy/issues/13831#issuecomment-507458331, but can the selection of backend be made automatic or would there be a need to handle the three cases separately?

So, for this case, you would ideally register NumPy, use a context manager for Sparse, and return `NotImplemented` from Sparse when appropriate, which would make the call fall back to NumPy.

At SciPy, **@rgommers**, **@danielballan**, and I talked about this issue. We concluded it would be valuable to proceed with adding `duckarray` (using that name). That said, it sounded like this would be slated for 1.18, though please correct me if I misunderstood things. Given this, would it be alright to start a PR?

We concluded it would be valuable to proceed with adding `duckarray` (using that name). That said, it sounded like this would be slated for 1.18. Though please correct me if I misunderstood things. Given this, would be alright to start a PR?

This all sounds great to me, but it would be good to start with a short NEP spelling out the exact proposal. See https://github.com/numpy/numpy/issues/13831#issuecomment-507334210

Sure that makes sense. 🙂

As for the copying point that has been brought up previously, I'm curious if this isn't solved through existing mechanisms. In particular what about these lines?

```
a2 = np.empty_like(a1)
a2[...] = a1[...]
```

Admittedly it would be nice to get this down to one line. Just curious whether this already works for that use case or if we are missing things.

We concluded it would be valuable to proceed with adding duckarray (using that name).

This all sounds great to me, but it would be good to start with a short NEP spelling out the exact proposal. See #13831 (comment)

I have already started to write that, haven't been able to complete it yet though (sorry for my bad planning https://github.com/numpy/numpy/issues/13831#issuecomment-507336302).

As for the copying point that has been brought up previously, I'm curious if this isn't solved through existing mechanisms. In particular what about these lines?

`a2 = np.empty_like(a1) a2[...] = a1[...]`

Admittedly it would be nice to get this down to one line. Just curious whether this already works for that use case or if we are missing things.

You can do that, but it may require special copying logic (such as in CuPy https://github.com/cupy/cupy/pull/2079).

That said, a copy function may be best, to avoid this sort of additional code being necessary.

On the other hand, this would be sort of a replacement for `asarray`. So I was wondering if, instead of some new `copy_like` function, we would instead want to revisit the idea suggested by NEP 18:

These will need their own protocols:

...

array and asarray, because they are explicitly intended for coercion to actual numpy.ndarray object.

If there's a chance we would like to revisit that, maybe it would be better to start a new thread. Any ideas, suggestions, objections?

Just to be clear on my comment above, I myself don't know if a new protocol is a great idea (there are probably many cumbersome details involved that I don't foresee); I'm really just wondering whether that's an idea we should revisit and discuss.

The consensus from the dev meeting and sprint at SciPy'19 was: let's get 1.17.0 out the door and get some real-world experience with it before taking any next steps.

really just wondering if that's an idea we should revisit and discuss.

probably yes, but in a few months.

probably yes, but in a few months.

Ok, thanks for the reply!

As for the copying point that has been brought up previously, I'm curious if this isn't solved through existing mechanisms. In particular what about these lines?

`a2 = np.empty_like(a1) a2[...] = a1[...]`

Admittedly it would be nice to get this down to one line. Just curious whether this already works for that use case or if we are missing things.

My main issue with this is that it wouldn't work for duck arrays that are immutable, which is not terribly uncommon. Also, for NumPy the additional cost of allocating an array and then filling it may be nearly zero, but I'm not sure that's true for all duck arrays.
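A toy illustration of the immutability problem; `ImmutableArray` is a made-up stand-in for a read-only duck array:

```python
import numpy as np

class ImmutableArray:
    def __init__(self, data):
        self._data = np.asarray(data)
        self._data.setflags(write=False)  # freeze the underlying buffer

    def __setitem__(self, key, value):
        raise TypeError("ImmutableArray does not support item assignment")

a1 = ImmutableArray([1, 2, 3])

# The empty_like-then-assign copy idiom breaks down on such a type,
# because the second step requires in-place writes:
try:
    a1[...] = [4, 5, 6]
except TypeError as e:
    print("copy-by-assignment failed:", e)
```

A dedicated copy function could let the library produce the new array in one step, with no intermediate mutable state.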

`a2 = np.empty_like(a1); a2[...] = a1[...]`

You can do that, but it may require special copying logic (such as in CuPy cupy/cupy#2079). That said, a copy function may be best, to avoid this sort of additional code being necessary.

On the other hand, this would be sort of a replacement for `asarray`. So I was wondering if, instead of some new `copy_like` function, we would instead want to revisit the idea suggested by NEP 18: "array and asarray, because they are explicitly intended for coercion to actual numpy.ndarray object." If there's a chance we would like to revisit that, maybe it would be better to start a new thread. Any ideas, suggestions, objections?

I don't think it's a good idea to change the behavior of `np.array` or `np.asarray` with a new protocol. Their established meaning is to cast to NumPy arrays, which is basically why we need `np.duckarray`.

That said, we could consider adding a `like` argument to `duckarray`. That would require changing the protocol from the simplified proposal above -- maybe to use `__array_function__` instead of a dedicated protocol like `__duckarray__`? I haven't really thought this through.

`a2 = np.empty_like(a1) a2[...] = a1[...]`

My main issue with this is that it wouldn't work for duck arrays that are immutable, which is not terribly uncommon. Also, for NumPy the additional cost of allocating an array and then filling it may be nearly zero, but I'm not sure that's true for all duck arrays.

That's fair. Actually we can already simplify things. For instance this works with CuPy and Sparse today.

```
a2 = np.copy(a1)
```

That's fair. Actually we can already simplify things. For instance this works with CuPy and Sparse today.

`a2 = np.copy(a1)`

Yes, but we also want "copy this duck-array into the type of this other duck-array"

I don't think it's a good idea to change the behavior of `np.array` or `np.asarray` with a new protocol. Their established meaning is to cast to NumPy arrays, which is basically why we need `np.duckarray`.

I'm also unsure about this, and I was reluctant even to raise this question, this is why I hadn't until today.

That said, we could consider adding a like argument to duckarray. That would require changing the protocol from the simplified proposal above -- maybe to use __array_function__ instead of a dedicated protocol like __duckarray__? I haven't really thought this through.

I don't know if there would be any complications with that, we probably need some careful thought, but I tend to like this idea. It would seem redundant on various levels, but maybe, to follow the existing pattern, instead of adding a `like` parameter we could have `duckarray` and `duckarray_like`?

Yes, but we also want "copy this duck-array into the type of this other duck-array"

What about basing this around `np.copyto`?

What about basing this around `np.copyto`?

Feel free to correct me if I'm wrong, but I'm assuming you mean something like:

```
np.copyto(cupy_array, numpy_array)
```

That could work, assuming NumPy is willing to change the current behavior. E.g., `asarray` always implies the destination is a NumPy array; does `copyto` make the same assumption?

`np.copyto` already supports dispatching with `__array_function__`, but it's roughly equivalent to:

```
def copyto(dst, src):
    dst[...] = src
```

We want the equivalent of:

```
def copylike(src, like):
    dst = np.empty_like(like)
    dst[...] = src
    return dst
```

`np.copyto` already supports dispatching with `__array_function__`, but it's roughly equivalent to `def copyto(dst, src): dst[...] = src`. We want the equivalent of `def copylike(src, like): dst = np.empty_like(like); dst[...] = src; return dst`.

Correct, this is what we want. `copyto` gets dispatched and works if source and destination have the same type; we need something that allows dispatching to the destination array's library.

Well, `copyto` could still make sense depending on how we think of it. Take for example the following use case.

```
np.copyto(cp.ndarray, np.random.random((3,)))
```

This could translate into something like "allocate and copy over the data", as we have discussed. If we dispatch around `dst` (`cp.ndarray` in this case), then libraries with immutable arrays could implement this in a suitable manner as well. It also saves us from adding a new API (that NumPy merely provides, but doesn't use), which seemed to be a concern.

Just to surface another thought that occurred to me recently: it's worth thinking about what these APIs will mean downstream between other libraries (for instance how Dask and Xarray interact).
