It would be nice to have an alternative to numpy.random.shuffle that accepts an axis argument, and that independently shuffles the one-dimensional slices. Here's an implementation that I'll call disarrange. It works, but it would be nice to have a more efficient C implementation.
import numpy as np

def disarrange(a, axis=-1):
    """
    Shuffle `a` in-place along the given axis.

    Apply numpy.random.shuffle to the given axis of `a`.
    Each one-dimensional slice is shuffled independently.
    """
    b = a.swapaxes(axis, -1)
    # Shuffle `b` in-place along the last axis.  `b` is a view of `a`,
    # so `a` is shuffled in place, too.
    shp = b.shape[:-1]
    for ndx in np.ndindex(shp):
        np.random.shuffle(b[ndx])
    return
Example:
In [156]: a = np.arange(20).reshape(4,5)
In [157]: a
Out[157]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
In [158]: disarrange(a, axis=-1)
In [159]: a
Out[159]:
array([[ 2, 0, 4, 3, 1],
[ 8, 6, 7, 9, 5],
[11, 14, 13, 10, 12],
[19, 18, 16, 17, 15]])
In [160]: a = np.arange(20).reshape(4,5)
In [161]: disarrange(a, axis=0)
In [162]: a
Out[162]:
array([[ 5, 11, 7, 13, 14],
[ 0, 6, 2, 3, 4],
[10, 1, 17, 18, 19],
[15, 16, 12, 8, 9]])
This request was motivated by this question on stackoverflow: http://stackoverflow.com/questions/26310346/quickly-calculate-randomized-3d-numpy-array-from-2d-numpy-array/
Don't see why this would need to be an alternative -- why not just add an
axis argument to shuffle? Defaulting to None, like np.sum.
On Sat, Oct 11, 2014 at 9:36 PM, Warren Weckesser wrote:
[...]
Reply to this email directly or view it on GitHub
https://github.com/numpy/numpy/issues/5173.
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
The current behavior of shuffle is not really like axis=None. It treats its argument as a one-dimensional sequence.
In [181]: a = np.arange(20).reshape(4,5)
In [182]: np.random.shuffle(a)
In [183]: a
Out[183]:
array([[ 0, 1, 2, 3, 4],
[15, 16, 17, 18, 19],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
You can interpret that as being axis=0, but the missing feature is the independent shuffling of the 1-D slices.
For a 2-D array, you can shuffle a.T to emulate axis=1, but this won't get you independent shuffling:
In [184]: a = np.arange(20).reshape(4,5)
In [185]: np.random.shuffle(a.T)
In [186]: a
Out[186]:
array([[ 4, 1, 0, 3, 2],
[ 9, 6, 5, 8, 7],
[14, 11, 10, 13, 12],
[19, 16, 15, 18, 17]])
In disarrange, I would expect axis=None to act like np.random.shuffle(a.flat).
It would be fine if the alternative shuffling was implemented by adding appropriate arguments to shuffle that control how it behaves, but I don't have a proposal for that API.
Perhaps two arguments could be added to shuffle: axis and independent (or something along those lines). The new signature would be:
def shuffle(a, independent=False, axis=0)
When independent is False, it acts like the current shuffle. When True, it acts like disarrange.
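To make the two modes of that signature concrete, here is a rough pure-Python sketch. The name shuffle_sketch is hypothetical (it is not a NumPy function), and treating independent=False with axis != 0 as "shuffle a view with that axis moved to the front" is my assumption about what the flag would mean, not a settled design.

```python
import numpy as np


def shuffle_sketch(a, independent=False, axis=0):
    """Hypothetical illustration of the proposed `independent` flag."""
    if not independent:
        # Current behavior: one permutation, applied to whole subarrays.
        # The swapped array is a view, so `a` is modified in place.
        np.random.shuffle(a.swapaxes(axis, 0))
        return
    # Proposed behavior: a fresh permutation for every 1-D slice,
    # exactly as disarrange does.
    b = a.swapaxes(axis, -1)
    for ndx in np.ndindex(b.shape[:-1]):
        np.random.shuffle(b[ndx])


original = np.arange(20).reshape(4, 5)

x = original.copy()
shuffle_sketch(x, independent=False, axis=1)
# Whole columns moved together, so row i is still row 0 plus 5*i.
same_columns = bool(np.all(x - x[0] == 5 * np.arange(4)[:, None]))

y = original.copy()
shuffle_sketch(y, independent=True, axis=1)
# Each row still holds its own values, in an order of its own.
rows_preserved = bool(np.all(np.sort(y, axis=1) == original))
```

Both same_columns and rows_preserved come out True regardless of the random draw; what differs between the two modes is whether the rows share a single permutation.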
Oh, ugh, I just assumed that it was more consistent with analogous
functions like sort :-(. It would be nicer if this kind of
shuffling-of-slices were written like idx = arange(...); shuffle(idx);
multi_dim_array[idx, ...]; but no-one asked me :-)
I'm +1 on a version of shuffle that has calling conventions that match
np.sort, though as a rule we should check with the list. They might have
suggestions on crucial issues like the best name too :-)
(Maybe "scramble"?)
Ah, describing the desired behavior as an analog of sort is a good idea.
> Oh, ugh, I just assumed that it was more consistent with analogous functions like sort
I was surprised, too, and based on the comments on the stackoverflow question, at least two other experienced numpy users were surprised. I'll start a discussion on the mailing list.
I guess if the average user is currently getting it wrong then it's worth
mentioning the other option -- we _could_ add an argument to choose between
the two behaviours, which starts out defaulting to the current behaviour,
and at some point switch the default after much FutureWarning and shouting
to warn people. But that's an ugly transition to make...
> They might have suggestions on crucial issues like the best name too.
We need a function named Sue.
Just wanted to +1 this feature, as I too expected it to exist, analogously to sort(axis=N). Was there any decision made on the mailing list?
The mailing list thread is here:
http://thread.gmane.org/gmane.comp.python.numeric.general/59014
This would be really useful!
I also would appreciate that!
According to https://stackoverflow.com/a/35647011/3401634, for multi-dimensional arrays X
np.random.shuffle(X)
is the same as
np.take(X, np.random.permutation(X.shape[0]), axis=0, out=X)
So why not implement
np.random.shuffle(X, axis=axis)
as
np.take(X, np.random.permutation(X.shape[axis]), axis=axis, out=X)
with the default axis=0?
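For what it's worth, here is a small check of what that np.take formulation does. It is only a sketch (the variable names are mine), and it returns a copy rather than using out=X so that the before and after can be compared. It shows that a single permutation is shared by every slice, so this generalizes the axis of the existing behavior but does not give independent shuffling.

```python
import numpy as np

X = np.arange(20).reshape(4, 5)
axis = 1

perm = np.random.permutation(X.shape[axis])
Y = np.take(X, perm, axis=axis)

# Every row was rearranged by the same `perm`: columns travel as units.
rows_share_one_permutation = all(
    np.array_equal(Y[i], X[i, perm]) for i in range(X.shape[0])
)
```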
Any news on this? I was surprised this functionality doesn't exist. For now I'm using np.apply_along_axis with np.random.permutation as a workaround.
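A minimal sketch of that workaround, assuming a 2-D array and shuffling within each row (the variable names are only for illustration). Note that it returns a new array instead of working in place, and that the loop over slices runs in Python.

```python
import numpy as np

x = np.arange(20).reshape(4, 5)

# np.random.permutation returns a shuffled copy of each 1-D slice it is
# given, so every row gets its own independent permutation.
y = np.apply_along_axis(np.random.permutation, 1, x)
```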
Can this be closed now because of #13829?
(Note that while working on the examples here, I found a bug in the new shuffle code. In what follows, I am using the fix proposed in https://github.com/numpy/numpy/pull/14662, which has been merged.)
@wkschwartz, the change in #13829 is useful, but it is not the enhancement requested here. The axis added in #13829 still treats the array as a 1-d sequence to be shuffled. The new axis argument allows the user to specify which axis is viewed as the 1-d axis, but it does not do an independent shuffle within the axis.
For example,
In [1]: import numpy as np
In [2]: rng = np.random.default_rng()
In [3]: x = np.arange(20).reshape(2, 10)
In [4]: x
Out[4]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
In [5]: rng.shuffle(x, axis=1)
In [6]: x
Out[6]:
array([[ 5, 9, 6, 4, 7, 0, 3, 2, 1, 8],
[15, 19, 16, 14, 17, 10, 13, 12, 11, 18]])
You can see that the rows have not been independently shuffled. The columns have been rearranged, but the values within each column are the same.
The behavior requested in this issue is to shuffle independently, as in the disarrange code I gave above:
In [10]: x = np.arange(20).reshape(2, 10)
In [11]: x
Out[11]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
In [12]: disarrange(x, axis=1)
In [13]: x
Out[13]:
array([[ 4, 3, 7, 8, 0, 6, 5, 2, 9, 1],
[12, 15, 19, 17, 18, 14, 10, 13, 11, 16]])
I would like to float this again, maybe also for Wednesday's meeting. We just added higher dimensional capabilities to choice and permutation, and in 1.18 even the axis argument (it is thus brand new).
All of these use the current shuffle logic, which is "shuffle the subarrays along this axis", instead of "shuffle along (individual) axis", which I think is arguably what should happen. I.e. it shuffles "over" instead of "along" or within the given axis.
But on almost all occasions, axis means "along axis" in NumPy, with hopefully very few exceptions, such as apply_over_axes, which has the "over" in the name. So I will be so bold as to claim that even renaming the argument to over_axis=0 would be better to avoid confusion! Especially for random numbers, where incorrect shuffling may be very hard to notice.
As noted in the github cross-reference above, I have a work-in-progress PR at https://github.com/numpy/numpy/pull/15121. I got some good feedback after submitting the PR, but I haven't made time to address all the issues that were brought up.
@WarrenWeckesser that is cool. What I am personally more urgently concerned about is that we expanded the "over" meaning in the new API, and recently at that.
And I am wondering if we shouldn't pull that partially back, e.g. by at least renaming the axis argument. Or even getting rid of the multi-dimensional behaviour completely again for the moment...
I am probably just overreacting right now, because I am a bit annoyed that I missed this or did not think it through to the end before... But I honestly think the current logic is very dangerous. It is easy to miss that it does not provide the expected "along" meaning. And it is not the meaning that np.sort uses.
@seberg, thanks for poking this issue. I think we still need to reach consensus on the API. I'll try to give a brief summary of past ideas here. I'll follow your convention of using "over" and "along" for two interpretations of axis. I don't know if we can, at this point, totally undo the existing "over" interpretation of axis for shuffle and permutation, but I think a lot of people would be happy if it turns out we can. :)
At the end of the mailing list discussion several years ago, I ended up thinking the solution was to not change the APIs of shuffle and permutation, and instead introduce two new methods that randomized along the axis instead of over it. One method would work in-place, and the other would return a copy. My preference at the time was for the names permute and permuted, but there were some objections to those names. In the PR from last December, I called them randomly_permute and randomly_permuted, but those names should be considered place-holders. Before trying to decide on those names, we have to decide if adding two new functions is the right approach. From here on, for brevity, I'll refer to the proposed new methods as permute and permuted.
With the new functions, we would have the following related Generator methods:
meaning    operate     return
of axis    in-place    copy
-------    --------    -----------
"over"     shuffle     permutation
"along"    permute     permuted
(The methods that operate "over" the axis, shuffle and permutation, already exist.)
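As a rough picture of the "along" row of that table, here is a loop-free sketch. The names permuted_sketch and permute_sketch are hypothetical stand-ins for the proposed methods, and the argsort-of-random-keys technique is just one convenient way to get an independent permutation per slice; it is not how the PR implements it.

```python
import numpy as np


def permuted_sketch(x, axis=-1, rng=None):
    """Return a copy of `x` with each 1-D slice along `axis` shuffled
    independently.  Hypothetical helper, not a NumPy function."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    # Sorting independent random keys yields an independent random
    # permutation of indices for every slice along `axis`.
    idx = np.argsort(rng.random(x.shape), axis=axis)
    return np.take_along_axis(x, idx, axis=axis)


def permute_sketch(x, axis=-1, rng=None):
    """In-place counterpart: `x` must be a writable ndarray."""
    x[...] = permuted_sketch(x, axis=axis, rng=rng)


a = np.arange(20).reshape(4, 5)
b = permuted_sketch(a, axis=1)   # `a` is unchanged, `b` is a new array
permute_sketch(a, axis=0)        # `a` itself is modified
```

The copy-returning function accepts any array-like, while the in-place one only makes sense for an existing writable ndarray.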
Instead of two new methods, it has been suggested that we have just one, with a parameter that controls the in-place vs. copy behavior. Two suggestions have been floated for this:
(a) Add an out parameter. To work in-place, pass the input array as out. If out is not provided, return a shuffled copy.
(b) Add a boolean flag such as copy or inplace, that specifies the desired behavior.
The main alternative to creating new methods is to add a new parameter to the existing methods that changes how axis is interpreted. Before listing these, I'll reiterate a comment that Robert Kern made in the mailing list thread about how the extra argument is likely to be used in practice (here referring to the independent parameter shown below):
> It seems to me a perfectly good reason to have two methods instead of
> one. I can't imagine when I wouldn't be using a literal True or False
> for this, so it really should be two different methods.
(Editorial digression: Inevitably in discussions like this, the issue of growing the namespace (in this case, the Generator namespace) comes up (sometimes referred to as "namespace pollution"). Let's agree that, yes, all things being equal, a smaller namespace is better. But, like most API design decisions, there are tradeoffs to be considered. If we keep the namespace smaller but create methods with awkward or overly complicated APIs, we're not winning.)
Having said all that, here are two additions to the existing signature of shuffle that have been suggested.
(1) shuffle(x, axis=0, independent=False): The boolean flag independent determines how axis is interpreted: False -> "over", True -> "along". (There are probably better names than independent.)
(2) shuffle(x, axis=0, iaxis=???): A second argument, iaxis, gives the axis for "along" behavior. (How this interacts with axis needs a clear specification. Presumably giving a value for iaxis causes axis to be ignored.)
I think I've covered all the various API ideas that have come up. If anyone knows of others, let us know.
I am happy with increasing the API here. I am not sure there is much reason to be against it:
A kwarg for a total behaviour switch does not seem like a normal pattern; I think Robert Kern was completely right there.
I suppose what is going on here is that shuffle and permutation (and maybe choice) can be compared to an indexing operation (i.e. take), which uses the same meaning for axis. And the reason why it feels a bit strange to me is probably the downside of this definition: it can never generalize to N-D, unlike typical array-aware functions (even indexing itself does if you use arr[..., index]; that is, it generalizes to stacks of arrays, doing the same operation as before for each individual one).
Note that take_along_axis correctly provides the N-D generalizable "along" meaning for take (even if it seems complicated). apply_along_axis and apply_over_axes are where I got the "over" from, although I am not sure that "over" is the right word...
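To illustrate the take vs. take_along_axis distinction with fixed (non-random) indices, here is a small example; the index values are arbitrary and chosen only so the difference is easy to see.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# "over": one index vector, applied identically to every row.
over = np.take(x, [3, 2, 1, 0], axis=1)

# "along": an index array with the same shape as `x`, one row of
# indices per row of data.
idx = np.array([[3, 2, 1, 0],
                [0, 1, 2, 3],
                [1, 0, 3, 2]])
along = np.take_along_axis(x, idx, axis=1)
```

Here every row of over is reversed, while along reverses only the first row, leaves the second alone, and swaps pairs in the third.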
I find permutation (which isn't easily changeable but should be shuffled) to be the real outlier here. If it was shuffle-shuffled, permute-permuted then I think things start looking pretty clear and reasonable. Anyone willing to add shuffled and start a deprecation on permutation? permutation is also not very consistent in its behavior with itertools.permutations, FWIW.
I do think permutation, permute, permuted is a confusing triple of similar-sounding names with different behaviors. It would be good (possibly over the long-run) to avoid this.
While it seems simple to extend the existing API, I think @rkern's point about not having keywords that radically change behavior is the best path.
I suppose for in-place vs. not-in-place, we have the alternative out= spelling in NumPy. But since shuffle is in-place that is not a solution, and shuffled is nice. It could be for permuted (i.e. permuted(arr, out=arr) means the same as permute(arr), except that, unlike shuffle, it will convert to ndarray).
In any case, I like the plan of deprecating permutation in favor of shuffled to clean up the new namespace!
I'm getting back to this issue (and the related PR at https://github.com/numpy/numpy/pull/15121).
Back when I created the original issue, and tried to describe the problem with the current shuffle API, it was pointed out that one way to explain the problem is that most folks will expect the axis argument of shuffle to act the same as the axis argument of sort. The analogy with sort is pretty good, so it might be useful to also look at how we handle the issue of in-place operation vs copying for sorting. The function numpy.sort() accepts an array-like argument and returns a sorted copy. For in-place sorting, one uses the ndarray sort() method. Because it is a method on an existing ndarray, the in-place operation is clear. Over in gh-15121, the argument of the in-place function that randomly permutes its argument must be an ndarray, and not an arbitrary array-like. Otherwise, the function will have to do all the shape discovery that np.array does, and also reject inputs that turn out to be immutable (e.g. we can't do an in-place shuffle of [(1, 2, 3, 4), (5, 6, 7, 8)]).
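For reference, the sort behavior being used as the analogy looks like this (a short illustration with made-up data):

```python
import numpy as np

data = [[3, 1, 2], [9, 7, 8]]     # array-like, not an ndarray

# Function: accepts array-likes and returns a sorted copy.
s = np.sort(data, axis=1)

# Method: sorts the existing ndarray itself and returns None.
arr = np.array(data)
result = arr.sort(axis=1)
```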
It would be great if we could truly replicate the sort API, with a function that returns a shuffled copy, and an ndarray method that shuffles in-place, but I don't think adding such a method to the ndarray class has any chance of being accepted.
> and an ndarray _method_ that shuffles in-place, but I don't think adding such a method to the ndarray class has any chance of being accepted.
Without a singleton generator I think this would be impossible to achieve.
@bashtage wrote
> I find permutation (which isn't easily changeable but should be shuffled) to be the real outlier here. [If it] was shuffle-shuffled, permute-permuted then I think things start looking pretty clear and reasonable. Anyone willing to add shuffled and start a deprecation on permutation?
This is what the mailing list discussion (sort of) converged to back in 2014. Here's a link to Nathaniel's suggestion: https://mail.python.org/pipermail/numpy-discussion/2014-October/071364.html
His scramble[d] is what I called randomly_permute[d] in https://github.com/numpy/numpy/pull/15121.
If we add shuffled as a replacement for permutation, and call the new methods that operate along an axis permute[d], the table of related functions is
meaning    operate
of axis    in-place    return copy
-------    --------    -----------
"over"     shuffle     shuffled
"along"    permute     permuted
which has a nice consistency. In this version of the API, none of the methods have an out parameter.
Over in https://github.com/numpy/numpy/pull/15121, I recently added another method, with the ungainly and obviously temporary name permuted_with_out, that demonstrates how the out argument might be used. If we go with an out parameter, and stick with the names of the existing methods shuffle and permutation, the table looks like
meaning    operate
of axis    in-place                             return copy
-------    ---------------------------------    --------------------------
"over"     shuffle(x, axis)                     permutation(x, axis)
"along"    permuted_with_out(x, axis, out=x)    permuted_with_out(x, axis)
But if we are going to introduce an out parameter, we should be consistent and use it in permutation, too. And we can still consider replacing permutation with shuffled. And since the new shuffled method has an out parameter, which allows in-place operation, shuffle becomes redundant and can be deprecated along with permutation. Then, switching to the "nice" names of shuffled and permuted, the table is
meaning    operate
of axis    in-place                    return copy
-------    ------------------------    -----------------
"over"     shuffled(x, axis, out=x)    shuffled(x, axis)
"along"    permuted(x, axis, out=x)    permuted(x, axis)
Note that the out parameter is not just for operating in-place. It allows an output array to be reused, potentially avoiding the creation of a temporary array. This is an advantage of this API over the shuffle/shuffled/permute/permuted API, but I'm not sure how significant that advantage really is. The disadvantage of this API is the deprecation of two methods, shuffle and permutation. These can be "soft" deprecations for a while (i.e. deemphasize their use in the docs, but don't actually add a deprecation warning for a while) to lessen the immediate impact.
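The three uses of out described above (copy, reused buffer, in-place) can be tried directly on NumPy 1.20 or later, where a Generator.permuted method with axis and out parameters exists. The sketch below assumes such a version is installed; it does not apply to the NumPy releases that were current when this comment was written.

```python
import numpy as np

rng = np.random.default_rng()
x = np.arange(20).reshape(4, 5)

# No `out`: a shuffled copy is returned and `x` is untouched.
y = rng.permuted(x, axis=1)

# Reused buffer: the result is written into `buf`, which is returned.
buf = np.empty_like(x)
z = rng.permuted(x, axis=1, out=buf)

# In-place: passing the input as `out` shuffles `x` itself.
w = rng.permuted(x, axis=1, out=x)
```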
That's my summary of the two main contenders for the change. We have the shuffle/shuffled/permute/permuted version, or the version with shuffled/permuted with an out parameter. If, back in 2014, someone had jumped in to implement the changes that were discussed, we'd probably have the shuffle/shuffled/permute/permuted version already. But the version using out has a couple (small? insignificant?) advantages: two names instead of four, and out potentially allows a user to have fewer temporary variables. I would be happy with either one.
What do folks think?
Of the three scenarios you listed, in order, I would rank them 1, 3, and, quite far behind, 2. Having two permutation methods that do quite radically different things seems like a big source of confusion. My personal preference is to avoid the mandatory use of out to access a feature; I always think of out as a performance choice that can make sense in some scenarios. I would not really like to teach students to use out just to access a feature. I would also assume that in case 3, x = shuffled(x, axis, out=x) would also return x rather than return None, so that while it is in place, one might end up with x appearing 3 times.
> My personal preference is to avoid the mandatory use of out to access a feature; I always think of out as a performance choice that can make sense in some scenarios.
But shuffling in place _is_ a performance choice, isn't it?
> But shuffling in place _is_ a performance choice, isn't it?
In-place can also be a coding style choice, when available. Perhaps a confusing, and maybe error-prone one.
My personal take is that f(x, out=x) always feels a bit magical, since it is sometimes used as a very non-obvious way to achieve something quick. f(x, inplace=True), despite not looking like anything else, seems much clearer (looks a bit like an old pandas pattern that has mostly been removed).
True, but it is a coding style choice that in NumPy seems typically spelled using out=... (unless you are using an in-place operator or a method). Or maybe it is a coding style choice that NumPy doesn't actively try to make easy in most cases currently...
I admit it's a bit magical, and an inplace= kwarg may be less magical, but also without real precedent? And I am not sure if the main reason it seems less magical is that the in-place shuffle is at the core of the algorithm here. Algorithmic details should not matter much to most students, and in the end using out= also saves approximately a single copy plus the associated memory bandwidth, and is comparable to ufuncs. (Fair enough, also for ufuncs out=input is maybe somewhat magical, but it's common magic and a known pattern for advanced users.)
While possibly a bit tedious to write, and somewhat less quick to read, np.shuffled(x, out=x) seems very clear as to what the behaviour is. The non-obvious part seems only the performance impact, which to me seems like an issue reserved for advanced users to worry about.
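For comparison, this is the existing ufunc pattern that the out=input spelling is being likened to (a small illustration; np.multiply is chosen arbitrarily):

```python
import numpy as np

x = np.arange(5, dtype=float)

# Passing the input as `out` makes the ufunc overwrite `x`, and the call
# still returns that same array object rather than None.
result = np.multiply(x, 2.0, out=x)
```

So with ufuncs, x = f(x, out=x) is legal but redundant, which is the "x appearing 3 times" spelling mentioned earlier in the thread.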
A hypothetical question for those advocating the use of out: if we didn't have the existing functions numpy.sort and ndarray.sort, and we were adding a sorting function now, would the preferred API be numpy.sorted(a, axis=-1, kind=None, order=None, out=None) (with no need to implement the method ndarray.sort for in-place sorting)?
ndarray.sort is modelled after list.sort, so it probably is a sensible API choice regardless. That said, I would have been in favor of np.sort not existing, and np.sorted(..., out=...) instead.
Yes, I think np.sort should be named np.sorted (just like Python's sorted() after all). Since only the method has the in-place behaviour, I don't see much of a concern though.
I am not sure about the "with no need to implement the method ndarray.sort". I do not see anything wrong with the method (or its in-place behaviour). The question about the method is merely if we feel it is important enough to provide a convenient method short-hand.
I suppose there is also nothing wrong with having an in-place function version. The not-in-place version just seems nicer to new users, and the out= pattern seems common enough to me that advanced users are sufficiently well served.
> I am not sure about the "with no need to implement the method ndarray.sort". I do not see anything wrong with the method (or its in-place behaviour).
That was part of my API thought experiment. I didn't mean to imply that there is anything wrong with what we have now. I was just saying that, if we started from scratch (and I'll add to my hypothetical premises that we aren't concerned with matching the Python API for lists), then the preferred API for sorting would be numpy.sorted(..., out=...), and we wouldn't need anything else.
Another question, not so hypothetical: if using out is the preferred option here, then, for API consistency throughout NumPy, should we plan on eventually adding out to numpy.sort, numpy.partition, numpy.argsort, and, well, everything else that doesn't currently have it?
Yes, in my opinion adding an out= kwarg with the same semantics as for ufuncs is a good choice for practically all NumPy API functions. Any lack of an out argument is generally an enhancement waiting to be made (although, I guess in practice it may be a small enhancement and in rare cases possibly not worth too much added code complexity).