Numpy: ENH: Alternative to `random.shuffle`, with an `axis` argument.

Created on 11 Oct 2014  ·  35 Comments  ·  Source: numpy/numpy

It would be nice to have an alternative to numpy.random.shuffle that accepts an axis argument, and that independently shuffles the one-dimensional slices. Here's an implementation that I'll call disarrange. It works, but it would be nice to have a more efficient C implementation.

import numpy as np

def disarrange(a, axis=-1):
    """
    Shuffle `a` in-place along the given axis.

    Apply numpy.random.shuffle to the given axis of `a`.
    Each one-dimensional slice is shuffled independently.
    """
    # Move the target axis to the end.  `b` is a view of `a`,
    # so shuffling `b` in place also shuffles `a`.
    b = a.swapaxes(axis, -1)
    # Visit every index over the leading axes; b[ndx] is one 1-D slice.
    shp = b.shape[:-1]
    for ndx in np.ndindex(shp):
        np.random.shuffle(b[ndx])

Example:

In [156]: a = np.arange(20).reshape(4,5)

In [157]: a
Out[157]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [158]: disarrange(a, axis=-1)

In [159]: a
Out[159]: 
array([[ 2,  0,  4,  3,  1],
       [ 8,  6,  7,  9,  5],
       [11, 14, 13, 10, 12],
       [19, 18, 16, 17, 15]])

In [160]: a = np.arange(20).reshape(4,5)

In [161]: disarrange(a, axis=0)

In [162]: a
Out[162]: 
array([[ 5, 11,  7, 13, 14],
       [ 0,  6,  2,  3,  4],
       [10,  1, 17, 18, 19],
       [15, 16, 12,  8,  9]])

This request was motivated by this question on stackoverflow: http://stackoverflow.com/questions/26310346/quickly-calculate-randomized-3d-numpy-array-from-2d-numpy-array/
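
The Python-level loop over slices can also be avoided with a random-key argsort. This is only a sketch, not the C implementation asked for above: the name disarranged_sketch and the rng parameter are illustrative, it returns a shuffled copy instead of working in place, and it pays for a sort instead of a linear-time shuffle.

```python
import numpy as np

def disarranged_sketch(a, axis=-1, rng=None):
    """Return a copy of `a` with each 1-D slice along `axis` shuffled independently."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a)
    # Draw one random key per element.  Argsorting the keys along `axis`
    # gives an independent random ordering for every 1-D slice.
    order = np.argsort(rng.random(a.shape), axis=axis)
    return np.take_along_axis(a, order, axis=axis)

a = np.arange(20).reshape(4, 5)
b = disarranged_sketch(a, axis=-1)   # rows shuffled independently
c = disarranged_sketch(a, axis=0)    # columns shuffled independently
```

Sorting each slice of the result recovers the original slice, which is a quick way to check that values never cross from one slice to another.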

01 - Enhancement numpy.random

All 35 comments

Don't see why this would need to be an alternative -- why not just add an
axis argument to shuffle? Defaulting to None, like np.sum.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

The current behavior of shuffle is not really like axis=None. It treats its argument as a one-dimensional sequence.

In [181]: a = np.arange(20).reshape(4,5)

In [182]: np.random.shuffle(a)

In [183]: a
Out[183]: 
array([[ 0,  1,  2,  3,  4],
       [15, 16, 17, 18, 19],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

You can interpret that as being axis=0, but the missing feature is the independent shuffling of the 1-D slices.

For a 2-D array, you can shuffle a.T to emulate axis=1, but this won't get you independent shuffling:

In [184]: a = np.arange(20).reshape(4,5)

In [185]: np.random.shuffle(a.T)

In [186]: a
Out[186]: 
array([[ 4,  1,  0,  3,  2],
       [ 9,  6,  5,  8,  7],
       [14, 11, 10, 13, 12],
       [19, 16, 15, 18, 17]])

In disarrange, I would expect axis=None to act like np.random.shuffle(a.flat).

It would be fine if the alternative shuffling was implemented by adding appropriate arguments to shuffle that control how it behaves, but I don't have a proposal for that API.

Perhaps two arguments could be added to shuffle: axis and independent (or something along those lines). The new signature would be:

def shuffle(a, independent=False, axis=0)

When independent is False, it acts like the current shuffle. When True, it acts like disarrange.
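
A rough sketch of how the two modes of that signature could behave. Everything here is hypothetical: shuffle_sketch is not a NumPy function, the rng parameter is added only for the example, and the Python loop stands in for whatever a real implementation would do.

```python
import numpy as np

def shuffle_sketch(a, independent=False, axis=0, rng=None):
    """Shuffle the ndarray `a` in place (hypothetical signature)."""
    rng = np.random.default_rng() if rng is None else rng
    # Bring `axis` to the front.  `b` is a view, so writing to it modifies `a`.
    b = np.moveaxis(a, axis, 0)
    n = b.shape[0]
    if not independent:
        # One permutation, applied to whole subarrays (current shuffle behavior).
        b[...] = b[rng.permutation(n)]
    else:
        # A fresh permutation for every 1-D slice (disarrange behavior).
        for ndx in np.ndindex(b.shape[1:]):
            sl = (slice(None),) + ndx
            b[sl] = b[sl][rng.permutation(n)]

a = np.arange(20).reshape(4, 5)
shuffle_sketch(a, independent=False, axis=1)   # columns move as units

a2 = np.arange(20).reshape(4, 5)
shuffle_sketch(a2, independent=True, axis=1)   # each row shuffled on its own
```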

Oh, ugh, I just assumed that it was more consistent with analogous
functions like sort :-(. It would be nicer if this kind of
shuffling-of-slices were written like idx = arange(...); shuffle(idx);
multi_dim_array[idx, ...]; but no-one asked me :-)

I'm +1 on a version of shuffle that has calling conventions that match
np.sort, though as a rule we should check with the list. They might have
suggestions on crucial issues like the best name too :-)

(Maybe "scramble"?)
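
For reference, the index-based spelling mentioned above might look like the following. It is a sketch using the Generator API; the variable names are taken from the comment or made up for the example.

```python
import numpy as np

rng = np.random.default_rng()
multi_dim_array = np.arange(20).reshape(4, 5)

# Shuffle an index array instead of the data itself.
idx = np.arange(multi_dim_array.shape[0])
rng.shuffle(idx)

# Fancy indexing returns a copy whose rows are reordered as whole units.
shuffled_rows = multi_dim_array[idx, ...]
```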

Ah, describing the desired behavior as an analog of sort is a good idea.

Oh, ugh, I just assumed that it was more consistent with analogous functions like sort

I was surprised, too, and based on the comments on the stackoverflow question, at least two other experienced numpy users were surprised. I'll start a discussion on the mailing list.

I guess if the average user is currently getting it wrong then it's worth
mentioning the other option -- we _could_ add an argument to choose between
the two behaviours, which starts out defaulting to the current behaviour,
and at some point switch the default after much FutureWarning and shouting
to warn people. But that's an ugly transition to make...

They might have suggestions on crucial issues like the best name too.

We need a function named Sue.

Just wanted to +1 this feature, as I too expected it to exist, analogously to sort(axis=N). Was there any decision made on the mailing list?

This would be really useful!

I also would appreciate that!

According to https://stackoverflow.com/a/35647011/3401634, for multi-dimensional arrays X

np.random.shuffle(X)

is the same as

np.take(X, np.random.permutation(X.shape[0]), axis=0, out=X)

So why not implement

np.random.shuffle(X, axis=axis)

as

np.take(X, np.random.permutation(X.shape[axis]), axis=axis, out=X)

with the default axis=0?
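
A small sketch of that suggestion, with a hypothetical function name. Note that it applies one permutation to every slice, so it moves whole subarrays and does not shuffle the slices independently. Passing out=X relies on np.take buffering its output under the default mode='raise'.

```python
import numpy as np

def shuffle_over_axis_sketch(X, axis=0, rng=None):
    """Reorder the subarrays of `X` along `axis` in place (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(X.shape[axis])
    # The same `perm` is used for every slice, so subarrays stay intact.
    np.take(X, perm, axis=axis, out=X)

X = np.arange(20).reshape(4, 5)
shuffle_over_axis_sketch(X, axis=1)
```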

Any news on this? I was surprised this functionality doesn't exist. For now I'm using np.apply_along_axis with np.random.permutation as a workaround.
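
That workaround might look roughly like this. The sketch uses a Generator's permutation method in place of np.random.permutation; it returns a shuffled copy and still loops over the slices in Python, so it is not fast.

```python
import numpy as np

rng = np.random.default_rng()
a = np.arange(20).reshape(4, 5)

# rng.permutation is called once per row, so each row gets its own ordering.
shuffled = np.apply_along_axis(rng.permutation, 1, a)
```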

Can this be closed now because of #13829?

(Note that while working on the examples here, I found a bug in the new shuffle code. In what follows, I am using the fix proposed in https://github.com/numpy/numpy/pull/14662, which has been merged.)

@wkschwartz, the change in #13829 is useful, but it is not the enhancement requested here. The axis added in #13829 still treats the array as a 1-d sequence to be shuffled. The new axis argument allows the user to specify which axis is viewed as the 1-d axis, but it does not do an independent shuffle within the axis.

For example,

In [1]: import numpy as np                                                      

In [2]: rng = np.random.default_rng()                                           

In [3]: x = np.arange(20).reshape(2, 10)                                        

In [4]: x                                                                       
Out[4]: 
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])

In [5]: rng.shuffle(x, axis=1)                                                  

In [6]: x                                                                       
Out[6]: 
array([[ 5,  9,  6,  4,  7,  0,  3,  2,  1,  8],
       [15, 19, 16, 14, 17, 10, 13, 12, 11, 18]])

You can see that the rows have not been independently shuffled. The columns have been rearranged, but the values within each column are the same.

The behavior requested in this issue is to shuffle independently, as in the disarrange code I gave above:

In [10]: x = np.arange(20).reshape(2, 10)                                       

In [11]: x                                                                      
Out[11]: 
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])

In [12]: disarrange(x, axis=1)                                                  

In [13]: x                                                                      
Out[13]: 
array([[ 4,  3,  7,  8,  0,  6,  5,  2,  9,  1],
       [12, 15, 19, 17, 18, 14, 10, 13, 11, 16]])

I would like to float this again, maybe also for Wednesday's meeting. We just added higher dimensional capabilities to choice and permutation and in 1.18 even the axis argument (it is thus brand new).

All of these use the current shuffle logic, which shuffles the subarrays indexed by this axis as whole units, instead of shuffling along each (individual) axis, which I think is arguably what should happen. I.e. it shuffles "over" instead of "along" or within the given axis.

But, in almost all occasions, axis means along axis in NumPy, with hopefully very few exceptions, such as apply_over_axes which has the "over" in the name. So I will be so bold and claim that even renaming the argument to over_axis=0 would be better to avoid confusion! Especially for random numbers where incorrect shuffling may be very hard to notice.
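
To make the difference concrete, here is a small check (a sketch, assuming NumPy 1.18 or later for the axis argument of Generator.shuffle). After an "over" shuffle the columns are still intact, which is exactly the kind of structure that is easy to overlook.

```python
import numpy as np

rng = np.random.default_rng()

x = np.arange(20).reshape(2, 10)
rng.shuffle(x, axis=1)            # "over": whole columns move together

# Every column still pairs j with j + 10, so the row difference is constant.
columns_intact = bool((x[1] - x[0] == 10).all())

y = np.array([[3, 1, 2],
              [9, 7, 8]])
y_sorted = np.sort(y, axis=1)     # "along": each row is handled on its own
```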

As noted in the github cross-reference above, I have a work-in-progress PR at https://github.com/numpy/numpy/pull/15121. I got some good feedback after submitting the PR, but I haven't made time to address all the issues that were brought up.

@WarrenWeckesser that is cool, what I am personally more urgently concerned about is that we expanded the over meaning in the new API and recently at that.
And I am wondering if we shouldn't pull that partially back, e.g. by at least renaming the axis argument. Or even getting rid of multi-dimensional behaviour completely again for the moment...

I am probably just overreacting right now, because I am a bit annoyed that I missed this or did not think it to the end before... But I honestly think the current logic is very dangerous. It is easy to miss that it does not provide the expected along meaning. And it is not the meaning that np.sort uses.

@seberg, thanks for poking this issue. I think we still need to reach consensus on the API. I'll try to give a brief summary of past ideas here. I'll follow your convention of using "over" and "along" for two interpretations of axis. I don't know if we can, at this point, totally undo the existing "over" interpretation of axis for shuffle and permutation, but I think a lot of people would be happy if it turns out we can. :)

At the end of the mailing list discussion several years ago, I ended up thinking the solution was to not change the APIs of shuffle and permutation, and instead introduce two new methods that randomized along the axis instead of over it. One method would work in-place, and the other would return a copy. My preference at the time was for the names permute and permuted, but there were some objections to those names. In the PR from last December, I called them randomly_permute and randomly_permuted, but those names should be considered place-holders. Before trying to decide on those names, we have to decide if adding two new functions is the right approach. From here on, for brevity, I'll refer to the proposed new methods as permute and permuted.

With the new functions, we would have the following related Generator methods:

meaning    operate
of axis    in-place   return copy
-------    ---------  -----------
"over"     shuffle    permutation
"along"    permute    permuted

(The methods that operate "over" the axis, shuffle and permutation, already exist.)
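
For the existing "over" pair, the in-place vs. copy split looks like this (a short sketch using the Generator API):

```python
import numpy as np

rng = np.random.default_rng()
x = np.arange(12).reshape(3, 4)

result = rng.shuffle(x)     # works in place on x and returns None
y = rng.permutation(x)      # returns a shuffled copy; x is not modified
```

In both cases the rows are moved as whole units.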

Instead of two new methods, it has been suggested that we have just one, with a parameter that controls the in-place vs. copy behavior. Two suggestions have been floated for this:

(a) Add an out parameter. To work in-place, pass the input array as out. If out is not provided, return a shuffled copy.
(b) Add a boolean flag such as copy or inplace, that specifies the desired behavior.
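
The two calling conventions can be compared side by side. This is a sketch only: both function names, the inplace flag name, and the rng parameter are hypothetical, and the random-key argsort is just a convenient stand-in for a real "along the axis" shuffle.

```python
import numpy as np

def _shuffled_along(x, axis, rng):
    # Independent random ordering for each 1-D slice along `axis`.
    order = np.argsort(rng.random(x.shape), axis=axis)
    return np.take_along_axis(x, order, axis=axis)

def permuted_out_sketch(x, axis=0, out=None, rng=None):
    """Option (a): pass out=x to work in place, omit out to get a copy."""
    rng = np.random.default_rng() if rng is None else rng
    result = _shuffled_along(np.asarray(x), axis, rng)
    if out is None:
        return result
    out[...] = result
    return out

def permuted_flag_sketch(x, axis=0, inplace=False, rng=None):
    """Option (b): a boolean flag selects in-place operation."""
    rng = np.random.default_rng() if rng is None else rng
    result = _shuffled_along(np.asarray(x), axis, rng)
    if inplace:
        x[...] = result
        return None
    return result

x = np.arange(20).reshape(4, 5)
r = permuted_out_sketch(x, axis=1, out=x)              # in place, returns x
y = permuted_flag_sketch(x, axis=1)                    # copy
z = permuted_flag_sketch(x, axis=1, inplace=True)      # in place, returns None
```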

The main alternative to creating new methods is to add a new parameter to the existing methods that changes how axis is interpreted. Before listing these, I'll reiterate a comment that Robert Kern made in the mailing list thread about how the extra argument is likely to be used in practice (here referring to the independent parameter shown below):

It seems to me a perfectly good reason to have two methods instead of
one. I can't imagine when I wouldn't be using a literal True or False
for this, so it really should be two different methods.

(Editorial digression: Inevitably in discussion like this, the issue of growing the namespace (in this case, the Generator namespace) comes up (sometimes referred to as "namespace pollution"). Let's agree that, yes, all things being equal, a smaller namespace is better. But, like most API design decisions, there are tradeoffs to be considered. If we keep the namespace smaller but create methods with awkward or overly complicated APIs, we're not winning.)

Having said all that, here are two additions to the existing signature of shuffle that have been suggested.

(1) shuffle(x, axis=0, independent=False): The boolean flag independent determines how axis is interpreted: False -> "over", True -> "along". (There are probably better names than independent.)
(2) shuffle(x, axis=0, iaxis=???): A second argument, iaxis, gives the axis for "along" behavior. (How this interacts with axis needs a clear specification. Presumably giving a value for iaxis causes axis to be ignored.)

I think I've covered all the various API ideas that have come up. If anyone knows of others, let us know.

I am happy with increasing the API here. I am not sure there is much reason to be against it:

  • We can probably agree it is useful
  • There is no good way to achieve it with existing features
  • Using a kwarg for a total behaviour switch does not seem like a normal pattern; I think Robert Kern was completely right there.

I suppose what is going on here is that shuffle and permutation (and maybe choice) can be compared to an indexing operation (i.e. take), which uses the same meaning for axis. The reason it feels a bit strange to me is probably the downside of this definition: it can never generalize to N-D, unlike typical array-aware functions. (Even indexing itself does if you use arr[..., index]; that is, it generalizes to stacks of arrays by doing the same operation as before for each individual one.)
Note that take_along_axis provides the "along" meaning for take and generalizes correctly to N-D (even if it seems complicated). apply_along_axis and apply_over_axes are where I got the "over" from, although I am not sure that "over" is the right word...
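
A tiny example of the two indexing flavours, with made-up data and index values:

```python
import numpy as np

x = np.array([[10, 11, 12],
              [20, 21, 22]])

# np.take: a single 1-D index, applied identically to every row ("over").
same_for_all = np.take(x, [2, 0, 1], axis=1)
# -> [[12, 10, 11],
#     [22, 20, 21]]

# np.take_along_axis: an index array shaped like x, one index set per row ("along").
idx = np.array([[2, 0, 1],
                [1, 2, 0]])
per_row = np.take_along_axis(x, idx, axis=1)
# -> [[12, 10, 11],
#     [21, 22, 20]]
```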

I find permutation (which isn't easily changeable but should be shuffled) to be the real outlier here. If it was shuffle-shuffled, permute-permuted then I think things start looking pretty clear and reasonable. Anyone willing to add shuffled and start a deprecation on permutation? permutation is also not very consistent in its behavior with itertools.permutations, FWIW.

I do think permutation, permute, permuted is a confusing triple of similar-sounding names with different behaviors. It would be good (possibly over the long-run) to avoid this.

While it seems simple to extend the existing API, I think @rkern's point about not having keywords that radically change behavior is the best path.

I suppose for in-place vs. not-in-place, we have the alternative out= spelling in NumPy. But since shuffle is in-place that is not a solution and shuffled is nice. It could be for permuted (i.e. permuted(arr, out=arr) means the same as permute(arr), except – unlike shuffle – it will convert to ndarray).
In any case, I like the plan of deprecating permutation in favor of shuffled to clean up the new namespace!

I'm getting back to this issue (and the related PR at https://github.com/numpy/numpy/pull/15121).

Back when I created the original issue, and tried to describe the problem with the current shuffle API, it was pointed out that one way to explain the problem is that most folks will expect the axis argument of shuffle to act the same as the axis argument of sort. The analogy with sort is pretty good, so it might be useful to also look at how we handle the issue of in-place operation vs copying for sorting. The function numpy.sort() accepts an array-like argument and returns a sorted copy. For in-place sorting, one uses the ndarray sort() method. Because it is a method on an existing ndarray, the in-place operation is clear. Over in gh-15121, the argument of the in-place function that randomly permutes its argument must be an ndarray, and not an arbitrary array-like. Otherwise, the function will have to do all the shape discovery that np.array does, and also reject inputs that turn out to be immutable (e.g. we can't do an in-place shuffle of [(1, 2, 3, 4), (5, 6, 7, 8)]).
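
The sort precedent described above, in a few lines (the data is made up for illustration):

```python
import numpy as np

data = [(3, 1, 2), (9, 7, 8)]         # nested tuples: array-like, but immutable

sorted_copy = np.sort(data, axis=1)   # accepts any array-like, returns a new ndarray

arr = np.array(data)
result = arr.sort(axis=1)             # ndarray method: sorts in place, returns None
```

The function can accept the tuples because it only has to produce a new array. The in-place form is available only as a method on an ndarray, which is already mutable.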

It would be great if we could truly replicate the sort API, with a function that returns a shuffled copy, and an ndarray method that shuffles in-place, but I don't think adding such a method to the ndarray class has any chance of being accepted.

and an ndarray _method_ that shuffles in-place, but I don't think adding such a method to the ndarray class has any chance of being accepted.

Without a singleton generator I think this would be impossible to achieve.

@bashtage wrote

I find permutation (which isn't easily changeable but should be shuffled) to be the real outlier here. [If it] was shuffle-shuffled, permute-permuted then I think things start looking pretty clear and reasonable. Anyone willing to add shuffled and start a deprecation on permutation?

This is what the mailing list discussion (sort of) converged to back in 2014. Here's a link to Nathaniel's suggestion: https://mail.python.org/pipermail/numpy-discussion/2014-October/071364.html

His scramble[d] is what I called randomly_permute[d] in https://github.com/numpy/numpy/pull/15121.

If we add shuffled as a replacement for permutation, and call the new methods that operate along an axis permute[d], the table of related functions is

meaning    operate
of axis    in-place   return copy
-------    ---------  -----------
"over"     shuffle    shuffled
"along"    permute    permuted

which has a nice consistency. In this version of the API, none of the methods have an out parameter.

Over in https://github.com/numpy/numpy/pull/15121, I recently added another method, with the ungainly and obviously temporary name permuted_with_out that demonstrates how the out argument might be used. If we go with an out parameter, and stick with the names of the existing methods shuffle and permutation, the table looks like

meaning    operate
of axis    in-place                           return copy
-------    ---------------------------------  --------------------
"over"     shuffle(x, axis)                   permutation(x, axis)
"along"    permuted_with_out(x, axis, out=x)  permuted_with_out(x, axis)

But if we are going to introduce an out parameter, we should be consistent and use it in permutation, too. And we can still consider replacing permutation with shuffled. And since the new shuffled method has an out parameter, which allows in-place operation, shuffle becomes redundant and can be deprecated along with permutation. Then, switching to the "nice" names of shuffled and permuted, the table is

    meaning    operate
    of axis    in-place                  return copy
    -------    ------------------------  -----------------
    "over"     shuffled(x, axis, out=x)  shuffled(x, axis)
    "along"    permuted(x, axis, out=x)  permuted(x, axis)

Note that the out parameter is not just for operating in-place. It allows an output array to be reused, potentially avoiding the creation of a temporary array. This is an advantage of this API over the shuffle/shuffled/permute/permuted API, but I'm not sure how significant that advantage really is. The disadvantage of this API is the deprecation of two methods, shuffle and permutation. These can be "soft" deprecations for a while (i.e. deemphasize their use in the docs, but don't actually add a deprecation warning for a while) to lessen the immediate impact.

That's my summary of the two main contenders for the change. We have the shuffle/shuffled/permute/permuted version, or the version with shuffled/permuted with an out parameter. If, back in 2014, someone had jumped in to implement the changes that were discussed, we'd probably have the shuffle/shuffled/permute/permuted version already. But the version using out has a couple (small? insignificant?) advantages: two names instead of four, and out potentially allows a user to have fewer temporary variables. I would be happy with either one.

What do folks think?

I would rank the three scenarios you listed, in order, as 1, 3, and, quite far behind, 2. Having two permutation-named methods that do quite radically different things seems like a big source of confusion. My personal preference is to avoid the mandatory use of out to access a feature; I always think of out as a performance choice that can make sense in some scenarios. I would not really like to teach students to use out just to access a feature. I would also assume that in case 3 x = shuffled(x, axis, out=x) would also return x rather than return None, so that while it is in place, one might end up with x appearing 3 times.

My personal preference is to avoid the mandatory use of out to access a feature; I always think of out as a performance choice that can make sense in some scenarios.

But shuffling in place _is_ a performance choice, isn't it?

But shuffling in place _is_ a performance choice, isn't it?

In-place can also be a coding style choice, when available. Perhaps a confusing, and maybe error-prone one.

My personal take is that f(x, out=x) always feels a bit magical since it is sometimes used as a very non-obvious way to achieve something quick. f(x, inplace=True), despite not looking like anything else, seems much clearer (looks a bit like an old pandas pattern that has mostly been removed).

True, but it is a coding style choice that in NumPy seems typically spelled using out=... (unless you are using an in-place operator or a method). Or maybe it is a coding style choice that NumPy doesn't actively try to make easy in most cases currently...

I admit it's a bit magical and an inplace= kwarg may be less magical, but also without real precedent? And I am not sure if the main reason it seems less magical is that the in-place shuffle is at the core of the algorithm here. Algorithmic details should not matter much to most students, and in the end using out= also saves approximately a single copy plus the associated memory bandwidth, and is comparable to ufuncs. (Fair enough, also for ufuncs out=input is maybe somewhat magical, but it's common magic and a known pattern – for advanced users.)
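
For comparison, this is the existing ufunc pattern being referred to, as a minimal sketch:

```python
import numpy as np

x = np.arange(5, dtype=float)

# The result is written into x itself, and the same array object is returned.
y = np.add(x, 1.0, out=x)

same_object = y is x
```

Because the call returns the out array, writing x = f(x, out=x) is redundant but harmless.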

While possibly a bit tedious to write, and somewhat less quick to read, np.shuffled(x, out=x) seems very clear as to what the behaviour is. The non-obvious part seems only the performance impact, which to me seems like an issue reserved for advanced users to worry about.

A hypothetical question for those advocating the use of out: if we didn't have the existing functions numpy.sort and ndarray.sort, and we were adding a sorting function now, would the preferred API be numpy.sorted(a, axis=-1, kind=None, order=None, out=None) (with no need to implement the method ndarray.sort for in-place sorting)?

ndarray.sort is modelled after list.sort, so it probably is a sensible API choice regardless. That said, I would have been in favor of np.sort not existing, and np.sorted(..., out=...) instead.

Yes, I think np.sort should be named np.sorted (just like python's sorted() after all). Since only the method has the in-place behaviour, I don't see much of a concern though.

I am not sure about the "with no need to implement the method ndarray.sort". I do not see anything wrong with the method (or its in-place behaviour). The question about the method is merely if we feel it is important enough to provide a convenient method short-hand.
I suppose there is also nothing wrong with having an in-place function version. The not-in-place version just seems nicer to new users, and the out= pattern seems common enough to me that advanced users are sufficiently well served.

I am not sure about the "with no need to implement the method ndarray.sort". I do not see anything wrong with the method (or its in-place behaviour).

That was part of my API thought experiment. I didn't mean to imply that there is anything wrong with what we have now. I was just saying that, if we started from scratch--and I'll add to my hypothetical premises that we aren't concerned with matching the Python API for lists--then the preferred API for sorting would be numpy.sorted(..., out=...), and we wouldn't need anything else.

Another question, not so hypothetical: if using out is the preferred option here, then, for API consistency throughout NumPy, should we plan on eventually adding out to numpy.sort, numpy.partition, numpy.argsort, and, well, everything else that doesn't currently have it?

Yes, in my opinion adding an out= kwarg with the same semantics as for ufuncs is a good choice for practically all NumPy API functions. Any lack of an out argument is generally an enhancement waiting to be made (although, I guess in practice it may be a small enhancement and in rare cases possibly not worth too much added code complexity).
