Numpy: "Controlled" creation of object arrays

Created on 4 Dec 2019  ·  45 Comments  ·  Source: numpy/numpy

Auto-creation of object arrays was recently deprecated in numpy. I agree with the change, but it seems a bit hard to write certain kinds of generic code that determine whether a user-provided argument is convertible to a non-object array.

Reproducing code example:

Matplotlib contains the following snippet:

    # <named ("string") colors are handled earlier>
    # tuple color.
    c = np.array(c)
    if not np.can_cast(c.dtype, float, "same_kind") or c.ndim != 1:
        # Test the dtype explicitly as `map(float, ...)`, `np.array(...,
        # float)` and `np.array(...).astype(float)` all convert "0.5" to 0.5.
        # Test dimensionality to reject single floats.
        raise ValueError(f"Invalid RGBA argument: {orig_c!r}")

but sometimes the function is called with an array of colors in various formats (e.g. ["red", (0.5, 0.5, 0.5), "blue"]) -- we catch the ValueError and convert each item one at a time instead.

Now the call to np.array(c) will emit a DeprecationWarning. How can we work around that? Even something like np.min_scalar_type(c) emits a warning (which I would guess it shouldn't?), so it's not obvious to me how to check "if we converted this thing to an array, what would the dtype be?"

Numpy/Python version information:


1.19.0.dev0+bd1adc3 3.8.0 (default, Nov 6 2019, 21:49:08)
[GCC 7.3.0]


All 45 comments

One option would be
```python
import warnings
import numpy as np

try:
    # Get ahead of the game and promote the deprecation to the error that
    # will eventually replace it (`c` is the color argument from the snippet above).
    with warnings.catch_warnings():
        warnings.filterwarnings('error', category=DeprecationWarning,
                                message="...")
        c_arr = np.asarray(c)
except (DeprecationWarning, ValueError):
    ...  # whatever you currently do for ValueError
```

I guess this, and the failing test mentioned in gh-15045, are instances where emitting a DeprecationWarning for a few years instead of directly emitting a ValueError causes more code churn than needed.

Note that warnings.catch_warnings is not threadsafe. That makes the workaround a bit prone to followup issues down the line.

I think that the code-churn is worth the deprecation period.

Matplotlib runs its test suite with warnings-as-failures to catch exactly this sort of change early, so this seems like the system working to me :).

But AFAICT there isn't even a reasonably easy fix for it (as pointed out above, the proposed fix is not threadsafe) :/

I think I see @anntzer's point here. We're in a mess where downstream libraries want to fail fast so they can try something else, while users should be shown a gentler message.

The problem is today there is no way for the library author to ask "would this emit a warning" without actually... emitting the warning, and suppressing it isn't threadsafe.

Regarding warning thread-safety: https://bugs.python.org/issue37604

AFAIK, the deprecation is in the release branch. Do we want to revert it? If not, the fixes will need backports. I'm still not clear why the warnings were not raised in the release branch wheels and didn't show up in the nightly builds until the last two builds. I didn't change anything after the branch and nothing looks very suspicious in the commits since then in the master branch except, possibly, #15040.

IMHO (and in agreement with @mattip's point above) it's the kind of change that would be much easier to handle downstream if the switch to raising happened without a deprecation period. Not sure that's an option though :/

Or possibly multibuild treats branches differently than master.

FWIW I was always at least -1 on this change, especially as a keen user of ragged data structures, but anyway now I need to figure out what to do about the hundreds of test failures for the SciPy 1.4.0rc2 prep in https://github.com/scipy/scipy/pull/11161

now I need to figure out what to do about the hundreds of test failures

An easy option would be:

  • Suppress the warning in your pytest config (a sketch follows this list)
  • Open an issue to fix it later
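
For the first bullet, one way to do that project-wide is a conftest.py hook (an untested sketch; the exact warning text is an assumption and may need adjusting):

```python
# conftest.py -- register an ignore filter for the ragged-array deprecation.
def pytest_configure(config):
    config.addinivalue_line(
        "filterwarnings",
        "ignore:Creating an ndarray from ragged nested sequences:DeprecationWarning",
    )
```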

The whole point in us using DeprecationWarning instead of ValueError was to give downstream projects and users a grace period to do exactly that.

AFAIK, the deprecation is in the release branch. Do we want to revert it?

I think we do, it's raining issues. We now have a list of what's breaking in Pandas, Matplotlib, SciPy, inside numpy.testing and NumPy ufuncs, ==, etc. I think we should revert the change now and go assess/fix all those things, then reintroduce the deprecation.

Can we compromise on a PendingDeprecationWarning?

That way, downstream projects can add it to their ignore lists, and when we switch back to DeprecationWarning they get to make the decision again.

We seem to have diverged from the original issue, which seems to be "given a sequence of values, how can matplotlib determine if they are a single color or a list of colors". I think there should be a solution that does not require casting the values to an ndarray, and checking the dtype of that array. Some kind of recursive is_a_color() function might be a better solution.
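
For illustration, a rough sketch of what such a check could look like (the accepted color forms here -- strings and length-3/4 number tuples -- are a simplification of matplotlib's real rules):

```python
def is_a_color(c):
    # A color is taken to be a named-color string or a 3/4-tuple of numbers.
    if isinstance(c, str):
        return True
    try:
        return len(c) in (3, 4) and all(isinstance(v, (int, float)) for v in c)
    except TypeError:
        return False

def classify_colors(c):
    # Recurse one level: either a single color or a sequence of colors.
    if is_a_color(c):
        return "single color"
    if isinstance(c, (list, tuple)) and all(is_a_color(item) for item in c):
        return "list of colors"
    raise ValueError(f"Invalid color argument: {c!r}")
```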

I've reverted the change for 1.18.x in #15053.

The sentiment is that breaking scipy and pandas CI is annoying enough to temporarily revert it in master as well. I would like it to go back in more or less on schedule (say within a month), though we may need to find a solution first. Also, the fixups pandas is doing are slightly worrisome to me, since they use catch_warnings.

If there is really no way around it and we need thread-safe warning suppression, np.seterr could possibly hold a slot for it :/.

We seem to have diverged from the original issue, which seems to be "given a sequence of values, how can matplotlib determine if they are a single color or a list of colors".

I think the issue @anntzer brings up is more general though. It's about writing a function that takes many types of input, with logic like:

  • create ndarray(flexible_input)
  • if `new_ndarray.dtype.kind == 'O'`: handle this
  • else: use_the_array

since one can't add dtype=object to such code, what should be done?
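
Spelled out, the pattern looks roughly like this (the element-wise fallback is just one illustrative way to "handle this"):

```python
import numpy as np

def handle(flexible_input):
    arr = np.asarray(flexible_input)  # warns (eventually errors) on ragged input under NEP 34
    if arr.dtype.kind == 'O':
        # handle this: e.g. fall back to processing each item separately
        return [handle(item) for item in flexible_input]
    # use_the_array: the fast, homogeneous path
    return arr
```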

Also the fixups pandas are doing are slightly worrisome to me, since they use catch_warnings.

@seberg wasn't suppress_warnings better for this?

@rgommers no, suppress_warnings solved the issue of warning suppression being permanent when it should not be. That has been fixed in newer Python versions, though, so we do not really need it anymore (it has better properties, since it supports nesting, but it does not support thread safety; I am not sure that is possible outside Python, and even if it were, it is probably not desirable).

Not completely sure if the problematic cases run against the original intention of NEP 34 (https://numpy.org/neps/nep-0034.html) or if they were just not anticipated.

Anyway, a way out would be to explicitly enable the old behavior along the lines of "appreciating your concern, but we explicitly want the context-dependent object dtype and will handle problematic input ourselves". Something like one of

~~~
np.array(data, dtype='allow_object')

np.array(data, allow_object_dtype=True)

with np.array_create_allow_object_dtype():
    np.array(data)
~~~

All not very pretty, and the naming can surely be improved. But this gives a clean way out for libraries which relied on the behavior and want to keep it (at least for the moment).

Isn't the matplotlib case actually:

with np.forbid_ragged_arrays_immediately():
    np.array(data)

since really you want to catch the error, rather than getting an object dtype?

There is no reversion of the deprecation currently pending for master. I don't think it should be reverted wholesale as it was in 1.18 because that also removed the fixes, which I think we want to keep. @mattip A more targeted reversion would be appreciated until we decide what to do in the long term.

FWIW I think most of the places in mpl which hit this can be fixed (with more or less restructuring -- in one case it turns out the code is much faster after...).
I think @timhoffm's proposed API would be nicer than a with np.forbid_ragged_arrays_immediately: because the latter can easily be written in terms of the former (raise if np.array(..., allow_object=True).dtype == object) whereas the opposite (try: with np.forbid: ... except ValueError: ...) would be less efficient if we still want to create an object array after all. But a CM (just "locally moving past the deprecation period") would be better than nothing.

(Again, I think the change is a good one, it's just a matter of how it's executed.)

Yeah, we just need to figure out what the API should look like. As pointed out by many, there are currently two main issues:

  1. Confounding object and "allow ragged". If the objects have a reasonable type (say Decimal) you actually want to get the warning/error, but you may also need to pass dtype=object.
  2. There is no way to opt in to the new behaviour, nor to keep using the old one (without a warning). It seems at least an opt-in is likely necessary for internal usage; if we do not provide it, we basically assume that it is only end users who (possibly indirectly) run into these cases?

Finally, we have to figure out how to cram it into our code :). ndmin may be another target for cramming in flags controlling the ragged behaviour at least.

There is no reversion of the deprecation currently pending for master. I don't think it should be reverted wholesale as it was in 1.18 because that also removed the fixes, which I think we want to keep. @mattip A more targeted reversion would be appreciated until we decide what to do in the long term.

I don't see a problem with a full revert and then reintroduce whatever parts make sense now. Again, reverting something is not a value judgement about what is good or bad, it's just a pragmatic way to unbreak a bunch of stuff we just broke by pushing the merge button. There's clearly impact and unsolved issues that were not foreseen in the NEP, so reverting first is the right thing to do.

An argument for not reverting yet - while the change is in master, we can leverage downstream CI runs to try and work out what their workarounds would look like

Downstream CI is red, that's _very_ unhelpful. We now have their list of failures, we don't need to keep their CI's red to make our life a little easier here.

And at least Matplotlib's CI is running against pip install --pre not master branch

And at least Matplotlib's CI is running against pip install --pre not master branch

That's pulling from the nightly wheels it looks like. The change was already reverted for 1.18.0rc1, so you shouldn't see it if you would be installing with --pre from PyPI.

Some of the above comments amount to rethinking the proposed changes in NEP 34. I'm not sure if this thread is the appropriate place to continue this discussion, but here goes. (No harm if it should be discussed elsewhere--copying and pasting comments is easy. :smile: Also, some of you have seen a variation of these comments in a discussion on slack.)

After thinking about this recently, I ended up with the same idea as @timhoffm's first suggestion (and the idea has probably been proposed at other times in the last few months): define a specific string or singleton object that, when given as the dtype argument to array, allows the function to handle ragged-shaped input by creating a 1-d object array. In effect, this enables the pre-NEP-34 behavior of dtype=None in which ragged-shaped input is automatically converted to an object array. If any other value for dtype is given (including None or object), a deprecation warning is given if the input is ragged-shaped. In a future version of NumPy that warning will be converted to an error.

I think it is clear now that using dtype=object to enable the handling of ragged-shaped input is not a good solution to the problem. Ideally, we would decouple the notions of "object array" from "ragged array". But we can't completely decouple them, because when we want to handle a ragged array, the only choice we have is to create an object array. On the other hand, sometimes we want an object array, but we don't want the automatic conversion of ragged-shaped input to an object array of sequences.

For example (cf. item 1 in @seberg's last comment), suppose f1, f2, f3 and f4 are Fraction objects, and I am working with object arrays of Fractions. I'm not interested in creating a ragged array. If I accidentally write a = np.array([f1, f2, [f3, f4]], dtype=object), I _want_ that to generate an error, for all the reasons that we have NEP 34. With NEP 34, however, that will create a 1-d array of length 3.
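
For concreteness (values arbitrary), the accident looks like this under NEP 34 semantics:

```python
from fractions import Fraction
import numpy as np

f1, f2, f3, f4 = (Fraction(1, n) for n in (2, 3, 4, 5))
a = np.array([f1, f2, [f3, f4]], dtype=object)
print(a.shape)  # (3,) -- the accidental raggedness is silently absorbed
```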

Alternatives that add a new keyword argument, such as @timhoffm's second suggestion, seem more complicated than necessary. The problem that we're trying to solve is the "foot gun" where ragged input is automatically converted to a 1-d object array. The problem only arises when dtype=None is passed to array. Requiring users to replace dtype=None with dtype=<special-value-that-enables-ragged-handling> to maintain the old troublesome behavior is a simple change to the API that is easy to explain. Do we really need any more than that?

I think it is clear now that using dtype=object to enable the handling of ragged-shaped input is not a good solution to the problem. Ideally, we would decouple the notions of "object array" from "ragged array".

Sounds reasonable, maybe. It's also good to point out that there is no real "ragged array" concept in NumPy. It's something we basically don't support (search for "ragged" in the docs, on the issue tracker or mailing list to confirm if you want), it's something that DyND and XND support, and we only started talking about it to have a concise phrase for discussing "we want to remove the np.array([1, [2, 3]]) behavior that trips up users". Hence baking in "ragged arrays" as a new API thing should be done with extreme caution; it's absolutely not something we want to promote. So it would be good to make that clear in the naming of whatever dtype=some_workaround we may add.

It seems general opinion is coalescing around a solution of extending the deprecation (maybe indefinitely) by allowing np.array(vals, dtype=special), which will behave like before NEP 34. I prefer a singleton rather than a string, since it means library users can do special = getattr(np, 'special', None) and their code will work across versions.

Now we need to decide upon the name and where it should be exposed. Perhaps never_fail or guess_dimensions? As for where to expose it, I would prefer not to hang it off np but rather off some other internal module, maybe with a _ to indicate it is really a private interface.

I think the path forward is to amend NEP 34, then expose the discussion on the mailing list.

Note that there have been a couple of reports also of problems with using operators (== and operator.mod at least). Are you proposing to ignore that, or to somehow store that state on the array?

In almost all cases it is probably known that one of the operands is a numpy array. So it should probably be possible to get well defined behaviour by manually converting to a numpy array.

Could someone point to the operator.mod example?

As for the == operator, the one I saw was doing something like np.array(vals, dtype=object) == vals where vals=[1, [2, 3]] (paraphrasing the code), so the solution is to proactively create the array on the right side.

Many of the scipy failures seem to be of the form np.array([0.25, np.array([0.3])]), where mixing scalars and ndarray with shape==(1,) will fall afoul of the dimension discovery and create an object array. xref gh-15075
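
For example (a hedged sketch of that failure mode and one way a test might avoid it):

```python
import numpy as np

# Mixing a scalar with a shape-(1,) array trips dimension discovery, so this
# becomes an object array (and warns under NEP 34):
bad = np.array([0.25, np.array([0.3])])

# Keeping the elements uniformly scalar restores a plain float64 array:
good = np.array([0.25, np.array([0.3]).item()])
```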

Could someone point to the operator.mod example?

Saw that in @jbrockmendel's Pandas PR, but I think it has since changed (I don't see an explicit operator.mod in the comments anymore).

As for the == operator, the one I saw was doing something like np.array(vals, dtype=object) == vals where vals=[1, [2, 3]] (paraphrasing the code), so the solution is to proactively create the array on the right side.

At that point it becomes np.array(vals, dtype=object) == np.array(vals, dtype=object), so better just delete the test :)

@mattip wrote:

I prefer a singleton rather than a string, since it means library users can do special = getattr(np, 'special', None) and their code will work across versions.

That sounds OK to me.

Now we need to decide upon the name and where it should be exposed. Perhaps never_fail or guess_dimensions? As for where to expose it, I would prefer not to hang it off np but rather off some other internal module, maybe with a _ to indicate it is really a private interface.

My current working name for this is legacy_auto_dtype, but there are probably many other names that I would have no complaints about.

I'm not sure the name should be private. By any practical definition of _private_ and _public_, this will be a _public_ object. It provides users the means to preserve the legacy behavior of, for example, array(data) by rewriting that as array(data, dtype=legacy_auto_dtype). I imagine the updated NEP will explain that this is how code should be modified to maintain the legacy behavior (for those who must do so). If that is the case, the object is definitely not private. In fact, it seems it is a public object that will remain in NumPy indefinitely. But perhaps my understanding of how the modified NEP 34 will play out is wrong.

Agreed with @WarrenWeckesser's description of public/private; either it's public, or it shouldn't be used by anyone outside of NumPy.

Re name: please pick a name that describes the functionality. Things like "legacy" are almost never a good idea.

please pick a name that describes the functionality.

auto_object, auto_dtype, auto ?

Thinking out loud for a bit...

What does this object do?

Currently, when NumPy is given a Python object that contains subsequences whose lengths are not consistent with a regular n-d array, NumPy will create an array with object data type, with the objects at the first level where the shape inconsistency occurs left as Python objects. For example, array([[1, 2], [1, 2, 3]]) has shape (2,), np.array([[1, 2], [3, [99]]]) has shape (2, 2), etc. With NEP 34, we are deprecating that behavior, so attempting to create an array with "ragged" input will eventually result in an error, unless it is explicitly enabled. The special value that we're talking about enables the old behavior.
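
Written out with dtype=object (so they run without the warning), those two examples give:

```python
import numpy as np

# The shape inconsistency is at the first level, so the result is 1-d:
np.array([[1, 2], [1, 2, 3]], dtype=object).shape  # (2,)

# Here the first two levels are consistent, so the result is 2-d and the
# nested list [99] is kept as a Python object:
np.array([[1, 2], [3, [99]]], dtype=object).shape  # (2, 2)
```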

What is a good name for that? ragged_as_object? inconsistent_shapes_as_object?

At that point it becomes np.array(vals, dtype=object) == np.array(vals, dtype=object), so better just delete the test:)

Well, I was paraphrasing. The actual test is more like my_func(vals) == vals should become my_func(vals) == np.array(vals, dtype=object)

I will propose an extension to NEP 34 to allow a special value for dtype.

Note that it seems scipy does not need this sentinel to pass tests with scipy/scipy#11310 and scipy/scipy#11308

gh-15119 was merged, which re-implemented the NEP. If it is not reverted, we can close this issue

I am going to close this, since we did not follow up on it before the 1.19 release. I at least hope the reason for this is that the discussion has died down because all major projects were able to find reasonable solutions to the problems it created.
Please correct me if I am wrong, especially if this is still prone to issues with pandas, matplotlib, etc. But I assume we would have heard of that during the 1.19.x release candidate cycle.

