Numpy: meaningless RuntimeWarning by clang-compiled np.float32.__mul__

Created on 27 Apr 2017  ·  53 Comments  ·  Source: numpy/numpy

In Sagemath we encounter in our ticket #22799

RuntimeWarning: invalid value encountered in multiply

while multiplying a numpy.float32 number with non-numpy data; i.e. numpy ought to fail this multiplication silently (returning NotImplemented), and indeed it does if one builds with gcc, or if one uses np.float or np.float128 instead of np.float32.

More precisely, one gets the warning from the Python call

type(numpy.float32('1.5')).__mul__(numpy.float32('1.5'), x)

where x is a Sagemath univariate polynomial with coefficients in Sagemath's RealField type (and only this particular type of data triggers this).
That is, such meaningless warnings can potentially be emitted outside of Sagemath, too; we can reproduce it on OS X 10.12 with its stock cc (a derivative of clang 3.8), as well as on Linux with clang 4.0 and on FreeBSD 11.0 with clang 4.0 or clang 3.7.

We should be able to produce a reproduction outside of Sagemath, although we'd need some tips on where in the numpy code this __mul__ is actually implemented, to see which functions are applied to x...

We see this on numpy 1.11 and on 1.12, too.

All 53 comments

same picture with numpy.float32('1.5').__mul__(x), as well as with __add__ and __sub__.

This type of error is typical of arrays that contain nan or infinity. What does np.array(x) return?

@eric-wieser : it returns array(x, dtype=object), no warnings.

Does np.multiply(np.float32('1.5'), x) give the same warning?

numpy.multiply(numpy.float32('1.5'), x) gives the same warning.

What about type(x).__rmul__(numpy.float32('1.5'), x)?

Also, if you could run warnings.filterwarnings('error'), then you'd get a full stack trace
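Something like this at the sage prompt should do it (a sketch; x is the polynomial from the report):

import numpy
import warnings
warnings.filterwarnings('error', category=RuntimeWarning)
# re-running the offending call now raises, with a full traceback
type(numpy.float32('1.5')).__mul__(numpy.float32('1.5'), x)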

type(x).__rmul__(numpy.float32('1.5'), x)

TypeError: descriptor '__rmul__' requires a 'sage.structure.element.Element' object but received a 'numpy.float32'

x.__rmul__(numpy.float32('1.5')) goes through just fine.

Seems I forgot how __rmul__ works. I meant type(x).__rmul__(x, numpy.float32('1.5')), but I imagine that does the same thing as x.__rmul__, unless x is really weird

Does this also fail? np.multiply(np.array(1.5, dtype=object), x) (this time with filterwarnings, please)

type(x).__rmul__(x, numpy.float32('1.5')) goes through without a warning.

And, by the way, setting warnings.filterwarnings('error') does not give me anything interesting:

---------------------------------------------------------------------------
RuntimeWarning                            Traceback (most recent call last)
<ipython-input-50-b3ece847d318> in <module>()

RuntimeWarning: invalid value encountered in multiply

And np.multiply(np.array(1.5, dtype=object), x) fails the same way (note that sage preparses 1.5 into RealNumber('1.5')):

sage: np.multiply(np.array(RealNumber('1.5'), dtype=object), x)
---------------------------------------------------------------------------
RuntimeWarning                            Traceback (most recent call last)
<ipython-input-52-706823a0b5a2> in <module>()
----> 1  np.multiply(np.array(RealNumber('1.5'), dtype=object), x)

RuntimeWarning: invalid value encountered in multiply

Hmm, sage did something I didn't expect there. Same behaviour with float('1.5') though, I'd guess?

Ok, so here's what I think is happening:

  • numpy is correctly using the object loop in the ufunc, which ends up just calling PyNumber_Multiply
  • Within sage, something is setting the error flag in the FPU (bug in sage?)
  • numpy is doing its normal FPU flag check on exit from the ufunc (bug in object loops?), and finding the error left there by sage
sage: float(1.5).__mul__(x)
NotImplemented
sage: np.float(1.5).__mul__(x)
NotImplemented
sage: np.float32(1.5).__mul__(x)
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x
sage: np.float64(1.5).__mul__(x)
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x

Worth noting that np.float is float for compatibility reasons

why doesn't np.float32(1.5).__mul__(x) return NotImplemented?

Because it knows that it can handle it as np.multiply with an object loop, and then tries again with float * x inside that loop. Unfortunately, the wrapper around that loop is picking up FPU flags set by sage.

If you look closely, you'll find that x.__rmul__ is still being called deeper in the stack

sage: np.float32(1.5)*x
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x
sage: np.float128(1.5)*x
1.50000000000000*x
sage: np.float64(1.5)*x
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x

so it looks as if np.float128 is OK, but

sage: np.float128(1.5).__mul__(x)
/usr/home/dima/Sage/sage/src/bin/sage-ipython:1: RuntimeWarning: invalid value encountered in multiply
  #!/usr/bin/env python
1.50000000000000*x

This is weird. A compiler bug, perhaps (we never see it with gcc), but where exactly?

The FPU flag is for some reason being set on clang but not gcc within the sage code, it would seem. Numpy is to blame for making noise about it, but I highly doubt it is to blame for setting it in the first place.

Unfortunately, numpy doesn't expose any way to explicitly ask for the FPU flags - that'd be very helpful for bisecting the issue within sage.
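One can read them from Python with ctypes, though: a sketch, assuming a Unix-like platform where the C99 fenv functions are reachable through libm, and that FE_INVALID has its usual x86/x86-64 value.

import ctypes, ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library('m'))
FE_INVALID = 0x01  # x86/x86-64 value; check your platform's <fenv.h>

def invalid_flag_set():
    # C99 fetestexcept: nonzero iff the given status flag is currently raised
    return bool(libm.fetestexcept(FE_INVALID))

def clear_invalid_flag():
    # C99 feclearexcept: clears the given status flag
    libm.feclearexcept(FE_INVALID)

Clearing the flag, running a suspect chunk of sage code, and testing it again would narrow down where it gets set.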

I assume this causes the same warning (needs numpy 1.12 to do so, I think)

mul = np.vectorize(x.__rmul__)
mul(float('1.5'))

not quite the same, but close:

/usr/home/dima/Sage/sage/local/lib/python2.7/site-packages/numpy/lib/function_base.py:2652: RuntimeWarning: invalid value encountered in __rmul__ (vectorized)
  outputs = ufunc(*inputs)
array(1.50000000000000*x, dtype=object)

OK, good, this seems to indicate that there is nothing specific to np.float32 or np.float64; it's a more generic numpy warning mechanism kicking in here.

I don't know that the compiler authors would consider it a bug. The way that warning works is: there are some magic status flags that the processor keeps track of, which automatically get set whenever the corresponding event occurs. Numpy clears them before starting the computation, and then checks them again at the end. So somewhere in between those points, the assembly generated by clang is doing some calculation that involves a NaN. But it's hard to track down (since the actual flag setting is done entirely in hardware), and most of the time people don't worry about how their code affects the FPU flags. (Libm implementations are also notoriously inconsistent about whether they set these flags.) And the exact results depend a lot on the exact asm being generated, so it's not surprising that you only see it in some configurations and not others.
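(As an aside: the check is governed by numpy's ordinary error state, so if you just want the noise suppressed while the real culprit is tracked down, something like this should work, with x as above:)

import numpy as np

with np.errstate(invalid='ignore'):
    result = np.float32('1.5') * x  # no RuntimeWarning inside this block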

Yep, that confirms my suspicions, and provides you with a way to debug. This code

import functools
import numpy as np

def check_fpu(f):
    @functools.wraps(f)
    def wrapped(*args, **kwargs):
        # exclude every argument from broadcasting, so f is called with them as-is
        excluded = list(range(len(args))) + list(kwargs.keys())
        fvec = np.vectorize(f, excluded=excluded)
        return fvec(*args, **kwargs)
    return wrapped

Applied to a python function, this allows you to isolate the warnings to within that chunk of code.

I mean, it is pretty weird that it's happening at all; compilers don't usually invent and then throw away NaNs for no reason.

If you're trying to track it down, then you should probably look at the code in sage that implements multiplication for those polynomials – it's likely that the weird flag setting is happening all the time, and numpy's only involvement is to make that visible.

There's also a pretty good argument that numpy shouldn't even try to check these flags on object loops. (Or integer loops for that matter, but that's tricky because the way we report integer overflow is kinda gross and uses the fpu flags.) That's the only thing I can think of that numpy could do here.

check_fpu() as first posted had a typo in the np.vectorize call (exclude vs. excluded; corrected above).
And, we're on python2: import functools32 as functools.

functools.wraps doesn't need python 3, does it?

I get an error if I use python2's functools, in the setattr() call:

AttributeError: 'method-wrapper' object has no attribute '__module__'

Yeah, I'm guessing it's whatever multi-precision library implements the arithmetic for the coefficients in RealField that is setting the FPU flag. Are the underlying libraries compiled with the same compiler as numpy in each of the different circumstances? Or is only numpy being rebuilt with the different compilers?

Yeah, I'm guessing it's whatever multi-precision library implements the arithmetic for the coefficients in RealField

That's MPFR for the record.

We are trying to port Sagemath to clang+gfortran (mostly on OSX and FreeBSD, platforms where clang is the primary compiler), so that building and running it on OSX is easier and faster (FreeBSD is more of a tool to get a similar environment without the hassle of OSX and Apple hardware).

All the comparisons I report here are for complete builds with clang/clang++ and gfortran, as opposed to gcc/g++ and gfortran.

the wrapper seems to tell us that x.__rmul__ sets the FPU flag:

check_fpu(x.__rmul__)(np.float32('1.5'))

prints the warning, while x.__rmul__(np.float32('1.5')) does not.

Indeed - my assumption was that x.__rmul__ was written in python, and that its source code can be bisected to find which bit specifically sets the flag

x.__rmul__ is in Cython, but still it's a small piece of code to investigate.

If there would be a simple way to change the warning to an error, you would get a traceback (Cython generates tracebacks for errors but not for warnings).

@jdemeyer IMHO the numpy warning is issued much later in the code path, i.e. it's the result of an explicit check of the FPU flags, not of a trap being raised.

numpy does provide an interface to change this warning into an error, but then all you get is a return to the main IPython interpreter loop, without any backtrace whatsoever.

@jdemeyer would Cython code between sig_on() / sig_off() from cysignals throw an exception if an FPU flag is raised? Or does it depend on the flag?

cysignals would throw an exception if SIGFPE is raised, which may happen when an FPU flag is raised, depending on the FPU configuration. But by default it does not.

A similar warning, RuntimeWarning: invalid value encountered in greater, comes from np.float64(5)>e. Here e is the Sagemath constant for the base of the natural logarithm, 2.71828..., so on the way to evaluating the comparison to True it has to be "converted" to a number (e "knows" its numerical approximation, e.n()).
This approximation is of the RealField type already mentioned above (so perhaps this warning is closely related).

Again, the question is: what does numpy do to evaluate np.float64(5)>e?
Or equivalently, the same warning pops up from np.float64(5).__gt__(e), so one can just as well start from there.

Note that type(e) is sage.symbolic.constants_c.E; it's basically some (almost) dummy class
wrapping Sagemath's symbolic expressions (SR).

There are no warnings from np.float64(5).__gt__(e.n()) or np.float64(5)>e.n().
Essentially the same pattern (same warning/no warning) occurs if you replace e by pi (with, obviously, pi.n() == 3.1415...).
pi has type SR, i.e. sage.symbolic.expression.Expression.

The answer is the same here - numpy calls np.greater with an object loop. At the bottom level, that calls e.__lt__(5.0). But once again, it checks the FPU flags before and after, and notices that something is amiss.

Most of the ndarray arithmetic/logical operators (with the exception of - and divmod) delegate to ufuncs. When invoked with sage objects, this will call the O (object) loops for these ufuncs. These object loops will loop over the array (which in this case is 0d), run the normal python operator on the elements, _but check for FPU flags when it does so_.
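In rough Python pseudocode (a sketch of the control flow only; clear_fpu_flags and fpu_flags_set are hypothetical stand-ins for numpy's internal C helpers), the 0d object loop for multiply behaves like:

import warnings

def object_multiply_loop(a, b):
    clear_fpu_flags()              # numpy clears the hardware status flags on entry
    result = a * b                 # the O loop calls PyNumber_Multiply on the elements
    if fpu_flags_set():            # ... and re-checks the flags on exit
        warnings.warn("invalid value encountered in multiply", RuntimeWarning)
    return result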

So once again, sage is setting these flags. Perhaps this is a sign of a bug, perhaps it's not.

I think there's a good argument here that numpy should not be checking fpu flags for these cases. @njsmith, do you think we should go ahead with removing the check for object types?

As a matter of fact, e.__lt__(5.0) is a symbolic expression:

sage: type(e.__lt__(np.float32(5.0)))
<type 'sage.symbolic.expression.Expression'>
sage: e.__lt__(np.float32(5.0))
e < 5.0
sage: bool(e.__lt__(np.float32(5.0)))  # this is how it's evaluated
True

and thus I really doubt that it is called in the end, for one does get True. Also, your check_fpu wrapper from above does not make it print warnings, i.e. the following just works.

sage: check_fpu(e.__lt__)(np.float32(5.0))
e < 5.0

I was able to pin down our problem in Sagemath to a particular C extension using the fpectl Python module (which is somewhat, but not totally, broken on FreeBSD). It was actually very quick once I managed to get fpectl installed.

IMHO fpectl is so useful that it ought to be fixed; perhaps even used in numpy instead of, or in addition to, np.seterr(), as it provides better granularity on compiled components.

The difference between fpectl's approach and np.seterr is:

np.seterr runs the ufunc loop, and then checks to see if any flags are set.

fpectl does some magic so that whenever an operation occurs that causes one of the flags to be set, the hardware raises an interrupt, the kernel converts this into a SIGFPE delivered to the process, and an installed SIGFPE handler longjmps directly out of the signal handler into the error-handling code.
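(For reference, the Python-level interface of fpectl is tiny; a sketch, assuming a Python 2 build configured with the optional --with-fpectl:)

import fpectl

fpectl.turnon_sigfpe()    # FPU exceptions now trap and surface as FloatingPointError
try:
    run_suspect_code()    # hypothetical stand-in for the code under test
except FloatingPointError:
    pass                  # the traceback points at the operation that trapped
fpectl.turnoff_sigfpe()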

Some downsides of the fpectl approach are: (a) it doesn't work at all on Windows, (b) it breaks for code that internally causes one of these flags to be temporarily set and then clears it (this is legal and I expect there are libm's that do it), (c) longjmp is incredibly fragile; basically you're risking a segfault every time you do it. It certainly can't be a general solution for arbitrary user-defined ufuncs.

Given all this, I don't think numpy is going to switch.

In any case, it sounds like the original issue is solved, so closing this – feel free to open a new issue if you want to make a case for changes in seterr.

In any case, it sounds like the original issue is solved

Are we sure we don't want to disable checking of the FPU flags for object loops? That would seem like a pretty sensible change to numpy.

@eric-wieser: oh, that's an interesting idea, yeah. Maybe it's worth opening an issue for that :-). The "right thing" is pretty complicated though – ideally we shouldn't be special-casing the object dtype (think user dtypes), and integer loops shouldn't use it either (skipping the check may be a real optimization on architectures where checking/clearing the FPU flags is extremely slow), but integer loops do need a way to explicitly signal integer errors, which they currently do by explicitly setting the FPU flags, ... I'm not sure this is a case where there's easy low-hanging fruit?
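(For instance, scalar integer overflow already flows through the same seterr machinery; a quick illustration:)

import numpy as np

np.seterr(over='warn')         # 'warn' is the default for scalar operations anyway
np.int16(32767) + np.int16(1)  # RuntimeWarning: overflow encountered (exact wording varies by version)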

Or did I misunderstand, and sage has only identified the problem, and they still need a numpy change to actually fix it?

@njsmith: I do not understand why you say it won't work on Windows (this would have been correct in the pre-C99 era, though). Modern FPU-handling functions (fenv) are available as soon as your C compiler is C99-compliant. Apart from fenv, all it needs is setjmp/longjmp (again, a standard C feature).

I am also curious to hear about a libm that raises one of the FE exceptions in the course of a normal operation
(unless that is classified as a bug).

@dimpase: You also need SIGFPE support, which is not specified in C99. (Well, C99 says that there should be a SIGFPE, but that's for divide-by-zero – it doesn't specify any way to hook it up to floating-point exceptions.) That said, it looks like I misremembered: though Windows doesn't support signals, MSVCRT emulates SIGFPE using structured exception handling, and provides the non-standard _controlfp function to enable it for particular FP exceptions, so Windows support is not actually a barrier. OTOH it doesn't matter that much, since longjmp is definitely not happening without an extremely good reason :-)

And FWIW, if a libm caused an FE exception and then cleared it again, I can't see why they would consider that a bug. I'm not sure that any such implementations exist, but it's plausible, and if they do then the way we would find out is b/c someone tells us that numpy is broken on platform X and the only fix would be to revert the change you suggested.

Can you answer the question I asked at the end of my previous comment?

@njsmith: if a libm (or any other user code) needs to cause an FE exception and process it, it would set up its own FE exception handler, saving the previous one and restoring it upon exit.
So it is not a problem if the underlying code plays by the rules.

Regarding MS support for this, they have shipped fenv.h since Visual C++ 2013 or so.
It is specifically meant to be usable with setjmp/longjmp (how exactly it is done under the hood should not worry one too much, I hope).

Regarding numpy's RuntimeWarning:

  • if you think that user code can play fast and loose with FP flags, then these warnings should be optional at most.
  • otherwise, they can stay on by default, but at least a way to get to the bottom of where they come from is crucial for them to be useful; (an improved) fpectl is a quick way to achieve that. Using external tools (such as debuggers that let you instrument the code to check something after every instruction) is harder, slower, and more error-prone; e.g. it is not unusual that a bug only shows up in an optimised binary and goes away as soon as you try to find it in a well-debuggable one.
  • in any case, it ought to be possible to turn the RuntimeWarning off.

Regarding this issue in Sage - still fixing (hopefully it's limited to some issues in MPFR only).

Sorry, this is going in circles and I need to move on to other things, so unless something new comes up on the fenv/sigfpe issue this will be my last message on the topic. (I'm still interested in if there's anything numpy needs to do for the sage bug).

if a libm (or any other user code) needs to cause an FE exception and process it, it would set up its own FE exception handler, saving the previous one,

What you're proposing is to take an operation that normally does not cause a signal handler to fire, and configure the processor in a non-standard mode where it does cause a signal handler to fire. It's totally reasonable for code to be performing this operation and expecting that it won't trigger a signal handler at all.

Regarding MS support for this, they ship fenv.h since Visual C(++) 2013 or so.
It is specifically meant to be useful with setjmp/longjmp

I can't figure out what you're talking about here. Afaict, the standard functionality in fenv.h is only useful for implementing numpy-style functionality, and MS sticks to the standard. I don't see any functions in there that could be used with setjmp/longjmp at all.

user code can play fast and loose with FP flags, then these warnings should be optional at most.

Carefully clearing a flag set by an intermediate calculation is the exact opposite of playing fast and loose with them. Also, the warnings are optional.

a way to get to the bottom of where they come from is crucial for them to be useful; (an improved) fpectl is a quick way to achieve that.

You're literally the first person in something like a decade to need SIGFPE to debug this kind of issue, and looking again at the sage bug comments, it looks like you didn't actually get fpectl working? It's not supposed to cause a core dump. (It looks like cysignals is overriding the fpectl code so it doesn't even run.)

If this comes up again, what you need to do is make one C call to enable SIGFPE, then use a debugger to get a stack trace. You don't need a debug build to get a stack trace; all you need to do is not strip the symbols. And hey, now we know in case this does come up again.
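(On glibc that one C call is feenableexcept, and it can even be issued from the running process via ctypes; a Linux/glibc-specific sketch, with the usual x86 value of FE_INVALID assumed:)

import ctypes

FE_INVALID = 0x01                # x86/x86-64 value from <fenv.h>
libm = ctypes.CDLL('libm.so.6')  # feenableexcept is a glibc extension
libm.feenableexcept(FE_INVALID)  # the next invalid operation now delivers SIGFPE
# now attach gdb to the process and run the suspect code;
# it stops at the faulting instruction, symbols intact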

I understand this was really frustrating to debug, but it's not helpful to insist that other projects need to change or maintain basic infrastructure when you can't even explain clearly what this will accomplish. (I actually have no idea how you think numpy changing something here would even help you find this kind of bug faster – the whole idea of longjmp is to destroy the more precise information you say you want.)

Finally, it turns out that it boils down to a long-standing bug in the clang C compiler. Basically, for a certain range of double values, an assignment like

unsigned long a;
double b;
...
a = b;

raises FE_INVALID (and sometimes FE_INEXACT); by the way, this also affects other float types. The great MPFR people (MPFR has to copy doubles into its arbitrary-precision floats, after all) provided a workaround, fixing this for sage.
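(A way to probe for this from Python: a sketch assuming an x86 Unix platform, numpy built with the suspect compiler, and the usual FE_INVALID value from <fenv.h>. 2.0**63 is representable in a uint64, so a correct conversion must leave FE_INVALID clear:)

import ctypes, ctypes.util
import numpy as np

libm = ctypes.CDLL(ctypes.util.find_library('m'))
FE_INVALID = 0x01                            # x86/x86-64 value; platform-specific

libm.feclearexcept(FE_INVALID)
np.float64(2.0**63).astype(np.uint64)        # double -> unsigned 64-bit conversion
print(bool(libm.fetestexcept(FE_INVALID)))   # True would point at the miscompiled cast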

Incidentally, a related and even longer-standing clang bug 8100 (open since 2010, with a dozen closed duplicates) says that there is no hope of using clang's fenv.h to make fpectl work properly at the moment. Clang is not fully C99-compliant in this respect, and is being really shy about it.

Not sure whether numpy wants to do anything about it; perhaps a remark that the RuntimeWarning might merely be due to a compiler bug (citing "clang bug 8100", a quite prominent example) would be helpful.

bug 8100 isn't relevant; that's about the C99 pragmas for disabling floating-point optimizations, and no mainstream compiler supports those. numpy seems to (mostly) work anyway :-)

The spirit of bug 8100 is that clang does not care about FP operations being compiled correctly; although a lawyer might disagree. :-)

OK, the already-mentioned bug 17686 is relevant for sure.
