Design: Document why NaN bits are not fully deterministic

Created on 22 Mar 2016  ·  15Comments  ·  Source: WebAssembly/design

(I'm not currently advocating; this is so we make an informed decision and collect material for a rationale.)

Looking at Nondeterminism.md, the NaN bits parts stand out. Of course threads can race, resources can be exhausted, and features can be added; those are consequences of the overall design. But NaN bits? Could VMs be a little more clever and take care of this? There are two issues:

When an operation gets more than one NaN operand, which does it propagate?

IEEE 754 leaves this unspecified, but at least x86, ARM, and Power all pick the "first" operand, and it's a reasonable choice.

However, fixing the choice at the wasm level would mean that wasm implementations can't commute floating-point adds and multiplies, which is sometimes a useful optimization on pre-VEX x86 where the instructions clobber one of their inputs. Also, it would require anyone using an LLVM-based VM to teach it that add and multiply are not commutative on wasm.

One could reasonably argue these aren't show-stoppers, so this issue is theoretically fixable if there's a strong desire.

When an operation produces a NaN and has no NaN operands, what is the sign bit?

x86 uses 1, ARM uses 0.

The simplest way to fix this would be to canonicalize after every floating-point operation. This is doable, though a trickiness is that both 0 and 1 are possible valid values when there is a NaN operand to propagate, so it wouldn't be enough to just check for a NaN result and canonicalize; one would have to check for a NaN result and for a lack of NaN operands, and only then canonicalize.

Canonicalizing after _every_ operation would be very expensive; another option is to just canonicalize at "escape" points of computations (the canonicalizing path would then have to essentially replay a whole stream of computation in order to determine the correct NaN output). This is an improvement, but likely still adds a significant amount of overhead.

Another possible implementation would be to unmask the invalid exception, take a trap whenever a NaN is generated, and then perform the canonicalization and return. Downsides include using a non-default CPU mode, and being very slow if there are a lot of invalids being generated.

Unfortunately, these approaches carry significant downsides. Unless other ideas surface, or there is a very strong desire, this seems difficult to fix.


One other thing to note is that NaN bits are difficult to accidentally observe, so this isn't a major portability concern in general.

Does anyone else have any thoughts to add?

clarification floating point

Most helpful comment

I'd like to quantify the performance effects of this before making a decision. I think our toolchains are still too immature to make good performance measurements at the moment (there are longer poles in the way of this one). Put another way: I want to avoid "death by a thousand cuts".

All 15 comments

One other thing to note is that NaN bits are difficult to accidentally observe

Is there any way to make them impossible to observe or at least the sign bit?

IMO the NaN nondeterminisms should be left alone, given that it is so rare for software to care about them. What's the case for sacrificing performance to make them deterministic?

Here's a list of the ways one can observe NaN bits:

  • reinterpret conversion
  • store to linear memory and load the bits with a different interpretation (or let the bits be observed externally)
  • pass an argument to a call of an imported function, or return a value from an exported function
  • copysign the sign bit onto a non-NaN

VMs could insert canonicalizing code before each of these; that's the "canonicalize at escape points" idea discussed above. store is very common on hot paths, so this would likely still be fairly expensive.

@qwertie I very much do care about them and I have a usecase that rests on them being deterministic. But as disccussed here there are ways to make it fully determisitic even if it doesn't get in the spec.

@qwertie Benefits might include slightly greater portability (there does exist code in the world that unwisely uses the IEEE 754 totalOrder function, for example), slightly greater reproducibility, and stronger invariants when doing symmetric computations across multiple nodes.

store is very common on hot paths, so this would likely still be fairly expensive.

For this case could we just always specify the sign value? so like just make it 1 if it gets placed in mem?

@wanderer That's essentially what canonicalization entails: check to see if the value is a NaN, and if so, apply some correction. It's typically an extra compare and branch on the hot path.

I should also add, I haven't benchmarked any of the options mentioned in this issue; any benchmarking that anyone can add here would be welcome.

I would strongly lean towards the current level of non-determinism, which favors performance.
I was actually somewhat surprised that wasm rigorously defines the NaN's mantissa bits.
While this may suffice for today's processors, who knows what future processors may come along.

If the wasm is being generated from a high level language, that translator could provide an option to control the level of FP semantics, similar to -ffast-math for example. The translator could then insert any extra wasm fixup needed to coerce nans to a desired format. Towards that end we could provide isNan or normalizeNan operators, though I'm not advocating for this now.

In short, "fixing" this is best left to a higher level tool, IMO.

+1 @mbodart. Edit: oh look there's actually a +1 thingie now :). Btw, I personally think Wasm = world domination and that therefore future processors won't antagonize it.

I'd like to quantify the performance effects of this before making a decision. I think our toolchains are still too immature to make good performance measurements at the moment (there are longer poles in the way of this one). Put another way: I want to avoid "death by a thousand cuts".

I agree with @jfbastien. Performance effects need to be quantified first before tinkering with the semantics.

To be clear, I'm not currently advocating for a change here; I'm collecting rationale material. NaN bit nondeterminism sticks out, and wants explanation. And as far as I'm aware, no one has quantified the performance effects of making these bits nondeterministic either.

ARMv8 does not always propagate the first operand when both operands are NaN. The rule is:

  • If one operand is a quiet NaN and the other is a signaling NaN, propagate the signaling NaN.
  • Otherwise, propagate the first operand.

As I am reading the spec, this applies to both the Aarch32 and Aarch64 modes of ARMv8.

This behavior is different from SSE which propagates the first operand in both cases.

Both architectures will convert an sNaN to a qNaN by setting the quiet bit before propagating it.

@stoklund Good spot! I missed that ARM picking the first NaN doesn't happen in the case where the second NaN is signaling. That would complicate the strategy I layed out for the multiple NaN case above, so we can mention that in the rationale.

I've now created #973 to propose specific text summarizing the above.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

nikhedonia picture nikhedonia  ·  7Comments

dpw picture dpw  ·  3Comments

ghost picture ghost  ·  7Comments

beriberikix picture beriberikix  ·  7Comments

jfbastien picture jfbastien  ·  6Comments