Design: Please Support Arbitrary Labels and Gotos.

Created on 8 Sep 2016  ·  159 Comments  ·  Source: WebAssembly/design

I'd like to point out that I haven't been involved in the web assembly effort,
and I'm not maintaining any large or widely used compilers (just my own
toy-ish language, minor contributions to the QBE compiler backend, and an
internship on IBM's compiler team), but I ended up getting a bit ranty, and
was encouraged to share more widely.

So, while I'm a bit uncomfortable jumping in and suggesting major changes
to a project I haven't been working on... here goes:

My Complaints:

When I'm writing a compiler, the first thing I do with the high level
structure -- loops, if statements, and so on -- is validate it for semantics,
do type checking, and so on. The second thing I do with it is throw it out
and flatten to basic blocks, and possibly to SSA form. In some other parts
of the compiler world, a popular format is continuation passing style. I'm not
an expert on compiling with continuation passing style, but it doesn't seem to
be a good fit for the loops and scoped blocks that web assembly seems to have
embraced either.

I'd like to argue that a flatter, goto based format would be far more useful as
a target for compiler developers, and would not significantly hinder the
writing of a usable polyfill.

Personally, I'm also not a big fan of nested complex expressions. They're a bit
clunkier to consume, especially if inner nodes can have side effects, but I
don't strongly object to them as a compiler implementer -- the web assembly
JIT can consume them, and I can ignore them and generate the instructions that
map to my IR. They don't make me want to flip tables.

The bigger problem comes down to loops, blocks, and other syntactic elements
that, as an optimizing compiler writer, you try very hard to represent as a
graph, with branches representing edges; the explicit control flow constructs
are a hindrance. Reconstructing them from the graph once you've actually done
the optimizations you want is certainly possible, but it's quite a bit of
complexity to work around a more complex format. And that annoys me: Both the
producer and the consumer are working around entirely invented problems
which would be avoided by simply dropping complex control flow constructs
from web assembly.

In addition, the insistence on higher level constructs leads to some
pathological cases. For example, Duff's Device ends up with horrible web
assembly output, as seen by messing around in the Wasm Explorer.
However, the inverse is not true: everything that can be expressed
in web assembly can be trivially converted to an equivalent in some
unstructured, goto based format.
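
For reference, this is the classic shape of Duff's Device (the usual
formulation, adapted here to copy between plain buffers rather than into a
device register): the switch jumps directly into the middle of the do-while
body, an entry edge that structured loop/block constructs cannot express
directly, which is why the structured output comes out so badly.

```c
#include <stddef.h>

/* Duff's Device: copy n bytes (n > 0), unrolled by 8. The switch
 * branches into the middle of the do-while body, an edge with no
 * direct equivalent in structured loop/block control flow. */
void duff_copy(char *to, const char *from, size_t n)
{
    size_t iterations = (n + 7) / 8;
    switch (n % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--iterations > 0);
    }
}
```

Feeding code of this shape through a C-to-wasm toolchain is an easy way to
reproduce the pathological output described above.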

So, at the very least, I'd like to suggest that the web assembly team add
support for arbitrary labels and gotos. If they choose to keep the higher
level constructs, it would be a bit of wasteful complexity, but at least
compiler writers like me would be able to ignore them and generate output
directly.

Polyfilling:

One of the concerns I have heard when discussing this is that the loop
and block based structure allows for easier polyfilling of web assembly.
While this isn't entirely false, I think that a simple polyfill solution
for labels and gotos is possible. While it might not be quite as optimal,
I think that it's worth a little bit of ugliness in the bytecode in order
to avoid starting a new tool with built in technical debt.

If we assume an LLVM (or QBE) like syntax for web assembly, then some code
that looks like:

int f(int x) {
    if (x == 42)
        return 123;
    else
        return 666;
}

might compile to:

 func @f(%x : i32) {
    %1 = test %x 42
    jmp %1 L0 L1

 L0:
    %r =i 123
    jmp LRet
 L1:
    %r =i 666
    jmp LRet
 LRet:
    ret %r
 }

This could be polyfilled to Javascript that looks like:

function f(x) {
    var __label = L0;
    var __ret;

    while (__label != LRet) {
        switch (__label) {
        case L0:
            var _v1 = (x == 42);
            if (_v1) { __label = L1; } else { __label = L2; }
            break;
        case L1:
            __ret = 123;
            __label = LRet;
            break;
        case L2:
            __ret = 666;
            __label = LRet;
            break;
        default:
            assert(false);
            break;
        }
    }
    return __ret;
}

Is it ugly? Yeah. Does it matter? Hopefully, if web assembly takes off,
not for long.

And if not:

Well, if I ever got around to targeting web assembly, I guess I'd generate code
using the approach I mentioned in the polyfill, and do my best to ignore all of
the high level constructs, hoping that the compilers would be smart enough to
catch on to this pattern.

But it would be nice if we didn't need to have both sides of the code generation
work around the format specified.

control flow

Most helpful comment

The upcoming Go 1.11 release will have experimental support for WebAssembly. This will include full support for all of Go's features, including goroutines, channels, etc. However, the performance of the generated WebAssembly is currently not that good.

This is mainly because of the missing goto instruction. Without the goto instruction we had to resort to using a toplevel loop and jump table in every function. Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function. The relooper cannot help with this; only a goto instruction can.

It is awesome that WebAssembly got to the point where it can support a language like Go. But to be truly the assembly of the web, WebAssembly should be equally powerful as other assembly languages. Go has an advanced compiler which is able to emit very efficient assembly for a number of other platforms. This is why I would like to argue that it is mainly a limitation of WebAssembly and not of the Go compiler that it is not possible to also use this compiler to emit efficient assembly for the web.

All 159 comments

@oridb Wasm is somewhat optimized for the consumer to be able to quickly convert to SSA form, and the structure does help here for common code patterns, so the structure is not necessarily a burden for the consumer. I disagree with your assertion that 'both sides of the code generation work around the format specified'. Wasm is very much about a slim and fast consumer, and if you have some proposals to make it slimmer and faster then that might be constructive.

Blocks that can be ordered into a DAG can be expressed in the wasm blocks and branches, such as your example. The switch-loop is the style used when necessary, and perhaps consumers might do some jump threading to help here. Perhaps have a look at binaryen which might do much of the work for your compiler backend.

There have been other requests for more general CFG support, and some other approaches using loops mentioned, but perhaps the focus is elsewhere at present.

I don't think there are any plans to support 'continuation passing style' explicitly in the encoding, but there has been mention of blocks and loops popping arguments (just like a lambda) and supporting multiple values (multiple lambda arguments) and adding a pick operator to make it easier to reference definitions (the lambda arguments).

the structure does help here for common code patterns

I'm not seeing any common code pattern that is easier to represent in the restricted loops and blocks subset that web assembly enforces than in terms of branches to arbitrary labels. I could see a minor benefit if there were an attempt to make the code closely resemble the input code for certain classes of language, but that doesn't seem to be a goal -- and the constructs seem a bit too bare for that purpose anyway.

Blocks that can be ordered into a DAG can be expressed in the wasm blocks and branches, such as your example.

Yes, they can be. However, I'd strongly prefer not to add extra work to determine which ones can be represented this way, versus which ones need extra work. Realistically, I'd skip doing the extra analysis, and always just generate the switch loop form.

Again, my argument isn't that loops and blocks make things impossible; it's that everything they can do is simpler and easier for a machine to write with goto, goto_if, and arbitrary, unstructured labels.

Perhaps have a look at binaryen which might do much of the work for your compiler backend.

I already have a serviceable backend that I'm fairly happy with, and plans to fully bootstrap the entire compiler in my own language. I'd rather not add in a rather large extra dependency simply to work around the enforced use of loops/blocks. If I simply use switch loops, emitting the code is pretty trivial. If I try to actually use the features present in web assembly effectively, instead of doing my damnedest to pretend they don't exist, it becomes a good deal more unpleasant.

There have been other requests for more general CFG support, and some other approaches using loops mentioned, but perhaps the focus is elsewhere at present.

I'm still not convinced that loops have any benefits -- anything that can be represented with a loop can be represented with a goto and label, and there are fast and well known conversions to SSA from flat instruction lists.

As far as CPS goes, I don't think that there needs to be explicit support -- it's popular in FP circles because it's fairly easy to convert to assembly directly, and gives similar benefits to SSA in terms of reasoning (http://mlton.org/pipermail/mlton/2003-January/023054.html); again, I'm not an expert on it, but from what I remember, the invocation of a continuation gets lowered to a label, a few movs, and a goto.
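
As a hand-written illustration of that lowering (a sketch, not any particular compiler's output): a CPS-style tail call to the loop continuation becomes a couple of parameter moves and a goto.

```c
/* CPS-ish sum of 0..n, hand-lowered: the tail call to the loop
 * continuation k(acc + i, i + 1) becomes parameter moves plus a goto. */
int sum_to(int n)
{
    int acc = 0;
    int i = 0;
loop: /* the "continuation" entry point */
    if (i > n)
        return acc;
    /* invoke the continuation: mov, mov, goto */
    acc = acc + i;
    i = i + 1;
    goto loop;
}
```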

@oridb 'there are fast and well known conversions to SSA from flat instruction lists'

Would be interesting to know how they compare with wasm SSA decoders; that is the important question.

Wasm makes use of a values stack at present, and some of the benefits of that would be gone without the structure; it would hurt decoder performance. Without the values stack, the SSA decoding would have more work to do too. I've tried a register based encoding, and decoding was slower (not sure how significant that is).

Would you keep the values stack, or use a register based design? If keeping the values stack then perhaps it becomes a CIL clone, and perhaps wasm performance could be compared to CIL. Has anyone actually checked this?

Would you keep the values stack, or use a register based design?

I don't actually have any strong feelings on that end. I'd imagine compactness of the encoding would be one of the biggest concerns; A register design may not fare that well there -- or it may turn out to compress fantastically over gzip. I don't actually know off the top of my head.

Performance is another concern, although I suspect that it might be less important given the ability to cache binary output, plus the fact that download time may outweigh the decoding by orders of magnitude.

Would be interesting to know how they compare with wasm SSA decoders; that is the important question.

If you're decoding to SSA, that implies that you'd also be doing a reasonable amount of optimization. I'd be curious to benchmark how significant decoding performance is in the first place. But, yes, that's definitely a good question.

Thanks for your questions and concerns.

It's worth noting that many of the designers and implementors of
WebAssembly have backgrounds in high performance, industrial JITs, not only
for JavaScript (V8, SpiderMonkey, Chakra, and JavaScriptCore), but also in
LLVM and other compilers. I personally have implemented two JITs for Java
bytecode and I can attest that a stack machine with unrestricted gotos
introduces quite some complexity in decoding, verifying, and constructing a
compiler IR. In fact, there are many patterns that can be expressed in Java
bytecode that will cause high-performance JITs, including both C1 and C2 in
HotSpot, to simply give up and relegate the code to only running in the
interpreter. In contrast, constructing a compiler IR from something like an
AST from JavaScript or another language is something I've also done. The
extra structure of an AST makes some of this work far simpler.

The design of WebAssembly's control flow constructs simplifies consumers by
enabling fast, simple verification, easy, one pass conversion to SSA form
(even a graph IR), effective single-pass JITs, and (with postorder and the
stack machine) relatively simple in-place interpretation. Structured
control makes irreducible control flow graphs impossible, which eliminates
a whole class of nasty corner cases for decoders and compilers. It also
nicely sets the stage for exception handling in WASM bytecode, for which V8
is already developing a prototype in concert with the production
implementation.

We've had a lot of internal discussion between members about this very
topic, since, for a bytecode, it's one thing that is most different from
other machine-level targets. However, it's not any different than targeting
a source language like JavaScript (which many compilers do these days) and
requires only minor reorganization of blocks to achieve structure. There
are known algorithms to do this, and tools. We'd like to provide some
better guidance for those producers that start with an arbitrary CFG to
communicate this better. For languages targeting WASM directly from an AST
(which is actually something V8 does now for asm.js code--directly
translating a JavaScript AST to WASM bytecode), there is no restructuring
step necessary. We expect this to be the case for many language tools
across the spectrum that don't have sophisticated IRs inside.

On Thu, Sep 8, 2016 at 9:53 AM, Ori Bernstein [email protected]
wrote:

Would you keep the values stack, or use a register based design?

I don't actually have any strong feelings on that end. I'd imagine
compactness of the encoding would be one of the biggest concerns; As you
mentioned, performance is another.

Would be interesting to know how they compare with wasm SSA decoders, that
is the important question?

If you're decoding to SSA, that implies that you'd also be doing a
reasonable amount of optimization. I'd be curious to benchmark how
significant decoding performance is in the first place. But, yes, that's
definitely a good question.



Thanks @titzer, I was developing a suspicion that Wasm's structure had a purpose beyond just similarity to asm.js. I wonder though: Java bytecode (and CIL) don't model CFGs or the value stack directly; they have to be inferred by the JIT. But in Wasm (especially if block signatures are added) the JIT can easily figure out what's going on with the value stack and control flow. So if CFGs (or irreducible control flow specifically) were modeled explicitly, like loops and blocks are, might that avoid most of the nasty corner cases you're thinking of?

There's this neat optimization that interpreters use that relies on irreducible control flow to improve branch prediction...

@oridb

I'd like to argue that a flatter, goto based format would be far more useful as
a target for compiler developers

I agree that gotos are very useful for many compilers. That's why tools like Binaryen let you generate arbitrary CFGs with gotos, and they can convert that very quickly and efficiently into WebAssembly for you.

It might help to think of WebAssembly as a thing optimized for browsers to consume (as @titzer pointed out). Most compilers should probably not generate WebAssembly directly, but rather use a tool like Binaryen, so that they can emit gotos, get a bunch of optimizations for free, and don't need to think about low-level binary format details of WebAssembly (instead you emit an IR using a simple API).

Regarding polyfilling with the while-switch pattern you mention: in emscripten we started out that way before we developed the "relooper" method of recreating loops. The while-switch pattern is around 4x slower on average (but in some cases significantly less or more, e.g. small loops are more sensitive). I agree with you that in theory jump-threading optimizations could speed that up, but performance will be less predictable as some VMs will do it better than others. It is also significantly larger in terms of code size.
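
To make the overhead concrete, here's one small loop written both ways in C (an illustrative sketch with invented names, not emscripten output): the natural form, and the while-switch form that drives control flow through a label variable and so pays a dispatch on every step.

```c
/* Natural loop form. */
int sum_natural(int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += i;
    return acc;
}

/* The same loop in while-switch form: control flow is driven by a
 * label variable, adding a switch dispatch on every step. */
int sum_switchloop(int n)
{
    enum { HEADER, BODY, EXIT } label = HEADER;
    int acc = 0, i = 0;
    for (;;) {
        switch (label) {
        case HEADER:
            label = (i < n) ? BODY : EXIT;
            break;
        case BODY:
            acc += i;
            i++;
            label = HEADER;
            break;
        case EXIT:
            return acc;
        }
    }
}
```

Both compute the same result; the difference is purely in how much work the consumer must do (or guess at, via jump threading) to recover the branch structure.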

It might help to think of WebAssembly as a thing optimized for browsers to consume (as @titzer pointed out). Most compilers should probably not generate WebAssembly directly, but rather use a tool like Binaryen...

I'm still not convinced that this aspect is going to matter that much - again, I suspect the cost of fetching the bytecode would dominate the delay the user sees, with the second biggest cost being the optimizations done, and not the parsing and validation. I'm also assuming/hoping that the bytecode would be tossed out, and the compiled output is what would be cached, making the compilation effectively a one-time cost.

But if you were optimizing for web browser consumption, why not simply define web assembly as SSA, which seems to me both more in line with what I'd expect, and less effort to 'convert' to SSA?

You can start to parse and compile while downloading, and some VMs might not do a full compile up front (they might just use a simple baseline for example). So download and compile times can be smaller than expected, and as a result parsing and validation can end up a significant factor in the total delay the user sees.

Regarding SSA representations, they tend to have large code sizes. SSA is great for optimizing code, but not for serializing code compactly.

@oridb See the comment by @titzer 'The design of WebAssembly's control flow constructs simplifies consumers by enabling fast, simple verification, easy, one pass conversion to SSA form ...' - it can generate _verified_ SSA in one pass. Even if wasm used SSA for the encoding it would still have the burden of verifying it, of computing the dominator structure which is easy with the wasm control flow restrictions.

Much of the encoding efficiency of wasm appears to come from being optimized for the common code pattern in which definitions have a single use that is used in stack order. I expect that an SSA encoding could do so too, so it could be of similar encoding efficiency. Operators such as if_else for diamond patterns also help a lot. But without the wasm structure it looks like all basic blocks would need to read definitions from registers and write results to registers, and that might not be so efficient. For example, I think wasm can do even better with a pick operator that could reference scoped stack values up the stack and across basic block boundaries.

I think wasm is not too far from being able to encode most code in SSA style. If definitions were passed up the scope tree as basic block outputs then it might be complete. The SSA encoding might be orthogonal to the CFG question: e.g. there could be an SSA encoding with the wasm CFG restrictions, or a register based VM with the CFG restrictions.

A goal for wasm is to move the optimization burden out of the runtime consumer. There is strong resistance to adding complexity in the runtime compiler, as it increases the attack surface. So much of the design challenge is to ask what can be done to simplify the runtime compiler without hurting performance, and there is much debate!

Well, it's probably too late now, but I'd like to question the idea that the relooper algorithm, or variants thereof, can produce "good enough" results in all cases. They clearly can in most cases, since most source code doesn't contain irreducible control flow to start with, optimizations don't usually make things too hairy, and if they do, e.g. as part of merging duplicate blocks, they can probably be taught not to. But what about pathological cases? For example, what if you have a coroutine which a compiler has transformed to a regular function with structure like this pseudo-C:

void transformed_coroutine(struct autogenerated_context_struct *ctx) {
    int arg1, arg2; // function args
    int var1, var2, var3, …; // all vars used by the function
    switch (ctx->current_label) { // restore state
    case 0:
        // initial state, load function args caller supplied and proceed to start
        arg1 = ctx->arg1;
        arg2 = ctx->arg2;
        break;
    case 1: 
        // restore all vars which are live at label 1, then jump there
        var2 = ctx->var2; 
        var3 = ctx->var3;
        goto resume_1;
    [more cases…]
    }

    [main body goes here...]
    [somewhere deep in nested control flow:]
        // originally a yield/await/etc.
        ctx->var2 = var2;
        ctx->var3 = var3;
        ctx->current_label = 1;
        return;
        resume_1:
        // continue on
}

So you have mostly normal control flow, but with some gotos pointed at the middle of it. This is roughly how LLVM coroutines work.
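
A minimal compilable version of this pattern (a hand-rolled counter coroutine in the spirit of protothreads, not actual LLVM coroutine output) shows the essential shape: a switch restores the suspended state, and a goto resumes execution in the middle of the loop.

```c
struct counter_ctx {
    int label; /* 0 = fresh, 1 = suspended at resume_1 */
    int i;
    int limit;
};

/* Each call resumes where the previous one left off and "yields"
 * the next value, or -1 when finished. */
int counter_next(struct counter_ctx *ctx)
{
    switch (ctx->label) {
    case 0:
        break;          /* fresh start */
    case 1:
        goto resume_1;  /* jump back into the loop body */
    }

    for (ctx->i = 0; ctx->i < ctx->limit; ctx->i++) {
        /* originally a yield */
        ctx->label = 1;
        return ctx->i;
resume_1:
        ; /* continue the loop after resuming */
    }
    ctx->label = 0;
    return -1;
}
```

Even this toy has a goto aimed at the middle of a loop, which is exactly the edge the structured constructs cannot express.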

I don't think there's any nice way to reloop something like that, if the 'normal' control flow is complex enough. (Could be wrong.) Either you duplicate massive parts of the function, potentially needing a separate copy for every yield point, or you turn the whole thing into a giant switch, which according to @kripken is 4x slower than relooper on typical code (which itself is probably somewhat slower than not needing relooper at all).

The VM could reduce the overhead of a giant switch with jump threading optimizations, but surely it's more expensive for the VM to perform those optimizations, essentially guessing how the code reduces to gotos, than to just accept explicit gotos. As @kripken says, it's also less predictable.

Maybe doing that kind of transformation is a bad idea to start with, since afterward nothing dominates anything so SSA-based optimizations can't do much… maybe it's better done at the assembly level, maybe wasm should eventually get native coroutine support instead? But the compiler can perform most optimizations before doing the transformation, and it seems that at least the designers of LLVM coroutines didn't see an urgent need to delay the transformation until code generation. On the other hand, since there's a fair amount of variety in the exact semantics people want from coroutines (e.g. duplication of suspended coroutines, ability to inspect 'stack frames' for GC), when it comes to designing a portable bytecode (rather than a compiler), it's more flexible to properly support already-transformed code than to have the VM do the transformation.

Anyway, coroutines are just one example. Another example I can think of is implementing a VM-within-a-VM. While a more common feature of JITs is side exits, which don't require goto, there are situations that call for side entries - again, requiring goto into the middle of loops and such. Another would be optimized interpreters: not that interpreters targeting wasm can really match those targeting native code, which at minimum can improve performance with computed gotos, and can dip into assembly for more… but part of the motivation for computed gotos is to better leverage the branch predictor by giving each case its own jump instruction, so you might be able to replicate some of the effect by having a separate switch after each opcode handler, where the cases would all just be gotos. Or at least have an if or two to check for specific instructions that commonly come after the current one. There are some special cases of that pattern that might be representable with structured control flow, but not the general case. And so on…
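
As a concrete sketch of the threaded-interpreter pattern (this uses GNU C's computed-goto extension, `&&label`, so it is gcc/clang only; the opcode set is invented for illustration): each handler ends with its own indirect jump, giving the branch predictor one prediction site per handler instead of one shared switch.

```c
/* A toy bytecode interpreter using computed gotos (GNU C extension).
 * Opcodes: 0 = PUSH immediate, 1 = ADD top two, 2 = HALT. */
int run_toy_vm(const int *code)
{
    static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
    int stack[64];
    int sp = 0;
    const int *pc = code;

    goto *dispatch[*pc++];

op_push:
    stack[sp++] = *pc++;
    goto *dispatch[*pc++];   /* per-handler indirect jump */
op_add:
    sp--;
    stack[sp - 1] += stack[sp];
    goto *dispatch[*pc++];
op_halt:
    return stack[sp - 1];
}
```

The per-handler `goto *dispatch[...]` is the part with no structured equivalent; relooped into a single switch-loop, all handlers would share one indirect branch again.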

Surely there's some way to allow arbitrary control flow without making the VM do a lot of work. Straw man idea, might be broken: you could have a scheme where jumps to child scopes are allowed, but only if the number of scopes you have to enter is less than a limit defined by the target block. The limit would default to 0 (no jumps from parent scopes), which preserves the current semantics, and a block's limit can't be greater than the parent block's limit + 1 (easy to check). And the VM would change its dominance heuristic from "X dominates Y if it is a parent of Y" to "X dominates Y if it is a parent of Y with distance greater than Y's child jump limit". (This is a conservative approximation, not guaranteed to represent the exact dominator set, but the same is true for the existing heuristic - it's possible for an inner block to dominate the bottom half of an outer one.) Since only code with irreducible control flow would need to specify a limit, it wouldn't increase code size in the common case.

Edit: Interestingly, that would basically make the block structure into a representation of the dominance tree. I guess it would be much simpler to express that directly: a tree of basic blocks, where a block is allowed to jump to a sibling, ancestor, or immediate child block, but not to a further descendant. I'm not sure how that best maps onto the existing scope structure, where a "block" can consist of multiple basic blocks with sub-loops in between.

FWIW: Wasm has a particular design, which is explained in just a few very significant words "except that the nesting restriction makes it impossible to branch into the middle of a loop from outside the loop".

If it were just a DAG then validation could just check that branches were forward, but with loops this would allow branching into the middle of the loop from outside the loop, hence the nested block design.

The CFG is only part of this design, the other being data flow, and there is a stack of values and blocks can also be organized to unwind the values stack which can very usefully communicate the live range to the consumer which saves work converting to SSA.

It is possible to extend wasm to be an SSA encoding (add pick, allow blocks to return multiple values, and have loop entries pop values), so interestingly the constraints demanded for efficient SSA decoding might not be necessary (because it could already be SSA encoded)! This leads to a functional language (which might have a stack style encoding for efficiency).

If this were extended to handle arbitrary CFGs, then it might look like the following. This is an SSA style encoding, so values are constants. It seems to still fit the stack style to a large extent; I'm just not certain of all the details. Within blocks, branches could be made to any other labeled block in that set, or some other convention could be used to transfer control to another block. The code within a block might still usefully reference values higher up the values stack to save passing them all in.

(func f1 (arg1)
  (let ((c1 10)) ; Some values up the stack.
    (blocks ((b1 (a1 a2 a3)
                   ... (br b3)
               (br b2 (+ a1 a2 a3 arg1 c1)))
             (b2 (a1)
                 ... (br b1 ...))
             (b3 ()
                 ...))
   .. regular structured wasm ..
   (br b2 ...)
   ....
   (br b3)
    ...
   ))

But would web browsers ever handle this efficiently internally?

Would someone with a stack machine background recognize the code pattern and be able to match it to a stack encoding?

There is some interesting discussion on irreducible loops here http://bboissin.appspot.com/static/upload/bboissin-thesis-2010-09-22.pdf

I did not follow it all on a quick pass, but it mentions converting irreducible loops to reducible loops by adding an entry node. For wasm it sounds like adding a defined input to loops that is specifically for dispatching within the loop, similar to the current solution but with a defined variable for this. The thesis mentions that this is virtualized (optimized away) during processing. Perhaps something like this could be an option?

If this is on the horizon, and given that producers already need to use a similar technique but using a local variable, then might it be worth considering now so that wasm produced early has potential to run faster on more advanced runtimes? This might also create an incentive for competition between the runtimes to explore this.

This would not exactly be arbitrary labels and gotos but something that these might be transformed into that has some chance of being efficiently compiled in future.

For the record, I am strongly with @oridb and @comex on this issue.
I think this is a critical issue that should be addressed before it is too late.

Given the nature of WebAssembly, any mistakes you make now are likely to stick for decades to come (look at Javascript!). That's why the issue is so critical; avoid supporting gotos now for whatever reason it is (e.g. to ease optimization, which is --- quite frankly --- a specific implementation's influence over a generic thing, and honestly, I think it's lazy), and you'll end up with problems in the long run.

I can already see future (or current, but in the future) WebAssembly implementations trying to special-case recognize the usual while/switch patterns to implement labels in order to handle them properly. That's a hack.

WebAssembly is clean slate, so now is the time to avoid dirty hacks (or rather, the requirements for them).

@darkuranium :

WebAssembly as currently specified is already shipping in browsers and toolchains, and developers have already created code which takes the form laid out in that design. We therefore cannot change the design in a breaking manner.

We can, however, add to the design in a backward-compatible manner. I don't think any of those involved think goto is useless. I suspect we all regularly use goto, and not just in syntactic toy manners.

At this point in time, someone with motivation needs to come up with a proposal which makes sense and implement it. I don't see such a proposal being rejected if it provides solid data.

Given the nature of WebAssembly, any mistakes you make now are likely to stick for decades to come (look at Javascript!). That's why the issue is so critical; avoid supporting gotos now for whatever reason it is (e.g. to ease optimization, which is --- quite frankly --- a specific implementation's influence over a generic thing, and honestly, I think it's lazy), and you'll end up with problems in the long run.

So I'll call your bluff: I think having the motivation you show, and not coming up with a proposal and implementation as I detail above, is quite frankly lazy.

I'm being cheeky of course. Consider that we've got folks banging on our doors for threads, GC, SIMD, etc—all making passionate and sensible arguments for why their feature is most important—it would be great if you could help us tackle one of these issues. There are folks doing so for the other features I mention. None for goto thus far. Please acquaint yourself with this group's contributing guidelines and join the fun.

Otherwise I think goto is a great future feature. Personally I'd probably tackle others first, such as JIT code generation. That's my personal interest after GC and threads.

Hi. I am in middle of writing a translation from webassembly to IR and back to webassembly, and I've had a discussion about this subject with people.

It has been pointed out to me that irreducible control flow is tricky to represent in webassembly. It can prove troublesome for optimizing compilers that occasionally emit irreducible control flow. This might be something like the loop below, which has multiple entry points:

if (x) goto inside_loop;
// banana
while(y) {
    // things
    inside_loop:
    // do things
}

EBB compilers would produce the following:

entry:
    cjump x, inside_loop
    // banana
    jump loop

loop:
    cjump y, exit
    // things
    jump inside_loop

inside_loop:
    // do things
    jump loop
exit:
    return

Next we get to translating this to webassembly. The problem is that although we have decompilers figured out ages ago, they always had the option of adding the goto into irreducible flows.

Before it gets translated, the compiler is going to do tricks on this. But eventually you have to scan through the code and position the beginnings and endings of the structures. You end up with the following candidates after you eliminate the fall-through jumps:

<inside_loop, if(x)>
    // banana
<loop °>
<exit if(y)>
    // things
</inside_loop, if(x)>
    // do things
</loop ↑>
</exit>

Next you need to build a stack out of these. Which one goes to the bottom? It is either 'inside_loop' or 'loop'. We can't decide, so we have to cut the stack and copy things around:

if
    // do things
else
    // banana
end
loop
  br out
    // things
    // do things
end

Now we can translate this to WebAssembly. Pardon me, I'm not yet familiar with how these loop constructs work.

This is not a particular problem if we think about old software; it is likely that new software will be translated to WebAssembly with this in mind. The problem is in how our compilers work: they have been doing control flow with basic blocks for _decades_ and assume anything goes.

Technically the language is translated in, then translated out. We only need a mechanism that lets values flow across the boundaries neatly, without drama. The structured flow is only useful for people intending to read the code.

But for example, the following would work just as fine:

    cjump x, label(1)
    // banana
0: label
    cjump y, label(2)
    // things
1: label
    // do things
    jump label(0)
2: label
    // exit as usual, picking the values from the top of the stack.

The numbers would be implicit; that is, when the compiler sees a 'label', it knows that it starts a new extended block and gives it a new index number, incrementing from 0.

To produce a static stack, you could track how many items are on the stack when you encounter a jump to a label. If the stack ends up inconsistent after a jump to the label, the program is invalid.

If you find the above bad, you can also try adding an explicit stack length to each label (perhaps as a delta from the last indexed label's stack size, if the absolute value is bad for compression), and a marker on each jump saying how many values it copies in from the top of the stack during the jump.
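The single-pass depth-consistency check described above can be sketched as follows. This is a minimal illustration, not from any real wasm tooling: `validate_jump`, the fixed `NLABELS` bound, and the depth encoding are all invented for this example.

```c
/* Sketch: single-pass validation of stack depths at label jumps.
   All names and bounds here are hypothetical, for illustration only. */
#include <assert.h>

#define NLABELS 8
#define UNSET  -1

static int expected_depth[NLABELS];

/* Reset per-function state before scanning a function body. */
void validate_begin(void) {
    for (int i = 0; i < NLABELS; i++)
        expected_depth[i] = UNSET;
}

/* Called for each jump to `label` seen with `depth` values on the stack.
   Returns 1 if consistent with every earlier jump to the same label. */
int validate_jump(int label, int depth) {
    if (expected_depth[label] == UNSET) {
        expected_depth[label] = depth;  /* first jump fixes the depth */
        return 1;
    }
    return expected_depth[label] == depth;
}
```

The first jump to a label fixes its expected depth; any later jump arriving with a different depth makes the program invalid, all in a single forward pass.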

I could bet that you can't outsmart gzip in any way through how you represent the control flow, so you could choose the representation that's nice for the people who have the hardest work here. (I can illustrate the 'outsmarting the gzip' point with my flexible compiler toolchain if you like; just send me a message and let's put up a demo!)

I feel like a shatterhead right now. I just re-read the WebAssembly spec and picked up that irreducible control flow is intentionally left out of the MVP, perhaps because emscripten had to solve this problem in its early days.

The solution for handling irreducible control flow in WebAssembly is explained in the paper "Emscripten: An LLVM-to-JavaScript Compiler". The relooper reorganizes the program into something like this:

_b_ = bool(x)
_b_ == 0 if
  // banana
end
block loop
  _b_ if
    // do things
    _b_ = 0
  else
    y br_if 2
    // things
    _b_ = 1
  end
  br 0
end end

The rationale was that structured control flow helps with reading source code dumps, and I guess it is believed to help polyfill implementations.

People compiling from WebAssembly will probably adapt to handle and separate the collapsed control flow.

So:

  • As mentioned, WebAssembly is now stable, so the time is past for any total rewrite of how control flow is expressed.

    • In one sense, that's unfortunate, because nobody actually tested whether a more directly SSA-based encoding could have achieved the same compactness as the current design.

    • However, when it comes to spec-ing out goto, that makes the job much easier! The block-based instructions are already beyond bikeshedding, and it's not a big deal to expect production compilers targeting wasm to express reducible control flow using them - the algorithm is not that hard. The main problem is that a small fraction of control flow cannot be expressed using them without a performance cost. If we solve that by adding a new goto instruction, we don't have to worry nearly as much about encoding efficiency as we would with a total redesign. Code using goto should still be reasonably compact, of course, but it doesn't have to compete with other constructs for compactness; it's only for irreducible control flow and should be used rarely.

  • Reducibility is not particularly useful.

    • Most compiler backends use an SSA representation based on a graph of basic blocks and branches between them. The nested loop structure, the thing that reducibility guarantees, is pretty much thrown away at the start.

    • I checked the current WebAssembly implementations in JavaScriptCore, V8, and SpiderMonkey, and they all seem to follow this pattern. (V8 is more complicated - some kind of "sea of nodes" representation rather than basic blocks - but also throws away the nesting structure.)

    • Exception: Loop analysis can be useful, and all three of those implementations pass through information to the IR about which basic blocks are the starts of loops. (Compare to LLVM which, as a 'heavyweight' backend designed for AOT compilation, throws it away and recalculates it in the backend. This is more robust, since it can find things that don't look like loops in the source code but do after a bunch of optimizations, but slower.)

    • Loop analysis works on "natural loops", which forbid branches into the middle of the loop that don't go through the loop header.

    • WebAssembly should continue to guarantee that loop blocks are natural loops.

    • But loop analysis doesn't require that the whole function be reducible, nor even the inside of the loop: it just forbids branches from outside to inside. The base representation is still an arbitrary control flow graph.

    • Irreducible control flow does make it harder to compile WebAssembly to JavaScript (polyfilling), as the compiler would have to run the relooper algorithm itself.

    • But WebAssembly already makes multiple decisions that add significant runtime overhead to any compile-to-JS approach (including unaligned memory access support and trapping on out-of-bounds accesses), suggesting that it's not considered very important.

    • Compared to that, making the compiler a little more complex is not a big deal.

    • Therefore, I don't think there's a good reason not to add some kind of support for irreducible control flow.

  • The main information needed to build an SSA representation (which, by design, should be possible in one pass) is the dominator tree.

    • Currently a backend can estimate dominance based on structured control flow. If I understand the spec correctly, the following instructions end a basic block:

    • block:

      • The BB starting the block is dominated by the previous BB.

      • The BB following the corresponding end is dominated by the BB starting the block, but not by the BB before end (because it'll be skipped if there was a br out).

    • loop:

      • The BB starting the block is dominated by the previous BB.

      • The BB after end is dominated by the BB before end (since you can't get to the instruction after end except by executing end).

    • if:

      • The if side, the else side, and the BB after end are all dominated by the BB before if.

    • br, return, unreachable:

      • (The BB immediately after br, return, or unreachable is unreachable.)

    • br_if, br_table:

      • The BB before br_if/br_table dominates the one after it.

    • Notably, this is only an estimate. It can't produce false positives (saying A dominates B when it actually doesn't), because it only says so when there's no way to get to B without going through A, by construction. But it can produce false negatives (saying A doesn't dominate B when it actually does), and I don't think a single-pass algorithm can detect those (could be wrong).

    • Example false negative:

      ```
      block $outer
        loop
          br $outer ;; since this unconditionally breaks, it secretly dominates the end BB
        end
      end
      ```

    • But that's OK, AFAIK.

      • False positives would be bad, because e.g. if basic block A is said to dominate basic block B, the machine code for B can use a register set in A (if nothing in between overwrites that register). If A doesn't actually dominate B, the register might have a garbage value.

      • False negatives are essentially ghost branches that never occur. The compiler assumes that those branches could occur, but not that they must, so the generated code is just more conservative than necessary.



    • Anyway, think about how a goto instruction should work in terms of the dominator tree. Suppose that A dominates B, which dominates C.

    • We can't jump from A to C because that would skip B (violating the assumption of dominance). In other words, we can't jump to non-immediate descendants. (And on the binary producer's end, if they calculated the true dominator tree, there will never be such a jump.)

    • We could safely jump from A to B, but goto'ing to an immediate descendant is not that useful. It's basically equivalent to an if or switch statement, which we can do already (using the if instruction if there's only a binary test, or br_table if there are multiple).

    • Also safe, and more interesting, is jumping to a sibling or an ancestor's sibling. If we jump to our sibling, we preserve the guarantee that our parent dominates our sibling, because we must have already executed our parent to get here (since it dominates us too). Similarly for ancestors.

    • In general, a malicious binary could produce false negatives in dominance this way, but as I said, those are (a) already possible and (b) acceptable.

  • Based on that, here's a strawman proposal:

    • One new block-type instruction:
    • labels resulttype N instr* end
    • There must be exactly N immediate children instructions, where "immediate child" means either a block-type instruction (loop, block, or labels) and everything up to the corresponding end, or a single non-block instruction (which must not affect the stack).
    • Instead of creating a single label like other block-type instructions, labels creates N+1 labels: N pointing to the N children, and one pointing to the end of the labels block. In each of the children, label indices 0 to N-1 refer to the children, in order, and label index N refers to the end.

    In other words, if you have

        loop ;; outer
          labels 3
            block ;; child 0
              br X
            end
            nop ;; child 1
            nop ;; child 2
          end
        end

    Depending on X, the br refers to:

    | X | Target |
    | ---------- | ------ |
    | 0 | end of the block |
    | 1 | child 0 (beginning of the block) |
    | 2 | child 1 (nop) |
    | 3 | child 2 (nop) |
    | 4 | end of labels |
    | 5 | beginning of outer loop |

    • Execution starts at the first child.

    • If execution reaches the end of one of the children, it continues to the next. If it reaches the end of the last child, it goes back to the first child. (This is for symmetry, because the order of the children is not meant to be significant.)

    • Branching to one of the children unwinds the operand stack to its depth at the start of labels.

    • So does branching to the end, but if resulttype is nonempty, branching to the end pops an operand and pushes it after unwinding, similar to block.

    • Dominance: The basic block before the labels instruction dominates each of the children, as well as the BB after the end of labels. The children don't dominate each other or the end.

    • Design notes:

    • N is specified up front so that the code can be validated in one pass. It would be weird to have to get to the end of the labels block, to know the number of children, before knowing the targets of the indices in it.

    • Not sure if there should eventually be a way to pass values on the operand stack between labels, but by analogy with the inability to pass values into a block or loop, that can be unsupported to start with.

It would be really nice if it were possible to jump into a loop though, wouldn't it? IIUC, if that case were accounted for then the nasty loop+br_table combo would never be needed...

Edit: oh, you can make loops without loop by jumping upward in labels. Can't believe I missed that.

@qwertie If a given loop is not a natural loop, the wasm-targeting compiler should express it using labels instead of loop. It should never be necessary to add a switch to express control flow, if that's what you're referring to. (After all, at worst you could just use one giant labels block with a label for every basic block in the function. This doesn't let the compiler know about dominance and natural loops, so you may miss out on optimizations. But labels is only required in cases where those optimizations aren't applicable.)

The nested loop structure, the thing that reducibility guarantees, is pretty much thrown away at the start. [...] I checked the current WebAssembly implementations in JavaScriptCore, V8, and SpiderMonkey, and they all seem to follow this pattern.

Not quite: at least in SM, the IR graph is not a fully general graph; we assume certain graph invariants that follow from being generated from a structured source (JS or wasm) and often simplify and/or optimize the algorithms. Supporting a fully general CFG would either require auditing/changing many of the passes in the pipeline to not assume these invariants (either by generalizing them or pessimizing them in case of irreducibility) or node-splitting duplication up front to make the graph reducible. This is certainly doable, of course, but it's not true that this is simply a matter of wasm being an artificial bottleneck.

Also, the fact that there are many options and different engines will do different things suggests that having the producer deal with irreducibility up front will produce somewhat more predictable performance in the presence of irreducible control flow.

When we've discussed backwards-compatible paths for extending wasm with arbitrary goto support in the past, one big question is what's the use case here: is it "make producers simpler by not having to run a relooper-type algorithm" or is it "allow more efficient codegen for actually-irreducible control flow"? If it's just the former, then I think we probably would want some scheme of embedding arbitrary labels/gotos (that is both backwards compatible and also composes with future block-structured try/catch); it's just a matter of weighing cost/benefit and the issues mentioned above.

But for the latter use case, one thing we've observed is that, while you do every now and then see a Duff's device case in the wild (which isn't actually an efficient way to unroll a loop...), often where performance matters, irreducibility pops up in interpreter loops. Interpreter loops also benefit from indirect threading, which needs computed goto. Also, even in beefy offline compilers, interpreter loops tend to get the worst register allocation. Since interpreter loop performance can be pretty important, one question is whether what we really need is a control flow primitive that allows the engine to perform indirect threading and do decent regalloc. (This is an open question to me.)

@lukewagner
I'd like to hear more detail about which passes are depending on invariants. The design I proposed, using a separate construct for irreducible flow, should make it relatively easy for optimization passes like LICM to steer clear of that flow. But if there are other types of breakage I'm not thinking of, I'd like to understand their nature better so I can get a better idea of whether and how they can be avoided.

When we've discussed backwards-compatible paths for extending wasm with arbitrary goto support in the past, one big question is what's the use case here: is it "make producers simpler by not having to run a relooper-type algorithm" or is it "allow more efficient codegen for actually-irreducible control flow"?

For me it's the latter; my proposal expects producers to still run a relooper-type algorithm to save the backend the work of identifying dominators and natural loops, falling back to labels only when necessary. However, this would still make producers simpler. If irreducible control flow has a large penalty, an ideal producer should work very hard to avoid it, using heuristics to determine whether it's more efficient to duplicate code, the minimal amount of duplication that can work, etc. If the only penalty is potentially giving up loop optimizations, this isn't really necessary, or at least is no more necessary than it would be with a regular machine code backend (which has its own loop optimizations).

I really should gather more data on how common irreducible control flow is in practice…

However, my belief is that penalizing such flow is essentially arbitrary and unnecessary. In most cases, the effect on overall program runtime should be small. However, if a hotspot happens to include irreducible control flow, there will be a severe penalty; in the future, WebAssembly optimization guides might include this as a common gotcha, and explain how to identify and avoid it. If my belief is correct, this is an entirely unnecessary form of cognitive overhead for programmers. And even when the overhead is small, WebAssembly already has enough overhead compared to native code that it should seek to avoid any extra.

I'm open to persuasion that my belief is incorrect.

Since interpreter loop performance can be pretty important, one question is whether what we really need is a control flow primitive that allows the engine to perform indirect threading and do decent regalloc.

That sounds interesting, but I think it would be better to start with a more general-purpose primitive. After all, a primitive tailored for interpreters would still require backends to deal with irreducible control flow; if you're going to bite that bullet, may as well support the general case too.

Alternately, my proposal might already serve as a decent primitive for interpreters. If you combine labels with br_table, you get the ability to point a jump table directly at arbitrary points in the function, which is not that different from a computed goto. (As opposed to a C switch, which at least initially directs control flow to points within the switch block; if the cases are all gotos, the compiler should be able to optimize away the extra jump, but it might also coalesce multiple 'redundant' switch statements into one, ruining the benefit of having a separate jump after each instruction handler.) I'm not sure what the issue with register allocation is, though...

@comex I guess one could simply turn off whole optimization passes at the function level in the presence of irreducible control flow (although SSA generation, regalloc, and probably a few others would be needed and thus require work), but I was assuming we wanted to actually generate quality code for functions with irreducible control flow, and that involves auditing each algorithm that previously assumed a structured graph.

The nested loop structure, the thing that reducibility guarantees, is pretty much thrown away at the start. [...] I checked the current WebAssembly implementations in JavaScriptCore, V8, and SpiderMonkey, and they all seem to follow this pattern.

Not quite: at least in SM, the IR graph is not a fully general graph; we assume certain graph invariants that follow from being generated from a structured source (JS or wasm) and often simplify and/or optimize the algorithms.

Same in V8. It is actually one of my major gripes with SSA in both the respective literature and implementations that they almost never define what constitutes a "well-formed" CFG, but tend to implicitly assume various undocumented constraints anyway, usually ensured by construction by the language frontend. I bet that many/most optimisations in existing compilers would not be able to deal with truly arbitrary CFGs.

As @lukewagner says, the main use case for irreducible control flow probably is "threaded code" for optimised interpreters. It's hard to say how relevant those are for the Wasm domain, and whether its absence actually is the biggest bottleneck.

Having discussed irreducible control flow with a number of people researching compiler IRs, the "cleanest" solution probably would be to add the notion of mutually recursive blocks. That would happen to fit Wasm's control structure quite well.

Loop optimizations in LLVM will generally ignore irreducible control flow and not attempt to optimize it. The loop analysis they're based on will only recognize natural loops, so you just have to be aware that there can be CFG cycles that are not recognized as loops. Of course, other optimizations are more local in nature and work just fine with irreducible CFGs.

From memory, and probably wrong, SPEC2006 has a single irreducible loop in 401.bzip2 and that's it. It's quite rare in practice.

Clang will only emit a single indirectbr instruction in functions using computed goto. This has the effect of turning threaded interpreters into natural loops with the indirectbr block as a loop header. After leaving LLVM IR, the single indirectbr is tail-duplicated in the code generator to reconstruct the original tangle.
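The computed-goto dispatch that Clang lowers to a single indirectbr can be sketched as a tiny indirect-threaded interpreter. This relies on the GCC/Clang "labels as values" extension; the three-opcode instruction set below is invented purely for illustration.

```c
/* Minimal indirect-threaded interpreter using GCC/Clang computed goto
   ("labels as values"). The opcode set is made up for this sketch. */
#include <assert.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

int run(const int *code) {
    /* GCC permits a static dispatch table of label addresses. */
    static const void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
    int stack[16], *sp = stack;
    const int *pc = code;

    #define NEXT goto *dispatch[*pc++]
    NEXT;

op_push:
    *sp++ = *pc++;      /* operand follows the opcode */
    NEXT;
op_add:
    sp--;               /* pop two values, push their sum */
    sp[-1] += sp[0];
    NEXT;
op_halt:
    return sp[-1];
    #undef NEXT
}
```

With the program `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT}`, `run` returns 5. Each handler ends with its own indirect jump, which is exactly the tangle the code generator reconstructs by tail-duplicating the single indirectbr.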

There is no single-pass verification algorithm for irreducible control flow that I am aware of. The design choice for reducible control flow only was highly influenced by this requirement.

As mentioned earlier, irreducible control flow can be modeled in at least two different ways. A loop with a switch statement can actually be optimized into the original irreducible graph by a simple local jump-threading optimization (e.g. by folding the pattern where an assignment of a constant to a local variable occurs, then a branch to a conditional branch that immediately switches on that local variable).

So the irreducible control constructs are not necessary at all, and it is only a matter of a single compiler backend transformation to recover the original irreducible graph and optimize it (for engines whose compilers support irreducible control flow--which none of the 4 browsers do, to the best of my knowledge).

Best,
-Ben


I can also say further that if irreducible constructs were to be added to WebAssembly, they would not work in TurboFan (V8's optimizing JIT), so such functions would either end up being interpreted (extremely slow) or being compiled by a baseline compiler (somewhat slower), since we will likely not invest effort in upgrading TurboFan to support irreducible control flow. That means functions with irreducible control flow in WebAssembly would probably end up with much worse performance.

Of course, another option would be for the WebAssembly engine in V8 to run the relooper to feed TurboFan reducible graphs, but that would make compilation (and startup) worse. Relooping should remain an offline procedure in my opinion; otherwise we end up with inescapable engine costs.

Best,
-Ben


There are established methods for linear-time verification of irreducible control flow. A notable example is the JVM: with stackmaps, it has linear-time verification. WebAssembly already has block signatures on every block-like construct. With explicit type information at every point where multiple control flow paths merge, it's not necessary to use fixed-point algorithms.

(As an aside, a while ago I asked why one would disallow a hypothetical pick operator from reading outside its block at arbitrary depths. This is one answer: unless signatures are extended to describe everything a pick might read, type-checking the pick would require more information.)

The loop-with-a-switch pattern can of course be jump-threaded away, but it's not practical to rely on. If an engine doesn't optimize it away, it would have a disruptive level of overhead. If most engines do optimize it, then it's unclear what's accomplished by keeping irreducible control flow out of the language itself.
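For concreteness, here is the thread's opening irreducible loop re-encoded as a loop with a switch on a state variable. The function shape and the counters are hypothetical, added only so the sketch is runnable, but the state-machine pattern is exactly the one a jump-threading pass could fold back into direct branches.

```c
/* The irreducible loop from the top of this issue:
 *     if (x) goto inside_loop;
 *     // banana
 *     while (y) { // things
 *     inside_loop:
 *         // do things
 *     }
 * rewritten reducibly with an explicit state variable. `y` is modeled
 * as a countdown of loop-header passes; counters stand in for the
 * comment bodies so the control flow is observable. */
int run(int x, int y_iters) {
    int things = 0, do_things = 0;
    int state;
    if (x) {
        state = 1;              /* goto inside_loop */
    } else {
        /* banana */
        state = 0;
    }
    for (;;) {
        switch (state) {
        case 0:                 /* loop header: while (y) */
            if (y_iters-- <= 0)
                return things * 100 + do_things;
            things++;           /* things */
            state = 1;
            break;
        case 1:                 /* inside_loop: */
            do_things++;        /* do things */
            state = 0;
            break;
        }
    }
}
```

A jump-threading pass that sees the constant `state = 1` flowing straight into `switch (state)` can replace the round trip through the dispatch with a direct branch to `case 1`, recovering the original irreducible graph; an engine that doesn't do this pays the compare-and-branch on every transition.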

Sigh… I meant to reply earlier but life got in the way.

I’ve been grepping through some JS engines and I guess I have to weaken my claim about irreducible control flow ‘just working’. I still don’t think it would be that hard to make it work, but there are some constructs that would be difficult to adapt in a way that would actually benefit over…

Well, let’s assume, for the sake of argument, that making the optimization pipeline support irreducible control flow properly is too hard. A JS engine can still easily support it in a hacky way, like this:

Within the backend, treat a labels block as if it were a loop+switch until the last minute. In other words, when you see a labels block, you treat it as a loop header with an outward edge pointing to each label, and when you see a branch that targets a label, you create an edge pointing to the labels header, not the actual target label - which should be stored separately somewhere. No need to create an actual variable to store the target label, like a real loop+switch would have to do; it should be sufficient to stash the value in some field of the branch instruction, or create a separate control instruction for the purpose. Then, optimizations, scheduling, even register allocation can all pretend there are two jumps. But when the time comes to actually generate a native jump instruction, you check that field, and generate a jump directly to the target label.

There might be issues with, e.g., any optimization that merges/deletes branches, but it should be pretty easy to avoid that; the details depend on the engine design.

In some sense, my suggestion is equivalent to @titzer’s “simple local jump-threading optimization”. I’m suggesting making ‘native’ irreducible control flow look like a loop+switch, but an alternative would be to identify real loop+switches – that is, @titzer’s “pattern where an assignment of a constant to a local variable occurs, then a branch to a conditional branch that immediately switches on that local variable” – and add metadata allowing the indirect branch to be removed late in the pipeline. If this optimization becomes ubiquitous, it could be a decent substitute for an explicit instruction.

Either way, the obvious downside to the hacky approach is that optimizations don’t understand the real control flow graph; they effectively act as if any label could jump to any other label. In particular, register allocation has to treat a variable as live in all labels, even if, say, it’s always assigned right before jumping to a specific label, as in this pseudocode:

a:
  control = 1;
  goto x;
b:
  control = 2;
  goto x;
...
x:
  // use control

That could lead to seriously suboptimal register use in some cases. But as I’ll note later, the liveness algorithms that JITs use may be fundamentally unable to do this well, anyway…

Whatever the case, optimizing late is a lot better than not optimizing at all. A single direct jump is much nicer than a jump + compare + load + indirect jump; the CPU branch predictor may eventually be able to predict the latter’s target based on past state, but not as well as the compiler can. And you can avoid spending a register and/or memory on the ‘current state’ variable.

As for the representation, which is better: explicit (labels instruction or similar) or implicit (optimization of real loop+switches following a specific pattern)?

Benefits to implicit:

  • Keeps the specification lean.

  • Might already work with existing loop+switch code. But I haven’t looked at the stuff binaryen generates to see if it follows a strict enough pattern.

  • Making the blessed way of expressing irreducible control flow feel like a hack highlights the fact that it’s slower in general and should be avoided when possible.

Drawbacks to implicit:

  • It feels like a hack. True, as @titzer says, it doesn’t actually disadvantage engines that ‘properly’ support irreducible control flow; they can recognize the pattern early and recover the original irreducible flow before performing optimizations. Still, it seems neater to just allow the real jumps.

  • Creates an “optimization cliff”, which WebAssembly is generally supposed to avoid compared to JS. To recount, the basic pattern to be optimized is “where an assignment of a constant to a local variable occurs, then a branch to a conditional branch that immediately switches on that local variable”. But what if, say, there are some other instructions in between, or the assignment isn’t actually using a wasm const instruction but merely something known as constant due to optimizations? Some engines may be more liberal than others in what they recognize as this pattern, but then code that takes advantage of that (intentionally or not) will have vastly different performance between browsers. Having a more explicit encoding sets expectations more clearly.

  • Makes it harder to use wasm like an IR in hypothetical postprocessing steps. If a wasm-targeting compiler does things the normal way and handles all optimizations/transformations with an internal IR before eventually running a relooper and ultimately generating wasm, then it wouldn’t mind the existence of magic instruction sequences. But if a program wants to run any transformations on wasm code itself, it would have to avoid breaking up those sequences, which would be annoying.

Anyway, I don’t care that much either way - as long as, should we decide on the implicit approach, the major browsers actually commit to performing the relevant optimization.

Going back to the question of supporting irreducible flow natively - what the obstacles are, how much benefit there is - here are some specific examples from IonMonkey of optimization passes that would have to be modified to support it:

AliasAnalysis.cpp: iterates over blocks in reverse postorder (once), and generates ordering dependencies for an instruction (as used in InstructionReordering) by looking only at previously seen stores as possibly aliasing. This doesn’t work for cyclic control flow. But (explicitly marked) loops are handled specially, with a second pass that checks instructions in loops against any later stores anywhere in the same loop.

-> So there’d have to be some loop marking for labels blocks. In this case, I think marking the whole labels block as a loop would ‘just work’ (without specially marking the individual labels), as the analysis is too imprecise to care about the control flow within the loop.

FlowAliasAnalysis.cpp: an alternative algorithm which is a bit smarter. Also iterates over blocks in reverse postorder, but on encountering each block it merges the calculated last-stores information for each of its predecessors (assumed to have already been calculated), except for loop headers, where it takes the backedge into account.

-> Messier because it assumes (a) predecessors to individual basic blocks always appear before it except for loop backedges, and (b) a loop can only have one backedge. There are different ways this could be fixed, but it would probably require explicit handling of labels, and for the algorithm to stay linear, it would probably have to work pretty crudely in that case, more like regular AliasAnalysis - reducing the benefit compared to the hacky approach. Not sure how heavyweight compilers handle this type of optimization.

BacktrackingAllocator.cpp: similar behavior for register allocation: it does a linear reverse pass through the list of instructions and assumes that all uses of an instruction will appear after (i.e. be processed before) its definition, except when encountering loop backedges: registers which are live at the beginning of a loop simply stay live through the entire loop.

-> Every label would need to be treated like a loop header, but the liveness would have to extend for the entire labels block. Not hard to implement, but again, the result would be no better than the hacky approach. I think.

@comex Another consideration here is how much wasm engines are expected to do. For example, you mention Ion's AliasAnalysis above, however the other side of the story is that alias analysis isn't that important for WebAssembly code, at least for now while most programs are using linear memory.

Ion's BacktrackingAllocator.cpp liveness algorithm would require some work, but it wouldn't be prohibitive. Most of Ion already does handle various forms of irreducible control flow, since OSR can create multiple entries into loops.

A broader question here is what optimizations WebAssembly engines will be expected to do. If one expects WebAssembly to be an assembly-like platform, with predictable performance where producers/libraries do most of the optimization, then irreducible control flow would be a pretty low cost, because engines wouldn't need the big complex algorithms where it's a significant burden. If one expects WebAssembly to be a higher-level bytecode, which does more high-level optimization automatically, and engines are more complex, then it becomes more valuable to keep irreducible control flow out of the language, to avoid the extra complexity.

BTW, also worth mentioning in this issue is Braun et al.'s on-the-fly SSA construction algorithm, which is simple and fast, and supports irreducible control flow.

I'm interested in using WebAssembly as a qemu backend on iOS, where WebKit (and the dynamic linker, but that checks code signing) is the only program that is allowed to mark memory as executable. Qemu's codegen assumes that goto statements will be a part of any processor it has to codegen for, which makes a WebAssembly backend almost impossible without gotos being added.

@tbodt - Would you be able to use Binaryen's relooper? That lets you generate what is basically Wasm-with-goto and then converts it into structured control flow for Wasm.

@eholk That sounds like it would be much much slower than a direct translation of machine code to wasm.

@tbodt Using Binaryen does add an extra IR on the way, yeah, but it shouldn't be much slower, I think, it's optimized for compilation speed. And it may also have benefits other than handling gotos etc. as you can optionally run the Binaryen optimizer, which may do things the qemu optimizer doesn't (wasm-specific things).

Actually I would be very interested to collaborate with you on that, if you want :) I think porting Qemu to wasm would be very useful.

So on second thought, gotos wouldn't really help a whole lot. Qemu's codegen generates the code for basic blocks when they are first run. If a block jumps to a block that hasn't been generated yet, it generates the block and patches the previous block with a goto to the next block. Dynamic code loading and patching of existing functions are not things that can be done in webassembly, as far as I know.

@kripken I'd be interested in collaborating, where would be the best place to chat with you?

You can't patch existing functions directly, but you can use call_indirect and a WebAssembly.Table to jit code. For any basic block that hasn't been generated, you can call out to JavaScript, generate the WebAssembly module and instance synchronously, extract the exported function and write it over the index in the table. Future calls will then use your generated function.

I'm not sure that anyone has tried this yet, though, so there's likely to be many rough edges.

That could work if tailcalls were implemented. Otherwise the stack would overflow pretty quickly.

Another challenge would be allocating space in the default table. How do you map an address to a table index?

Another option is to regenerate the wasm function on each new basic block. This means a number of recompiles equal to the number of used blocks, but I'd bet it's the only way to get the code to run quickly after it is compiled (especially inner loops), and it doesn't need to be a full recompile, we can reuse the Binaryen IR for each existing block, add IR for the new block, and just run the relooper on all of them.

(But maybe we can get qemu to compile the whole function up front instead of lazily?)

@tbodt for collaboration on doing this with Binaryen, one option is to create a repo with your work (and can use issues there etc.), another is to open a specific issue in Binaryen for qemu.

We can't get qemu to compile a whole function at a time, because qemu doesn't have a concept of a "function".

As for recompiling the whole cache of blocks, that sounds like it might take a long time. I'll figure out how to use qemu's builtin profiler and then open an issue on binaryen.

Side question. In my view, a language targeting WebAssembly should be able to provide efficient mutually recursive functions. For a depiction of their usefulness I'd invite you to read: http://sharp-gamedev.blogspot.com/2011/08/forgotten-control-flow-construct.html

In particular, the need expressed by Cheery seems to be addressed by mutually recursive functions.

I understand the need for tail recursion, but I'm wondering whether mutually recursive functions can only be implemented if the underlying machinery provides gotos. If they require gotos, then to me that makes a legitimate argument in favour of them, since there'll be a ton of programming languages that'll have a hard time targeting WebAssembly otherwise. If they don't, then perhaps the minimum mechanism to support mutually recursive functions is all that would be needed (along with tail recursion).

@davidgrenier, the functions in a Wasm module are all mutually recursive. Can you elaborate what you regard as inefficient about them? Are you only referring to the lack of tail calls or something else?

General tail calls are coming. Tail recursion (mutual or otherwise) is gonna be a special case of that.

I wasn't saying anything was inefficient about them. I'm saying that if you have them, you don't need general goto, because mutually recursive functions provide all that a language implementer targeting WebAssembly should need.

Goto is very useful for code generation from diagrams in visual programming. Maybe visual programming is not very popular now, but in the future it may attract more people, and I think wasm should be ready for it. More about code generation from diagrams and goto: http://drakon-editor.sourceforge.net/generation.html

The upcoming Go 1.11 release will have experimental support for WebAssembly. This will include full support for all of Go's features, including goroutines, channels, etc. However, the performance of the generated WebAssembly is currently not that good.

This is mainly because of the missing goto instruction. Without the goto instruction we had to resort to using a toplevel loop and jump table in every function. Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function. The relooper can not help with this, only a goto instruction can.

It is awesome that WebAssembly got to the point where it can support a language like Go. But to be truly the assembly of the web, WebAssembly should be equally powerful as other assembly languages. Go has an advanced compiler which is able to emit very efficient assembly for a number of other platforms. This is why I would like to argue that it is mainly a limitation of WebAssembly and not of the Go compiler that it is not possible to also use this compiler to emit efficient assembly for the web.

Using the relooper algorithm is not an option for us, because when switching between goroutines we need to be able to resume execution at different points of a function.

Just to clarify, a regular goto would not be enough for that, a computed goto is required for your use case, is that correct?

I think a regular goto would probably be sufficient in terms of performance. Jumps between basic blocks are static anyways and for switching goroutines a br_table with gotos in its branches should be performant enough. Output size is a different question though.

It sounds like you have normal control flow in each function, but also need the ability to jump from the function entry to certain other locations in the "middle", when resuming a goroutine - how many such locations are there? If it's every single basic block, then the relooper would be forced to emit a toplevel loop that every instruction goes through, but if it's just a few, that shouldn't be a problem. (That's actually what happens with setjmp support in emscripten - we just create the extra necessary paths between LLVM's basic blocks, and let the relooper process that normally.)

Every call to some other function is such a location and most basic blocks have at least one call instruction. We're more or less unwinding and restoring the call stack.

I see, thanks. Yeah, I agree that for that to be practical you need either static goto or call stack restoring support (which has also been considered).

Will it be possible to call function in CPS style or implement call/cc in WASM?

@Heimdell, support for some form of delimited continuations (a.k.a. "stack switching") is on the road map, which should be enough for almost any interesting control abstraction. We cannot support undelimited continuations (i.e., full call/cc), though, since the Wasm call stack can be arbitrarily intermixed with other languages, including reentrant calls out to the embedder, and thus cannot be assumed to be copyable or movable.

Reading through this thread, I get the impression that arbitrary labels and gotos has a major hurdle before becoming a feature:

  • Unstructured control flow makes irreducible control flow graphs possible
  • Eliminating* any "fast, simple verification, easy, one pass conversion to SSA form"
  • Opening up the JIT compiler to nonlinear performance
  • People browsing webpages shouldn't have to suffer delays if the original language compiler can do the up-front work

_*although there might be alternatives such as Braun et al's on-the-fly SSA construction algorithm which handles irreducible control flow_

If we're still stuck there, _and_ tail calls are moving forward, maybe it'd be worth asking language compilers to still translate to gotos, but as a final step before WebAssembly output, split up the "label blocks" into functions, and convert the gotos into tail calls.

According to Scheme designer Guy Steele's 1977 paper, Lambda: The Ultimate GOTO, the transformation should be possible, and the performance of tail calls should be able to closely match gotos.

Thoughts?

If we're still stuck there, _and_ tail calls are moving forward, maybe it'd be worth asking language compilers to still translate to gotos, but as a final step before WebAssembly output, split up the "label blocks" into functions, and convert the gotos into tail calls.

This is essentially what every compiler would do anyway, no-one that I know of is advocating for unmanaged gotos of the kind that cause so many problems in the JVM, just for a graph of typed EBBs. LLVM, GCC, Cranelift and the rest all have a (possibly-irreducible) SSA-form CFG as their internal representation and the compilers from Wasm to native have the same internal representation, so we want to preserve as much of that information as possible and reconstruct as little of that information as possible. Locals are lossy, since they're no longer SSA, and Wasm's control flow is lossy, since it's no longer an arbitrary CFG. AFAIK having Wasm be an infinite-register SSA register machine with embedded fine-grained register liveness information would probably be the best for codegen but code size would bloat, a stack machine with control flow modelled on an arbitrary CFG is probably the best middle-ground. I might be wrong about code size with a register machine though, it might be possible to encode it efficiently.

The thing about irreducible control flow is that if it's irreducible on the front-end it's still irreducible in wasm; the relooper/stackifier conversion doesn't make the control flow reducible, it just converts the irreducibility to be dependent on runtime values. This gives the backend less information and so it can produce worse code; the only way to produce good code for irreducible CFGs right now is to detect the patterns emitted by relooper and stackifier and convert them back to an irreducible CFG. Unless you're developing V8, which AFAIK only supports reducible control flow, supporting irreducible control flow is purely a win - it makes both frontends and backends way simpler (frontends can just emit code in the same format they internally store it as, backends don't have to detect patterns) while producing better output in the case that the control flow is irreducible, and output that is just as good or better in the usual case that control flow is reducible.

Plus it would allow GCC and Go to start producing WebAssembly.

I know V8 is an important component of the WebAssembly ecosystem but it seems to be the only part of that ecosystem that benefits from the current control flow situation, all the other backends that I'm aware of convert to a CFG anyway and are unaffected by whether WebAssembly can represent irreducible control flow or not.

Couldn't v8 just incorporate relooper in order to accept input CFGs? It seems like large chunks of the ecosystem are blocked on implementation details of v8.

Just for reference, I noticed switch statements in C++ are very slow in wasm. When I profiled my image-processing code I had to convert them to other forms which ran much more quickly, and it was never a problem on other architectures. I really would like goto for performance reasons.

@graph, could you provide more details about how "switch statements are slow"? Always looking for an opportunity to improve performance... (If you don't want to bog down this thread, email me directly, [email protected].)

I'll post here as this applies to all browsers. Simple statements like this, when compiled with emscripten, were faster when I converted them to if statements.

for(y = ....) {
    for(x = ....) {
        switch(type) {
        case IS_RGBA:
            ....
        case IS_BGRA:
            ....
        case IS_RGB:
            ....
....

I assume the compiler was converting a jump table to whatever wasm supports. I didn't look into generated assembly so can't confirm.

I know a couple unrelated to wasm things that can be optimized for image processing on the web. I already submitted it via "feedback" button in firefox. If you are interested let me know and I'll email you the issues.

@graph A complete benchmark would be very helpful here. In general, a switch in C can turn into a very fast jump table in wasm, but there are corner cases that don't work well yet, that we may need to fix, either in LLVM or in browsers.

In emscripten specifically, how switches are handled changes a lot between the old fastcomp backend and the new upstream one, so if you saw this a while ago, or recently but using fastcomp, it would be good to check on upstream.

@graph, If emscripten produces a br_table then the jit will sometimes generate a jump table and sometimes (if it thinks it will be faster) it will search the key space linearly or with an in-line binary search. What it does often depends on the size of the switch. It is of course possible that the selection policy is not optimal... I agree with @kripken, runnable code would be very helpful here if you have some to share.

(Don't know about v8 or jsc, but Firefox currently does not recognize an if-then-else chain as a possible switch, so it's usually not a good idea to open-code switches as long if-then-else chains. The break-even point is probably at no more than two or three comparisons.)

@lars-t-hansen @kripken @graph it may well be that br_table is currently just very un-optimized as this exchange seems to show: https://twitter.com/battagline/status/1168310096515883008

@aardappel, that's curious; benchmarks I ran yesterday did not show this. In Firefox on my system the break-even point was at around 5 cases as I remember it, and after that br_table was the winner. Microbenchmark of course, and with some attempt at an even distribution of the lookup keys. If the "if" nest is biased toward the most likely keys so that no more than a couple of tests are needed, then the "if" nest will win.

If it can't do the range analysis on the switch value to avoid it then the br_table will also have to do at least one filtering test for the range of the switch, which also eats into its advantage.

@lars-t-hansen Yes, we don't know his test case; maybe it had an outlier value. Either way, it looks like Chrome has more work to do than Firefox.

I'm on vacation, hence my lack of replies. Thanks for understanding.

@kripken @lars-t-hansen I have run some tests, and it seems wasm is indeed better now in Firefox. There are still some cases where if-else out-performs switch. Here is a case:


Main.cpp

#include <stdio.h>

#include <chrono>
#include <random>

class Chronometer {
public:
    Chronometer() {

    }

    void start() {
        mStart = std::chrono::steady_clock::now();
    }

    double seconds() {
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::duration<double>>(end - mStart).count();
    }

private:
    std::chrono::steady_clock::time_point mStart;
};

int main() {
    printf("Starting tests!\n");
    Chronometer timer;
    // we want to prevent optimizations based on known size as most applications
    // do not know the size in advance.
    std::random_device rd;  //Will be used to obtain a seed for the random number engine
    std::mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd()
    std::uniform_int_distribution<> dis(100000000, 1000000000);
    std::uniform_int_distribution<> opKind(0, 3);
    int maxFrames = dis(gen);
    int switchSelect = 0;
    constexpr int SW1 = 1;
    constexpr int SW2 = 8;
    constexpr int SW3 = 32;
    constexpr int SW4 = 38;

    switch(opKind(gen)) {
    case 0:
        switchSelect = SW1;
        break;
    case 1:
        switchSelect = SW2; break;
    case 2:
        switchSelect = SW3; break;
    case 4:
        switchSelect = SW4; break;
    }
    printf("timing with SW = %d\n", switchSelect);
    timer.start();
    int accumulator = 0;
    for(int i = 0; i < maxFrames; ++i) {
        switch(switchSelect) {
        case SW1:
            accumulator = accumulator*3 + i; break;
        case SW2:
            accumulator = (accumulator < 3)*i; break;
        case SW3:
            accumulator = (accumulator&0xFF)*i + accumulator; break;
        case SW4:
            accumulator = (accumulator*accumulator) - accumulator + i; break;
        }
    }
    printf("switch time = %lf seconds\n", timer.seconds());
    printf("accumulated value: %d\n", accumulator);
    timer.start();
    accumulator = 0;
    for(int i = 0; i < maxFrames; ++i) {
        if(switchSelect == SW1)
            accumulator = accumulator*3 + i;
        else if(switchSelect == SW2)
            accumulator = (accumulator < 3)*i;
        else if(switchSelect == SW3)
            accumulator = (accumulator&0xFF)*i + accumulator;
        else if(switchSelect == SW4)
            accumulator = (accumulator*accumulator) - accumulator + i;
    }
    printf("if-else time = %lf seconds\n", timer.seconds());
    printf("accumulated value: %d\n", accumulator);

    return 0;
}

Depending on the value of switchSelect, if-else outperforms. Example output:

Starting tests!
timing with SW = 32
switch time = 2.049000 seconds
accumulated value: 0
if-else time = 0.401000 seconds
accumulated value: 0

As you can see for switchSelect = 32 if-else is way faster. For the other cases if-else is a bit faster. For the case switchSelect = 1 & 0, switch statement is faster.

Test in Firefox 69.0.3 (64-bit)
compiled using: emcc -O3 -std=c++17 main.cpp -o main.html
emcc version: emcc (Emscripten gcc/clang-like replacement) 1.39.0 (commit e047fe4c1ecfae6ba471ca43f2f630b79516706b)

Using the latest stable emscripten as of Oct 20 2019. Fresh install, ./emsdk activate latest.

I noticed above there is a typo, but it shouldn't affect the finding that if-else is faster in the SW3 case, as they are executing the same instructions.

Again, this time going beyond the break-even point of 5 cases: interestingly, for switchSelect=32 this version is similar in speed to if-else. And as you can see, for 1003 if-else is slightly faster, where switch should win.

Starting tests!
timing with SW = 1003
switch time = 2.253000 seconds
accumulated value: 1903939380
if-else time = 2.197000 seconds
accumulated value: 1903939380


main.cpp

#include <stdio.h>

#include <chrono>
#include <random>

class Chronometer {
public:
    Chronometer() {

    }

    void start() {
        mStart = std::chrono::steady_clock::now();
    }

    double seconds() {
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::duration<double>>(end - mStart).count();
    }

private:
    std::chrono::steady_clock::time_point mStart;
};

int main() {
    printf("Starting tests!\n");
    Chronometer timer;
    // we want to prevent optimizations based on known size as most applications
    // do not know the size in advance.
    std::random_device rd;  //Will be used to obtain a seed for the random number engine
    std::mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd()
    std::uniform_int_distribution<> dis(100000000, 1000000000);
    std::uniform_int_distribution<> opKind(0, 8);
    int maxFrames = dis(gen);
    int switchSelect = 0;
    constexpr int SW1 = 1;
    constexpr int SW2 = 8;
    constexpr int SW3 = 32;
    constexpr int SW4 = 38;
    constexpr int SW5 = 64;
    constexpr int SW6 = 67;
    constexpr int SW7 = 1003;
    constexpr int SW8 = 256;

    switch(opKind(gen)) {
    case 0:
        switchSelect = SW1;
        break;
    case 1:
        switchSelect = SW2; break;
    case 2:
        switchSelect = SW3; break;
    case 3:
        switchSelect = SW4; break;
    case 4:
        switchSelect = SW5; break;
    case 5:
        switchSelect = SW6; break;
    case 6:
        switchSelect = SW7; break;
    case 7:
        switchSelect = SW8; break;
    }
    printf("timing with SW = %d\n", switchSelect);
    timer.start();
    int accumulator = 0;
    for(int i = 0; i < maxFrames; ++i) {
        switch(switchSelect) {
        case SW1:
            accumulator = accumulator*3 + i; break;
        case SW2:
            accumulator = (accumulator < 3)*i; break;
        case SW3:
            accumulator = (accumulator&0xFF)*i + accumulator; break;
        case SW4:
            accumulator = (accumulator*accumulator) - accumulator + i; break;
        case SW5:
            accumulator = (accumulator << 3) - accumulator + i; break;
        case SW6:
            accumulator = (i - accumulator) & 0xFF; break;
        case SW7:
            accumulator = i*i + accumulator; break;
        }
    }
    printf("switch time = %lf seconds\n", timer.seconds());
    printf("accumulated value: %d\n", accumulator);
    timer.start();
    accumulator = 0;
    for(int i = 0; i < maxFrames; ++i) {
        if(switchSelect == SW1)
            accumulator = accumulator*3 + i;
        else if(switchSelect == SW2)
            accumulator = (accumulator < 3)*i;
        else if(switchSelect == SW3)
            accumulator = (accumulator&0xFF)*i + accumulator;
        else if(switchSelect == SW4)
            accumulator = (accumulator*accumulator) - accumulator + i;
        else if(switchSelect == SW5)
            accumulator = (accumulator << 3) - accumulator + i;
        else if(switchSelect == SW6)
            accumulator = (i - accumulator) & 0xFF;
        else if(switchSelect == SW7)
            accumulator = i*i + accumulator;

    }
    printf("if-else time = %lf seconds\n", timer.seconds());
    printf("accumulated value: %d\n", accumulator);

    return 0;
}


Thank you guys for taking a look into these test cases.

That is a very sparse switch though, which LLVM should convert to the equivalent of a set of if-then's anyway, but apparently it does so in a way that's less efficient than the manual if-thens. Have you tried running wasm2wat to see how these two loops differ in code?

This also depends strongly on this test using the same value on each iteration. This test would be better if it cycled thru all values, or better yet, randomly picked from them (if that can be done cheaply).

Better yet, the real reason people use switch for performance is with a dense range, so you can guarantee it is actually using br_table underneath. Seeing at how many cases br_table is faster than if would be the most useful thing to know.

The switch in the tight loops was used because it was cleaner code, at the cost of performance. But for wasm the performance impact was too large, so it was converted to uglier if-statements. For image processing, in a lot of my use cases, if I want more performance out of a switch, I move the switch outside the loop and simply have copies of the loop for each case. Usually the switch is just selecting between some form of pixel format, color format, encoding, etc., and in many cases the constants are computed via defines or enums and are not linear. I see now that my issue is not related to the goto design; I just had an incomplete understanding of what was happening with my switch statements. I hope my notes are useful to browser devs reading this to optimize wasm for image processing in these cases. Thank you.

I never thought goto could be such a heated debate 😮 . I'm in the boat of every language should have a goto 😁 . Another reason to add goto is that it reduces the complexity for compilers targeting wasm. I'm pretty sure that's mentioned above somewhere. Now I have nothing to complain about 😞 .

Any further progress there?

Due to the heated debate, I would assume some browser will add support for goto as a non-standard bytecode extension. Then maybe GCC can enter the game by supporting the non-standard version, which I don't think is good overall but would allow more compiler competition. Has this been considered?

There hasn't been much progress lately, but you may want to look at the funclets proposal.

@graph to me, your suggestion sounds like "let's break everything, and hope for the better".
It doesn't work like that. There are MANY benefits from the current WebAssembly structure (that are not obvious, unfortunately). Try diving deeper into the philosophy of wasm.

Allowing "arbitrary Labels and Gotos" will bring us back to the (ancient) times of non-verifiable bytecode. All the compilers will just switch to a "lazy way" of doing things, instead of "doing it right".

It's clear that wasm in its current state has some major omissions. People are working on filling the gaps (like the one mentioned by @binji), but I don't think the "global wasm structure" needs to be reworked. Just my humble opinion.

@vshymanskyy The funclets proposal, which provides functionality equivalent to arbitrary labels and gotos, is fully validatable, in linear time.

I should also mention that in our linear-time Wasm compiler, we internally compile all Wasm control flow into a funclets-like representation, which I have some information about in this blog post, and the conversion from Wasm control flow to this internal representation is implemented here. The compiler gets all its type information from this funclets-like representation, so suffice to say it is trivial to validate the type-safety of it in linear time.

I think that this misconception that irreducible control flow cannot be validated in linear time stems from the JVM, where irreducible control flow must be executed using the interpreter instead of being compiled. This is because the JVM doesn't have any way to represent type metadata for irreducible control flow, and so it cannot do the conversion from stack machine to register machine. "Arbitrary gotos" (i.e. jump to byte/instruction X) is not verifiable at all, but separating a function into typed blocks, which can then be jumped between in an arbitrary order, is no harder to verify than separating a module into typed functions, which can then be jumped between in an arbitrary order. You do not need jump-to-byte-X-style untyped gotos to implement any useful patterns that would be emitted by compilers like GCC and LLVM.

I just love the process here. Side A explains why this is needed in specific applications. Side B says they're doing it wrong, but offers no support for that application. Side A explains how none of B's pragmatic arguments hold water. Side B doesn't want to deal with it because they think side A is doing it wrong. Side A is trying to accomplish a goal. Side B says that's the wrong goal, calling it lazy or brutish. The deeper philosophical meanings are lost on side A. The pragmatic is lost on side B, as they claim to have some sort of higher moral ground. Side A sees this as an amoral mechanistic operation. Ultimately, side B generally remains in control of the spec for better or for worse, and they've gotten incredible amount done with their relative purity.

Honestly, I just stuck my nose in here because years ago, I was trying to make a TinyCC port to WASM so I could run a development environment on an ESP8266, targeting the ESP8266 itself. I only have ~4MB of storage, so including a relooper, switching to an AST, and many other changes is out of the question. (Side note: how is relooper the only thing like relooper? It's sooo awful and no one's rewritten that sucker in C!?) Even if it were possible at this point, I don't know if I would write a TinyCC target for WASM, since it's just not as interesting to me any more.

This thread, though. Holy cow, this thread has brought me so much existential joy: watching a bifurcation in humanity that runs deeper than Democrat or Republican, or religion. I feel like if this can ever be resolved, if A can come to live in B's world, or B can validate A's claim that procedural programming has its place... then we could solve world peace.

Could someone in charge of V8 confirm in this thread that the opposition against irreducible control flow is not influenced by V8's current implementation?

I am asking because this is what bugs me most. To me it seems like this should be a discussion on the spec level about pros and cons of this feature. It should not at all be influenced by how a particular implementation is currently designed. However, there have been statements that make me believe that V8's implementation is influencing this. Maybe I am wrong. An open statement might help.

Well, as unfortunate as it is, the few implementations that exist so far are treated as so important that the future (presumably longer than the past) becomes unimportant. I tried to explain at #1202 that consistency matters more than the few current implementations, but it seems I'm delusional. Good luck explaining that development decisions made somewhere in some project do not constitute a universal truth, nor have to be assumed correct by default.

This thread is one canary in the W3C coal mine. Though I have great respect for many W3C individuals, the decision to entrust JavaScript to Ecma International, and not the W3C, was not made without prejudice.

Like @cnlohr, I had hopes for a TCC wasm port, and for good reason:

"Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications." - webassembly.org

Sure, anyone can pontificate on why goto is [INSERT JARGON], but how about we prefer standards over opinions. We can all agree POSIX C is a good baseline target, especially given that today's languages are either wrought from or benchmarked against C, and WASM's homepage headline touts it as a portable compilation target for languages. Sure, some features will be roadmapped, like threads and SIMD. But to wholly disregard something as fundamental as goto, to not even give it the decency of roadmapping, is not consistent with WASM's stated purpose, and such a stance from the standardization body that greenlit <marquee> is beyond the pale.

According to SEI CERT C Coding Standard Rec. MEM12-C, titled "Consider using a goto chain when leaving a function on error when using and releasing resources":

Many functions require the allocation of multiple resources. Failing and returning somewhere in the middle of this function without freeing all of the allocated resources could produce a memory leak. It is a common error to forget to free one (or all) of the resources in this manner, so a goto chain is the simplest and cleanest way to organize exits while preserving the order of freed resources.

The recommendation then offers an example with the preferred POSIX C solution using goto. The naysayers will point to the note that goto is still considered harmful. Interestingly, this opinion is not embodied in one of those particular coding standards, just a note. Which brings us to the canary, the "considered harmful".
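For readers unfamiliar with the pattern, here is a minimal sketch of the goto chain that MEM12-C describes (function and label names invented for illustration, not taken from the recommendation's own example):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Goto-chain cleanup: each exit label releases exactly the resources
   acquired before it, in reverse order of acquisition. */
int copy_twice(const char *src, size_t n) {
    int result = -1;                  /* assume failure until the end */
    char *first = malloc(n);
    if (first == NULL)
        goto done;
    char *second = malloc(n);
    if (second == NULL)
        goto free_first;

    memcpy(first, src, n);            /* stand-in for real work */
    memcpy(second, src, n);
    result = 0;

    free(second);
free_first:
    free(first);
done:
    return result;
}
```

Each failure point jumps to the link in the chain that frees only what was actually allocated, so the success path and every error path share one set of cleanup code.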

Bottom line: considering "CSS Regions" or goto harmful should only be weighed alongside a proposed solution to the problem that such a feature is used for. If removing said "harmful" feature amounts to removing the reasonable use cases without an alternative, that's not a solution; that's in fact harmful to the users of the language.

Functions are not zero-cost, even in C. If someone offers a replacement for gotos & labels, canihaz please! If someone says I don't need it, how do they know that? When it comes to performance, goto can give us that little extra; it's hard to argue to engineers that we don't need performant, easy-to-grok features that have existed since the dawn of the language.

Without a plan to support goto, WASM is a toy compilation target, and that's ok; maybe that's how the W3C sees the web. I hope WASM as a standard reaches higher, out of the 32-bit address space, and enters the compilation race. I hope the engineering discourse can get away from "that's not possible..." toward fast-tracking GCC C extensions like Labels as Values, because WASM should be AWESOME. Personally, TCC is considerably more impressive and more useful at this point, without all the wasted pontificating, without the hipster landing page and shiny logo.

@d4tocchini:

According to SEI CERT C Coding Standard Rec. MEM12-C, titled "Consider using a goto chain when leaving a function on error when using and releasing resources":

Many functions require the allocation of multiple resources. Failing and returning somewhere in the middle of this function without freeing all of the allocated resources could produce a memory leak. It is a common error to forget to free one (or all) of the resources in this manner, so a goto chain is the simplest and cleanest way to organize exits while preserving the order of freed resources.

The recommendation then offers an example with the preferred POSIX C solution using goto. The naysayers will point to the note that goto is still considered harmful. Interestingly, this opinion is not embodied in one of those particular coding standards, just a note. Which brings us to the canary, the "considered harmful".

The example given in that recommendation can be directly expressed with labelled breaks, which are available in Wasm. It does not need the extra power of arbitrary goto. (C does not provide labelled break and continue, so has to fall back to goto more often than necessary.)

@rossberg, good point on labelled breaks in that example, but I disagree with your qualitative assumption that C must "fall back". goto is a richer construct than labelled breaks. If C is to be included among the portable compilation targets, and C does not support labelled breaks, that's rather a moot point. Java has labelled breaks/continues, whereas Python rejected the proposed feature, and considering that both the Sun JVM and the default CPython are written in C, wouldn't you agree C as a supported language ought to be higher on the priority list?

If goto is to be so readily dropped from consideration, should the hundreds of uses of goto within emscripten's source be reconsidered as well?

Is there a language that can't be written in C? C as a language should be informing WASM's features. If POSIX C is not possible with today's WASM, then there's your proper roadmap.

Not really on the topic of the argument, but just so random mistakes aren't left lurking here and there in the argumentation:

Python have labelled breaks

Can you elaborate? (Aka: Python doesn't have labelled breaks.)

@pfalcon, yes my bad, I edited my comment to clarify python proposed labelled breaks/continues and rejected it

If goto is to be so readily dropped from consideration, should the hundreds of uses of goto within emscripten's source be reconsidered as well?

1) Note how much of that is present in musl libc, not directly in emscripten. (Second most used is tests/third_party)
2) Source level constructs are not the same as bytecode instructions
3) Emscripten is not at the same level of abstraction as the wasm standard, so, no it shouldn't be reconsidered on that basis.

Specifically, it might be useful today to rewrite the gotos out of libc, because then we'd have more control over the resulting cfg than trusting relooper/cfgstackify to handle it well. We haven't, because doing so would be a nontrivial amount of work and would leave us with wildly divergent code from upstream musl.

Emscripten developers (last I checked) tend to be of the opinion that a goto-like structure would be really nice, for those obvious reasons, so are unlikely to drop it from consideration, even if it takes years to reach an acceptable compromise.

such a stance from the standardization body that greenlit <marquee> is beyond the pale.

This is a particularly asinine statement.

1) We-the-broader-Internet are over a decade removed from having made that decision
2) We-the-wasm-CG are an entirely (nearly?) separate group of people from that tag, and are probably individually annoyed by obvious past mistakes as well.

without all the wasted pontificating, without the hipster landing page and shiny logo.

This could have been reworded to "I am frustrated" without running into tone issues.

As this thread shows, these conversations are difficult enough as it is.

There's a new level of deeply concerning when you want to rewrite a deeply trusted and well-understood set of functions from scratch just because an environment for their use has to go through extra steps to support them. (Though I am still firmly in the please-add-goto camp, because I hate being tied to only one specific compiler.)

I think this thread moved way past being productive - it has been running for over four years now and it looks like every possible argument for and against arbitrary gotos has been used here; it should also be noted that none of those arguments are particularly new ;)

There are managed runtimes that chose not to have arbitrary jump labels, which worked out fine for them. Also, there are programming systems where arbitrary jumps are permitted and they are doing well too. In the end, authors of a programming system make design choices and only time really shows whether those choices are successful or not.

Wasm design choices that forbid arbitrary jumps are core to its philosophy. It is unlikely it can support gotos without something like funclets, for the same reasons it does not support pure indirect jumps.

Wasm design choices that forbid arbitrary jumps are core to its philosophy. It is unlikely it can support gotos without something like funclets, for the same reasons it does not support pure indirect jumps.

@penzn Why is the funclets proposal stuck? It exists since October 2018 and it is still in phase 0.

If we were discussing a run-of-the-mill open-source project, I'd fork & be done with it. We’re talking about a far-reaching monopoly standard here. Vigorous community response should be cultivated because we care.

@J0eCool 

  1. Note how much of that is present in musl libc, not directly in emscripten. (Second most used is tests/third_party)

Yes, the nod was to how much it's used in C in general.

  2. Source level constructs are not the same as bytecode instructions

Of course, but what we're discussing is an internal concern that impacts source-level constructs. That's part of the frustration: the black box should not leak its concerns.

  3. Emscripten is not at the same level of abstraction as the wasm standard, so, no it shouldn't be reconsidered on that basis.

The point was that you will find gotos in the majority of sizeable C projects, even within the WebAssembly toolchain at large. A portable compilation target for languages in general that is not expressive enough to target its own compilers is not exactly consistent with the nature of our enterprise.

Specifically, it might be useful today to rewrite the gotos out of libc, because then we'd have more control over the resulting cfg than trusting relooper/cfgstackify to handle it well.

This is circular. Many above have raised serious unanswered questions regarding the infallibility of such a requirement.

We haven't because it's a nontrivial amount of work to wind up with wildly divergent code from upstream musl.

It is possible to remove gotos; as you said, it's a nontrivial amount of work! Are you suggesting everyone else should wildly diverge their code paths because gotos should not be supported?

Emscripten developers (last I checked) tend to be of the opinion that a goto-like structure would be really nice, for those obvious reasons, so are unlikely to drop it from consideration, even if it takes years to reach an acceptable compromise.

A glimmer of hope! I’d be satisfied if goto/label support were taken seriously with a roadmap item + official invitation to get the ball moving, even if years out.

This is a particularly asinine statement.

You're right. Forgive the hyperbole, I am a bit frustrated. I love wasm, and use it often, but I ultimately see a road of pain in front of me if I want to do anything noteworthy with it, like porting TCC. After reading all the comments and articles, I still can't figure out whether the opposition is technical, philosophical or political. As @neelance expressed,

“Could someone in charge of V8 confirm in this thread that the opposition against irreducible control flow is not influenced by V8's current implementation?

I am asking because this is what bugs me most. [...]

If you guys listen to any of us, take @neelance's feedback regarding Go 1.11 to heart. That's hard to argue with. Sure, we can all do the nontrivial work of dusting out our gotos, but even then, we take a serious perf hit that can only be fixed with a goto instruction.

Again, forgive my frustration, but if this issue is closed without being properly addressed, then I'm afraid it will send the wrong kind of signal, one that will only exacerbate these kinds of community responses and is inappropriate for one of the greatest standards efforts of our field. It goes without saying that I am a huge fan and supporter of all on this team. Thank you!

Here is another real world issue that is caused by missing goto/funclets: https://github.com/golang/go/issues/42979

For this program, the Go compiler currently generates a wasm binary with 18,000 nested blocks. The wasm binary itself has a size of 2.7MB, but when I run it through wasm2wat I get a 4.7GB .wat file. 🤯

I could try to give the Go compiler some heuristic so instead of a single huge jump table it could create some kind of binary tree and then look at the jump target variable multiple times. But is this really how it is supposed to be with wasm?
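For context, the "single huge jump table" pattern amounts to encoding the whole control-flow graph in a jump variable. A hedged sketch in C (the Go compiler actually emits Wasm nested blocks with a br_table at the top, not C, and the function name here is invented):

```c
#include <assert.h>

/* Every basic block becomes a case; every branch becomes an assignment to
   the jump variable plus a trip around the dispatch loop. Computes gcd. */
int gcd_via_jump_table(int a, int b) {
    enum { ENTRY, LOOP_HEADER, LOOP_BODY, EXIT } block = ENTRY;
    for (;;) {
        switch (block) {
        case ENTRY:                          /* block 0: fall into the loop */
            block = LOOP_HEADER;
            break;
        case LOOP_HEADER:                    /* block 1: conditional branch */
            block = (b != 0) ? LOOP_BODY : EXIT;
            break;
        case LOOP_BODY: {                    /* block 2: body, then back edge */
            int t = b;
            b = a % b;
            a = t;
            block = LOOP_HEADER;
            break;
        }
        case EXIT:                           /* block 3: function exit */
            return a;
        }
    }
}
```

Every edge in the CFG costs a store to the jump variable plus a pass through the dispatch, which is where the general performance overhead discussed in this thread comes from.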

I would like to add that I find it odd how people seem to think that it's perfectly fine if only a single compiler (Emscripten[1]) can realistically support WebAssembly.
Reminds me somewhat of the libopus situation (a standard that normatively depends on copyrighted code).

I also find it odd how WebAssembly devs seem to be so vehemently against this, despite just about everyone from the compiler end of things telling them it's required. Remember: WebAssembly is a standard, not a manifesto. And fact is, most modern compilers use some form of SSA + basic blocks internally (or something nearly equivalent, with the same properties), which have no concept of explicit loops[2]. Even JITs use something similar, that's how common it is.
The absolute requirement for relooping to happen with no escape hatch of "just use goto" is, to my knowledge[3], unprecedented outside of language-to-language translators --- and even then, only language-to-language translators that target goto-less languages. In particular, I have never heard of this having to be done for any sort of an IR or bytecode, other than WebAssembly.

Perhaps it's time to rename WebAssembly to WebEmscripten (WebScripten?).

As @d4tocchini said, if it wasn't for WebAssembly's (necessary, due to the standardization situation) monopolistic status, it would likely have been forked by now, into something that can reasonably support what the compiler developers already know it needs to support.
And no, "just use emscripten" is not a valid counter-argument, because it makes the standard depend on a single compiler vendor. I hope I don't need to tell you why that's bad.

EDIT: I forgot to add one thing:
You still haven't clarified on whether the issue is technical, philosophical, or political. I suspect the latter, but would gladly be proven wrong (because technical and philosophical issues can be fixed far more easily than political).

Here is another real world issue that is caused by missing goto/funclets: golang/go#42979

For this program, the Go compiler currently generates a wasm binary with 18,000 nested blocks. The wasm binary itself has a size of 2.7MB, but when I run it through wasm2wat I get a 4.7GB .wat file. 🤯

I could try to give the Go compiler some heuristic so instead of a single huge jump table it could create some kind of binary tree and then look at the jump target variable multiple times. But is this really how it is supposed to be with wasm?

This example is really interesting. How does such a simple straight line program generate this code? What's the relationship between the number of array elements and the number of blocks? In particular, should I interpret this as meaning that each array element access requires _multiple_ blocks to be compiled faithfully?

And no, "just use emscripten" is not a valid counter-argument

I think the real counter-argument in this vein would be that another compiler wanting to target Wasm can/must implement their own relooper-like algorithm. Personally, I do think Wasm should eventually have a multi-body loop (close to funclets) or something similar that's a natural target for goto.

@conrad-watt There are several factors that cause each assignment to use several basic blocks in the CFG. One of them is that there is a length check on the slice because the length is not known at compile time. Generally I would say that compilers consider basic blocks as a relatively cheap construct, but with wasm they are somewhat expensive, especially in this particular case.

@neelance in the modified example where the code is split between several functions, the (runtime/compilation) memory overhead is shown to be much lower. Are fewer blocks generated in this case, or is it just that the separate functions mean that engine GC can be more granular?

@conrad-watt It is not even the Go code that is using the memory, but the WebAssembly host: When I instantiate the wasm binary with Chrome 86, my CPU goes to 100% for 2 minutes and the memory usage of the tab peaks at 11.3 GB. This is before the wasm binary / Go code gets executed. It is the shape of the wasm binary that is causing the issue.

That was already my understanding. I'd expect a large number of blocks/type annotations to cause memory overhead specifically during compilation/instantiation.

To try and disambiguate my previous question - if the split version of the code compiles to Wasm with fewer blocks (because of some relooper quirk), that would be one explanation for the reduced memory overhead, and would be a good motivation for adding more general control flow to Wasm.

Alternatively, it may be that the split code results in (roughly) the same total number of blocks, but because each function is separately JIT compiled, the metadata/IR used to compile each function can be more eagerly GC'd by the Wasm engine. A similar issue occured in V8 years ago when parsing/compiling large asm.js functions. In this case, introducing more general control flow to Wasm would not solve the issue.

First I'd like to clarify: The Go compiler is not using the relooper algorithm, because it is inherently incompatible with the concept of switching goroutines. All basic blocks are expressed via a jump table with a bit of fall-through where possible.

I guess there is some exponential complexity growth in Chrome's wasm runtime with regards to the depth of nested blocks. The split version has the same number of blocks but a smaller maximum depth.

In this case, introducing more general control flow to Wasm would not solve the issue.

I agree that this complexity issue can probably be solved at Chrome's end. But I always like to ask the question "Why did this issue exist in the first place?". I would argue that with more general control flow, this issue would have never existed. Also, there is still the significant general performance overhead due to all basic blocks being expressed as jump tables, which I think is unlikely to go away by optimization.

I guess there is some exponential complexity growth in Chrome's wasm runtime with regards to the depth of nested blocks. The split version has the same number of blocks but a smaller maximum depth.

Does this mean that in a straight line function with N array accesses, the final array access will be nested (some constant factor of) N blocks deep? If so, is there a way to reduce this by factoring error-handling code differently? I'd expect any compiler to chug if it has to analyze 3000 nested loops (very rough analogy) so if this is unavoidable for semantic reasons, that would also be an argument for more general control flow.

If the nesting difference is less stark than that, my hunch would be that V8 does almost no GC'ing of metadata _during_ compilation of a single Wasm function, so even if we had something like a tweaked funclets proposal in the language right from the start, the same overheads would still be visible without them doing some interesting GC optimisation.

Also, there is still the significant general performance overhead due to all basic blocks being expressed as jump tables, which I think is unlikely to go away by optimization.

Agree that it's clearly preferable (from a purely technical standpoint) to have a more natural target here.

Does this mean that in a straight line function with N array accesses, the final array access will be nested (some constant factor of) N blocks deep? If so, is there a way to reduce this by factoring error-handling code differently? I'd expect any compiler to chug if it has to analyze 3000 nested loops (very rough analogy) so if this is unavoidable for semantic reasons, that would also be an argument for more general control flow.

The other way around: The first assignment is nested that deeply, not the last. Nested blocks and a single br_table at the top is how a traditional switch statement is expressed in wasm. This is the jump table I mentioned. There are no 3000 nested loops.

If the nesting difference is less stark than that, my hunch would be that V8 does almost no GC'ing of metadata during compilation of a single Wasm function, so even if we had something like a tweaked funclets proposal in the language right from the start, the same overheads would still be visible without them doing some interesting GC optimisation.

Yes, there may also be some implementation that has exponential complexity with regards to the number of basic blocks. But handling basic blocks (even in a large quantity) is what a lot of compilers do all day. For example the Go compiler itself handles this number of basic blocks easily during its compilation, even though they get processed by several optimization passes.

Yes, there may also be some implementation that has exponential complexity with regards to the number of basic blocks. But handling basic blocks (even in a large quantity) is what a lot of compilers do all day. For example the Go compiler itself handles this number of basic blocks easily during its compilation, even though they get processed by several optimization passes.

Sure, but a performance issue here would be orthogonal to how control flow between those basic blocks is expressed in the original source language (i.e. not a motivation for more general control flow in Wasm). To see if V8 is particularly bad here, one could check whether FireFox/SpiderMonkey or Lucet/Cranelift exhibit the same compilation overheads.

I have done some more testing: Firefox and Safari show no issues at all. Interestingly, Chrome is even able to run the code before the intensive process has finished, so it seems like some task not strictly necessary for running the wasm binary is having the complexity issue.

Sure, but a performance issue here would be orthogonal to how control flow between those basic blocks is expressed in the original source language.

I see your point.

I still believe that representing basic blocks not via jump instructions but via a jump variable and a huge jump table / nested blocks is expressing the simple concept of basic blocks in a quite complex way. This leads to performance overhead and a risk of complexity issues such as the one we saw here. I believe that simpler systems are better and more robust than complex systems. I still haven't seen arguments that convince me that the simpler system is a bad choice. I have only heard that V8 would have a hard time implementing arbitrary control flow, and my open question asking someone to show that this statement is wrong (https://github.com/WebAssembly/design/issues/796#issuecomment-623431527) has not been answered yet.

@neelance

Chrome is even able to run the code before the intensive process has finished

It sounds like the baseline compiler Liftoff is ok, and the problem is in the optimizing compiler TurboFan. Please file an issue, or please provide a testcase and I can file one if you prefer.

More generally: Do you think the wasm stack switching plans will be able to solve Go's goroutine implementation issues? That's the best link I can find, but it is quite active now, with a bi-weekly meeting, and several strong use cases that motivate the work. If Go can use wasm coroutines to avoid the big switch pattern then I think arbitrary gotos would not be necessary.

The Go compiler is not using the relooper algorithm, because it is inherently incompatible with the concept of switching goroutines.

It's true that it can't be applied by itself. However, we have good results with using wasm structured control flow + Asyncify. The idea there is to emit normal wasm control flow as much as possible - ifs, loops, etc., without a single big switch - and to add instrumentation on top of that pattern to handle unwinding and rewinding the stack. This leads to fairly small code size, and non-stack switching code can run at basically full speed, while an actual stack switch can be somewhat slower (so this is good for the case where stack switches are not constantly happening on each loop iteration etc.).

I'd be very happy to experiment with that on Go, if you're interested! This would obviously not be as good as built-in stack switching support in wasm, but it could be better than the big switch pattern already. And it would be easier to switch to the built-in stack switching support later. Concretely, how this experiment could work is to make Go emit normally structured code, without worrying about stack switching at all, and just emit a call to a special maybe_switch_goroutine function at appropriate points. The Asyncify transform would take care of all the rest basically.

I'm interested in gotos for dynamic recompiling emulators such as qemu. Unlike other compilers, qemu at no point has knowledge of the program control flow structure, and so gotos are the only reasonable target. Tailcalls might address this, by compiling each block as a function and each goto as a tailcall.

@kripken Thanks for your very helpful post.

It sounds like the baseline compiler Liftoff is ok, and the problem is in the optimizing compiler TurboFan. Please file an issue, or please provide a testcase and I can file one if you prefer.

Here's a wasm binary that you can run with wasm_exec.html.

Do you think the wasm stack switching plans will be able to solve Go's goroutine implementation issues?

Yes, at first glance it seems like this would help.

However, we have good results with using wasm structured control flow + Asyncify.

This looks promising as well. We would need to implement the relooper in Go, but that's fine I guess. One small downside is that it adds a dependency on Binaryen for producing wasm binaries. I'll probably write a proposal soon.

I believe LLVM's stackifier algorithm is easier/better, in case you want to implement that: https://medium.com/leaningtech/solving-the-structured-control-flow-problem-once-and-for-all-5123117b1ee2

I have filed a proposal for the Go project: https://github.com/golang/go/issues/43033

@neelance, nice to see @kripken's suggestion helps a bit with golang + wasm. Considering this issue is one of goto/labels, not stack switching, and given that Asyncify introduces new deps and special-cases builds until stack switching is released, etc. -- would you characterize this as a solution or a less-than-optimal mitigation? How does this compare to the estimated benefits if goto instructions were available?

If Linus Torvalds' “Good Taste” argument for linked lists rests on the elegance of removing a sole special-cased branch statement, it's hard to see this kind of special-casing gymnastics as a win or even a step in the right direction. Having personally used gotos for async-like apis in C, to talk about stack switching before goto instructions triggers all kinds of smells.

Please correct me if I'm misreading, but apart from seemingly flyby responses focused on marginal particularities of some questions raised, it appears the maintainers here haven't offered any clarity on the matter at hand nor answered the hard questions. With all due respect, is not this sluggish ossification the hallmark of callous corporate politics? If this is the case, I understand the plight... Imagine all the languages/compilers that Wasm's brand could boast support for if only ANSI C were a compatibility litmus test!

@neelance @darkuranium @d4tocchini not all Wasm contributors think the lack of gotos is the right thing; in fact, I'd personally rate it as Wasm's #1 design mistake. I am absolutely in favor of adding it (either as funclets or directly).

However, debating on this thread is not going to make gotos happen, and not magically going to make everyone involved in Wasm convinced and do the work for you. Here are the steps to take:

  1. Join the Wasm CG.
  2. Someone invest the time to become champion of a goto proposal. I recommend starting from the existing funclets proposal, as it has already been well-thought out by @sunfishcode to be the "least intrusive" to current engines & tools that rely on block structure, so it has a higher chance to succeed than a raw goto.
  3. Help push it through the 4 proposal stages. This includes making good designs for whatever objections get thrown your way, and initiating discussions, with the objective of getting enough people happy that you get majority votes when advancing through the stages.

@d4tocchini Honestly, I currently see the suggested solutions as "the best way forward given circumstances that I can't change" aka "workaround". I still consider jump/goto instructions (or funclets) as the simpler way and thus preferable. (Still thanks to @kripken for helpfully suggesting the alternatives.)

@aardappel As far as I know, @sunfishcode tried to push the funclets proposal and failed. Why would it be different for me?

@neelance I don't think @sunfishcode has had much time to push the proposal beyond its initial creation, so it is "stalled" rather than "failed". As I was trying to indicate, it requires a champion to do continuous work for a proposal to get all the way thru the pipeline.

@neelance

Thanks for the testcase! I can confirm the same problem locally. I filed https://bugs.chromium.org/p/v8/issues/detail?id=11237

We would need to implement the relooper in Go [..] One small downside is that it adds a dependency to binaryen for producing wasm binaries.

Btw, if it would help, we can make a library build of binaryen as a single C file. Maybe that's easier to integrate?

Also, using Binaryen you can use the Relooper implementation that is there. You can pass it basic blocks of IR and let it do the relooping.

@taralx

I believe LLVM's stackifier algorithm is easier/better,

Note that that link isn't about upstream LLVM, that's the Cheerp compiler (which is a fork of LLVM). Their Stackifier has a similar name to LLVM's, but is different.

Note also that that Cheerp post refers to the original algorithm from 2011 - the modern relooper implementation (as mentioned earlier) hasn't had the problems they mention for many years. I'm not aware of a simpler or better alternative to that general approach, which is very similar to what Cheerp and others do - these are variations on a theme.

@kripken Thanks for filing the issue.

Btw, if it would help, we can make a library build of binaryen as a single C file. Maybe that's easier to integrate?

Unlikely. The Go compiler itself was converted to pure Go a while ago and afaik it uses no other C dependencies. I don't think that this will be an exception.

Here's the current state of the funclets proposal: The next step in the process is to call for a CG vote to enter stage 1.

I myself am currently focused on other areas in WebAssembly and don't have the bandwidth to push funclets forward; if anyone is interested in taking over the Champion role for funclets, I'd be happy to hand it off.

Unlikely. The Go compiler itself was converted to pure Go a while ago and afaik it uses no other C dependencies. I don't think that this will be an exception.

Besides, this doesn't solve the problem of extensive use of relooper causing serious performance cliffs in the WebAssembly runtimes.

@Vurich

I think that could be the best case for adding gotos to wasm, but someone would need to collect compelling data from real-world code showing such serious perf cliffs. I haven't seen such data myself. Work analyzing wasm perf deficits like "Not So Fast: Analyzing the Performance of WebAssembly vs. Native Code" (2019) doesn't support control flow being a significant factor either (they do note a larger amount of branch instructions, but those are not due to structured control flow - rather it's due to safety checks).

@kripken Do you have any suggestions on how one could collect such data? How would one show that a perf deficit is due to structured control flow?

Unlikely that there's much work analyzing the performance of the compilation stage, which is part of the complaint here.

I'm somewhat surprised that we don't have a switch case construct yet, but funclets subsume that.

@neelance

It's not easy to figure out the specific causes, yeah. For e.g. bounds checks you can just disable them in the VM and measure that, but there isn't a simple way to do the same for gotos, sadly.

One option is to compare the emitted machine code by hand, which is what they did in that linked paper.

Another option is to compile the wasm to something that you believe can handle control flow optimally, that is, "undo" the structuring. LLVM should be able to do that, so running wasm in a VM that uses LLVM (like WAVM or wasmer) or through WasmBoxC could be interesting. You could perhaps disable CFG optimizations in LLVM and see how much that matters.

@taralx

Interesting, did I miss something about compile times or memory usage? Structured control flow should actually be better there - e.g. it's very simple to go to SSA form from that, compared to from a general CFG. This was in fact one of the reasons wasm went with structured control flow in the first place. That is also measured very carefully because it affects load times on the Web.

(Or do you mean compiler performance on the developer's machine? It's true that wasm does lean in the direction of doing more work there, and less on the client.)

I meant compile performance in the embedder, but it seems that that is being treated as a bug, not necessarily a pure perf issue?

@taralx

Yes, I think that's a bug. It just happens in one tier on one VM. And there is no fundamental reason for it - structured control flow doesn't require more resources, it should require fewer. That is, I'd bet such perf bugs would be more likely to happen if wasm did have gotos.

@kripken

Structured control flow should actually be better there - e.g. it's very simple to go to SSA form from that, compared to from a general CFG. This was in fact one of the reasons wasm went with structured control flow in the first place. That is also measured very carefully because it affects load times on the Web.

A very specific question, just in case: Do you know of any Wasm compiler which actually does that - "very simple" going from "structured control flow" to SSA form? Because from a quick look, Wasm's control flow is not that (fully/ultimately) structured. Formally structured control flow is control flow with no breaks, continues, or returns (roughly, Scheme's model of programming, without magic like call/cc). When those are present, such control flow can roughly be called "semi-structured".

There's a well-known SSA algorithm for fully structured control flow: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.45.4503. Here's what it has to say about semi-structured control flow:

For structured statements we have shown how to generate both SSA form and the dominator tree in a single pass during parsing. In the following section we will show that it is even possible to extend our method to a certain class of unstructured statements (LOOP/EXIT and RETURN) that may cause exits from control structures at arbitrary points. However, since such exits are a kind of (disciplined) goto it is not surprising that they are much harder to handle than structured statements.

OTOH, there's another well-known algorithm, https://pp.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf, which is arguably also single-pass, yet has no problems not just with non-structured control flow, but even with irreducible control flow (albeit it doesn't produce an optimal result for it).
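For concreteness, the core recursion of the Braun et al. algorithm can be sketched roughly like this (simplified illustrative Python, not from any real compiler; the full algorithm also removes trivial phis and "seals" blocks whose predecessors aren't yet all known):

```python
# Rough sketch of on-the-fly SSA construction in the style of Braun et al.
# Heavily simplified: no trivial-phi removal, no block sealing, and values
# are just tuples/lists instead of real IR nodes.

defs = {}    # (block, var) -> SSA value currently defined there
preds = {}   # block -> list of predecessor blocks

def write_variable(block, var, value):
    defs[(block, var)] = value

def read_variable(block, var):
    if (block, var) in defs:
        return defs[(block, var)]          # local definition wins
    ps = preds.get(block, [])
    if not ps:
        return ('undef', var)              # no definition reaches here
    if len(ps) == 1:
        val = read_variable(ps[0], var)    # single predecessor: no phi
    else:
        phi = ['phi', block, var, []]      # placeholder breaks loop cycles
        defs[(block, var)] = phi
        phi[3] = [read_variable(p, var) for p in ps]
        val = phi
    defs[(block, var)] = val
    return val
```

Reading a variable inside a loop first installs the phi placeholder, so the recursion through the backedge terminates by finding the phi itself - which is why even irreducible control flow poses no special problem for this family of algorithms.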

So, the question is again whether you know that some project went thru the trouble of actually extending the Brandis/Mössenböck algorithm, and achieved tangible benefits on that route compared to the Braun et al. algorithm (as a side note, my intuitive hunch is that the Braun algorithm is exactly such an "upper bound" extension, though I can't prove that even intuitively to myself, not to speak of a formal proof - so that's all it is, an intuitive hunch).

And the general theme of the question is to establish (though I'd say "maintain") the ultimate reason why Wasm opted out of arbitrary goto support. Because watching this thread for years, the mental model I built is that it was done to avoid facing irreducible CFGs. And indeed, the chasm lies between reducible and irreducible CFGs, with many optimization algorithms being (much) easier for reducible CFGs, and that's the assumption many optimizers have coded in. (Semi)structured control flow in Wasm is just a cheap way to guarantee reducibility.

Mentioning any special ease of SSA production for structured CFGs (and Wasm CFGs don't really seem to be structured in the formal sense) somehow clouds the clear picture above. That's why I'm asking whether there are specific references showing that SSA construction practically benefits from the Wasm CFG form.

Thanks.

@kripken I am a bit confused right now and eager to learn. I am looking at the situation and I currently see the following:


The source of your program has a certain control flow. This CFG is either reducible or it is not - depending, for example, on whether goto was used in the source language. There is no way to change this fact. This CFG can be turned into machine code, e.g. as the Go compiler natively does.

If the CFG is already reducible, then all is well and the wasm VM can load it quickly. Any translation pass should be able to detect that this is the simple case and do the quick thing. Allowing irreducible CFGs should not slow down this case.

If the CFG is not reducible, then there are two options:

  • The compiler makes it reducible, e.g. by introducing a jump table. This step loses information. It is hard to restore the original CFG without an analysis that is specific to the compiler that produced the binary. Because of this loss of information, any machine code generated will be somewhat slower than the code generated from the initial CFG. We may be able to generate this machine code with a single-pass algorithm, but at the cost of information loss. [1]

  • We allow the compiler to emit an irreducible CFG. The VM may have to make it reducible. This slows down the load time, but only in cases where the CFG is actually not reducible. The compiler has the option to choose between optimizing for load time performance or run time performance.

[1] I am aware that it is not really a loss of information if there is still some way to reverse the operation, but I couldn't describe it in a better way.


Where is the flaw in my thinking?
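To illustrate the jump-table option in the first bullet above, here is a minimal sketch (illustrative Python, all names invented) of a loop with two entry points - an irreducible shape - encoded reducibly by moving the branch targets into a dispatch variable:

```python
# A loop that can be entered at either block A or block B is irreducible.
# The standard workaround is a single dispatch loop: which block runs next
# becomes *data* (the 'label' variable), so the structure is reducible -
# at the cost of an extra branch through the dispatcher on every step.

def run(start, n):
    label = start              # 'A' or 'B': the two original entry points
    total = 0
    while True:                # single loop header: reducible by construction
        if label == 'A':
            total += 1
            label = 'B'
        elif label == 'B':
            n -= 1
            label = 'A' if n > 0 else 'EXIT'
        else:
            return total
```

This is exactly the loss of information the bullet describes: the original edges between A and B are no longer visible in the structure, so a consumer would have to recover them by analyzing the dispatch variable.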

@pfalcon

Do you know of any Wasm compiler which actually does that - "very simple" going from "structured control flow" to SSA form.

About VMs: I don't know directly. But IIRC back in the day @titzer and @lukewagner said it was convenient to implement that way - perhaps one of them can elaborate. I'm not sure if irreducibility was the whole issue there or not. And I'm not sure if they've implemented those algorithms you mention or not.

About things other than VMs: The Binaryen optimizer definitely benefits from wasm's structured control flow, and not just that it is reducible. Various optimizations are simpler because we always know where loop headers are, for example, which are annotated in the wasm. (OTOH other optimizations are harder to do, and we do have a general CFG IR as well for those...)

@neelance

If the CFG is already reducible, then all is well and the wasm VM can load it quickly. Any translation pass should be able to detect that this is the simple case and do the quick thing. Allowing irreducable CFGs should not slow down this case.

Maybe I'm not fully understanding you. But that a wasm VM can load code quickly depends not only on whether it is reducible or not, but on how it is encoded. Specifically, we could have imagined a format that is a general CFG, and then the VM needs to do the work to verify it is reducible. Wasm opted to avoid that work - the encoding is necessarily reducible (that is, as you read the wasm and do the trivial validation, you are also proving it is reducible without doing any extra work).

In addition, wasm's encoding doesn't just give a guarantee of reducibility without needing to verify that. It also annotates loop headers, ifs, and other useful things (as I happened to mention separately earlier in this comment). I'm not sure offhand how much production VMs benefit from that, but I'd expect they do. (Perhaps especially in baseline compilers?)

Overall I think that allowing irreducible CFGs can slow down the fast case, unless irreducible ones are encoded in a separate way (like funclets are proposed to).

@kripken

Thanks for your explanation.

Yes, this is exactly the differentiation I am trying to make: I see the advantage of the structured notation/encoding for the reducible CFG case. But it shouldn't be hard to add some construct that allows the notation of an irreducible CFG and still keep the existing advantages in case of a reducible source CFG (for example if you don't use this new construct, then the CFG is guaranteed to be reducible).

As a conclusion I don't see how one can argue that a purely reducible notation is faster. In the case of a reducible source CFG it is just equally fast. And in the case of an irreducible source CFG one can at most argue that it is not significantly slower, but a few real-world cases have already shown that this is unlikely to be the case in general.

In short, I don't see how performance considerations can be an argument that prevents irreducible control flow and that makes me question why the next step needs to be collecting perf data.

@neelance

Yes, I agree that we could add a new construct - like funclets - and by not using it, it wouldn't slow down the existing case.

But there is a downside to adding any new construct since it adds complexity to wasm. In particular it means a larger surface area on VMs which means more possible bugs and security issues. Wasm has leaned towards having as much complexity on the developer's side where possible in order to reduce complexity on the VM.

Some wasm proposals are not just about speed, like GC (which allows cycle collection with JS). But for proposals that are about speed, like funclets, we need to show that the speed justifies the complexity. We had this debate about SIMD which is also about speed, and decided it was worth it because we saw it can reliably achieve very large speedups on real-world code (2x or even more).

(There are other benefits than speed to allowing general CFGs, I'd agree, like making it easier for compilers to target wasm. But we can solve that without adding complexity to wasm VMs. We already provide support for arbitrary CFGs in LLVM and Binaryen, allowing compilers to emit CFGs and not worry about structured control flow. If that's not good enough, we - tools people I mean, including me - should do more.)

Funclets aren't about speed as much as they're about allowing languages with non-trivial control flow to compile to WebAssembly, C and Go being the most obvious, but it applies to any language that has async/await. Also, the choice to have hierarchical control flow actually leads to _more_ bugs in VMs, as evidenced by the fact that all Wasm compilers other than V8 decompose the hierarchical control flow to a CFG anyway. The EBBs in a CFG can represent the multiple control flow constructs in Wasm and more, and having a single construct to compile leads to far fewer bugs than having many different kinds with different uses.

Even Lightbeam, a very simple, streaming compiler, saw a massive decrease in miscompilation bugs after adding an extra translation step that decomposed the control flow to a CFG. This goes double for the other side of this process - Relooper is far more error-prone than emitting funclets, and I've been told by developers working on the Wasm backends for LLVM and other compilers that were funclets to be implemented they would emit every function using funclets alone, in order to improve the reliability and simplicity of codegen. All of the compilers producing Wasm use EBBs, all but one of the compilers consuming Wasm use EBBs, this refusal to implement funclets or some other way of representing CFGs is simply adding a lossy step in between that harms all parties involved other than the V8 team.

"Irreducible control flow considered harmful" is merely a talking point, you can easily add the restriction that funclets' control flow be reducible and then if you want to allow irreducible control flow in the future all existing Wasm modules with reducible control flow would work unmodified on an engine that additionally supports irreducible control flow. It would simply be a case of removing the reducibility check in the validator.

@Vurich

You can easily add the restriction that funclets' control flow be reducible

You can, but it's not trivial - VMs would need to verify that. I don't think that's possible in a single linear pass, which would be a problem for baseline compilers, which are now present in most VMs. (In fact, just finding loop backedges - which is a simpler problem, and necessary for other reasons too - can't be done in a single forward pass, can it?)

all Wasm compilers other than V8 decompose the hierarchical control flow to a CFG anyway.

Are you referring to the "sea of nodes" approach that TurboFan uses? I'm not an expert on that so I'll leave it to others to respond.

But more generally, even if you don't buy the above argument for optimizing compilers, it's even more directly true for baseline compilers, as mentioned earlier.

Funclets aren't about speed as much as they're about allowing languages with non-trivial control flow to compile to WebAssembly [..] Relooper is far more error-prone than emitting funclets

I agree 100% on the tools side. It is harder to emit structured code from most compilers! But the point is that it makes it simpler on the VM side, and that's what wasm chose to do. But again, I agree that this has tradeoffs, including the downsides you mentioned.

Did wasm get this wrong back in 2015? It's possible. I think we got some things wrong myself (like debuggability, and the late switch to a stack machine). But it's not possible to fix those in retrospect, and there is a high bar for adding new things, especially overlapping ones.

Given all that, trying to be constructive, I think we should fix existing issues on the tools side. There is a much, much lower bar for tools changes. Two possible suggestions:

  • I can look into porting the Binaryen CFG code to Go, if that would help the Go compiler - @neelance ?
  • We can implement funclets or something like them purely on the tools side. That is, we provide library code for this today, but could also add a binary format. (There is already precedent for adding to the wasm binary format on the tools side, in wasm object files.)

We can implement funclets or something like them purely on the tools side. That is, we provide library code for this today, but could also add a binary format. (There is already precedent for adding to the wasm binary format on the tools side, in wasm object files.)

If there's any concrete work done on this, it's worth noting that (AFAIU) the smallest idiomatic way to add this to Wasm (as @rossberg alluded to) would be to introduce the block instruction

`multiloop (t_in)^n t_out (instr* end)^n`

which defines n labelled bodies (with n input type annotations forward-declared). The br family of instructions are then generalised so that all labels defined by the multiloop are in-scope within each body, in-order (as in, any body can be branched to from within any other body). When a multiloop body is branched to, execution jumps to the _start_ of the body (like a regular Wasm loop). When execution reaches the end of a body without branching to another body, the whole construct returns (no fall-through).

There would be some bikeshedding to be done about how to efficiently represent the type annotations of each body (in the formulation above, n bodies can have n different input types, but must all have the same output type, so I can't directly use regular multi-value _blocktype_ indices without requiring a superfluous-feeling LUB calculation), and how to select the initial body to be executed (always the first, or should there be a static parameter?).

This gets the same level of expressivity as funclets but avoids having to introduce a new space of control instructions. In fact if funclets had been iterated on further I think it would have turned into something like this.

EDIT: tweaking this to have fall-through behaviour would marginally complicate the formal semantics, but would probably be better for @neelance's use case, and could help hint to a baseline compiler what the on-trace control flow path is.
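For concreteness, a hypothetical text-format sketch of the construct described above (the syntax is entirely invented for illustration - none of this is valid Wasm today):

```wasm
;; Invented syntax - not valid Wasm. Two bodies with forward-declared
;; input types and one shared output type. br N jumps to the *start*
;; of body N from within any body; reaching the end of a body (without
;; fall-through, as specified above) exits the whole multiloop.
multiloop (in i32) (in i32) (out i32)
  ;; body 0 (assume execution starts here)
  ...
  br 1        ;; jump to the start of body 1
end
  ;; body 1
  ...
  br 0        ;; jump back to the start of body 0 - a cross-body loop
end
```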

The Wasm design principle of offloading work onto the tools to make engines simpler/faster is very important, and will continue to be very beneficial.

That said, like everything non-trivial, it's a trade-off, not black and white. I believe here we have a case where the pain for producers is disproportionate to the pain for engines. Most compilers we'd like to bring to Wasm either use arbitrary CFG structures internally (SSA) or are used to target things that don't mind gotos (CPUs). We're making the world jump thru hoops for not a lot of gain.

Something like funclets (or multiloop) is nice because it is modular: if a producer doesn't need it then things will work as before. If an engine really can't deal with arbitrary CFGs then for the moment they can emit it as if it was a loop + br_table kind of construct, and only those that use it pay the price. Then, "the market decides" and we see if there is pressure on engines to emit better code for it. Something tells me that if there's going to be a lot of Wasm code that relies on funclets, it really isn't going to be as big of a disaster for engines to emit good code for them as some people seem to think.

You can, but it's not trivial - VMs would need to verify that. I don't think that's possible in a single linear pass, which would be a problem for baseline compilers, which are now present in most VMs.

Maybe I'm misunderstanding the expectations for a baseline compiler, but why would they care? If you see a goto, insert a jump instruction.

I agree 100% on the tools side. It is harder to emit structured code from most compilers! But the point is that it makes it simpler on the VM side, and that's what wasm chose to do. But again, I agree that this has tradeoffs, including the downsides you mentioned.

No, as I say multiple times in my original comment, it _does not_ make things easier on the VM side. I worked on a baseline compiler for over a year and my life got easier and the emitted code got faster after I added an interim step that converted Wasm's control flow to a CFG.

You can, but it's not trivial - VMs would need to verify that. I don't think that's possible in a single linear pass, which would be a problem for baseline compilers, which are now present in most VMs. (In fact, just finding loop backedges - which is a simpler problem, and necessary for other reasons too - can't be done in a single forward pass, can it?)

Ok, here's the thing: my knowledge of the algorithms used in compilers isn't strong enough to state with absolute certainty that irreducible control flow can or cannot be detected in a streaming compiler, but the thing is that it doesn't need to be. Verification can happen in tandem with compilation. If a streaming algorithm does not exist - and neither you nor I know that it doesn't - you can use a non-streaming algorithm once the function has been received fully. If (for some reason) irreducible control flow leads to something truly bad like an infinite loop, you can simply time out and/or cancel the compilation thread. However, there is no reason to believe that this would be the case.

Maybe I'm misunderstanding the expectations for a baseline compiler, but why would they care? If you see a goto, insert a jump instruction.

It's not that simple because of how you need to map the infinite register machine of Wasm (no, it's not a stack machine) to the finite registers of the physical hardware, but that's a problem that any streaming compiler must solve and it's entirely orthogonal to CFGs vs hierarchical control flow.

The streaming compiler that I worked on can compile an arbitrary - even irreducible - CFG just fine. It's not doing anything particularly special. You simply assign each block a "calling convention" (basically the place where the values in scope in that block should be) when you first need to jump to it, and if you ever get to a point where you need to conditionally branch to two or more targets with incompatible "calling conventions" you push an "adapter" block to a queue and emit it at the next possible point. This can happen both with reducible and irreducible control flow, and it's almost never necessary in either case. The "irreducible control flow considered harmful" argument, as I said before, is a talking point and not a technical argument. Representing control flow as a CFG makes streaming compilers far easier to write, and as I have said multiple times, I know this from extensive personal experience.
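The scheme described above can be sketched in a few lines (illustrative Python of my own, not Lightbeam's actual code; a "location" here stands in for a register or stack slot):

```python
# Sketch of per-block "calling conventions" with adapter blocks.
# The first branch to a block fixes where its live values must be;
# any later branch arriving with values elsewhere gets an adapter
# block (queued, emitted later) that shuffles values into place.

conventions = {}   # block id -> tuple of locations for its live values
adapters = []      # queued (tag, from_locs, to_locs, target) shuffle blocks

def branch(target, current_locs):
    locs = tuple(current_locs)
    if target not in conventions:
        conventions[target] = locs        # first branch fixes the convention
        return target
    if conventions[target] == locs:
        return target                     # conventions match: direct jump
    adapter = ('adapter', locs, conventions[target], target)
    adapters.append(adapter)              # emit the shuffle code later
    return adapter
```

Nothing here depends on whether the CFG is reducible - the adapter is only needed when two branches to the same block disagree on value locations, which is rare either way.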

Any cases in which irreducible control flow makes implementations harder to write - of which I can think of none - can just be stubbed out to return an error, and if you need a separate, non-streaming algorithm to detect with 100% certainty that control flow is irreducible (so you don't accidentally accept irreducible control flow), then that can run separately from the baseline compiler itself. I've been told by someone who I have reason to believe is an authority on the subject (although I'll avoid invoking them because I know they don't want to be dragged into this thread) that there exists a relatively simple streaming algorithm for detecting irreducibility of a CFG, but I cannot say first-hand that this is true.

@oridb

Maybe I'm misunderstanding the expectations for a baseline compiler, but why would they care? If you see a goto, insert a jump instruction.

Baseline compilers still need to do things like insert extra checks at loop backedges (that's how on the Web a hanging page will show a slow script dialogue eventually), so they need to identify things like that. Also, they do try to do reasonably-efficient register allocation (baseline compilers often run at about 1/2 the speed of the optimizing compiler - which is very impressive given they are single-pass!). Having the structure of control flow, including joins and splits, makes that much easier.

@gwvo

That said, like everything non-trivial, its a trade-off, not black and white. [..] We're making the world jump thru hoops for not a lot of gain.

Totally agree it's a tradeoff, and even maybe wasm got it wrong back then. But I believe it is much more practical to fix those hoops on the tools side.

Then, "the market decides" and we see if there is pressure on engines to emit better code for it.

This is actually something we have avoided so far. We've tried to make wasm as simple as possible on the VM so it doesn't require complex optimizations - not even things like inlining, as much as possible. The goal is to do the hard work on the tools side, not to pressure VMs to do better.

@Vurich

I worked on a baseline compiler for over a year and my life got easier and the emitted code got faster after I added an interim step that converted Wasm's control flow to a CFG.

Very interesting! Which VM was that?

I'd also be specifically curious if it was a single-pass/streaming or not (if it was, how did it handle loop backedge instrumentation?), and how it does register allocation.

In principle, both loop backedges and register allocation can be handled based on linear instruction order, in the expectation that basic blocks will be put in some reasonable topsort-like order, without strictly requiring it.

For loop backedges: Define a backedge as an instruction that jumps to earlier in the instruction stream. At worst, if the blocks are laid out backwards, you get more backedge checks than strictly needed.

For register allocation: This is just standard linear scan register allocation. A variable's lifetime for register allocation spans from the first mention of the variable to the last mention, including all blocks that are linearly in between. At worst, if the blocks are shuffled around, you get longer lifetimes than needed and thus unnecessarily spill things to the stack. The only extra cost is tracking the first and last mention of each variable, which can be done for all variables with a single linear scan. (For wasm I suppose a "variable" is either a local or a stack slot.)
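Both ideas can be sketched in a single pass over the instruction stream (illustrative Python; the instruction encoding and "variables" are invented for the example):

```python
# One linear pass: record each variable's [first, last] mention for
# linear-scan lifetimes, and flag any jump to an earlier index as a
# backedge (possibly over-approximating if blocks are laid out badly).

def analyze(instrs):
    """instrs: list of ('use', var), ('def', var) or ('jump', target_index)."""
    ranges = {}      # var -> (first_index, last_index)
    backedges = []   # indices of instructions that jump backwards
    for i, ins in enumerate(instrs):
        if ins[0] in ('use', 'def'):
            var = ins[1]
            first, _ = ranges.get(var, (i, i))
            ranges[var] = (first, i)
        elif ins[0] == 'jump' and ins[1] <= i:
            backedges.append(i)
    return ranges, backedges
```

Both over-approximations are safe: extra backedge checks and longer lifetimes cost some runtime performance but never correctness.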

@kripken

I can look into porting the Binaryen CFG code to Go, if that would help the Go compiler - @neelance ?

For integrating Asyncify? Please comment on the proposal.

@comex

Good points!

The only extra cost is tracking the first and last mention of each variable

Yes, I think that's a significant difference. Linear scan register allocation is better (but slower to do) than what wasm baseline compilers currently do, as they compile in a streaming manner which is really fast. That is, there is no initial step to find the last mention of each variable - they compile in a single pass, emitting code as they go without even seeing code later on in the wasm function, helped by the structure, and also they make simple choices as they go ("stupid" is the word used in that post).

V8's streaming approach to register allocation should work just as well if blocks are allowed to be mutually recursive (as in https://github.com/WebAssembly/design/issues/796#issuecomment-742690194), since the only lifetimes they deal with are bound within a single block (stack) or assumed to be function-wide (local).

IIUC (with reference to @titzer's comment) the main issue for V8 lies in the kind of CFGs that Turbofan can optimise.

@kripken

We've tried to make wasm as simple as possible on the VM so it doesn't require complex optimizations

This is not a "complex optimisation".. gotos are incredibly basic and natural to a lot of systems. I bet there's a lot of engines who would be able to add this at no cost. All I am saying that if there's engines that want to hold on to a structured CFG model for whatever reason, they can.

For example, I'm pretty sure LLVM (by far our #1 Wasm producer currently) will not switch to using funclets until it's confident that it's not a performance regression in major engines.

@kripken It's part of Wasmtime. Yes, it's streaming and was intended to be O(N) complexity but I moved to a new company before that was fully realised so it's only "O(N)-ish". https://github.com/bytecodealliance/wasmtime/tree/main/crates/lightbeam

Thanks @Vurich , interesting. It would be great to see perf numbers when those are available, especially for startup but also throughput. I would guess that your approach would compile more slowly than the approach taken by the V8 and SpiderMonkey engineers, while emitting faster code. So it's a different tradeoff in this space. It does seem plausible that your approach does not benefit from wasm's structured control flow, as you said, while theirs does.

No, it’s a streaming compiler and emits code faster than either of those two engines (although there are degenerate cases that weren’t fixed at the time that I left the project). While I did my best to emit fast code, it’s primarily designed to emit code quickly with the efficiency of the output being of secondary concern. Startup cost is, to my knowledge, zero (above Wasmtime's inherent cost that is shared between backends) because every data structure starts uninitialised and compilation is done instruction-by-instruction.

While I don’t have numbers to compare to V8 or SpiderMonkey to hand, I do have numbers to compare to Cranelift (the primary engine in wasmtime). They’re several months out of date at this point but you can see that not only does it emit code faster than Cranelift, it also emits faster code than Cranelift. At the time, it also emitted faster code than SpiderMonkey, although you’ll have to take my word for that so I won’t blame you if you don’t believe me.

While I don’t have more-recent numbers to hand, I believe that the state now is that both Cranelift and SpiderMonkey fixed the small handful of bugs that were the main source of their low-performing output in these microbenchmarks when compared to Lightbeam, but the compilation speed differential didn’t change the whole time I was on the project because each compiler is still fundamentally architected the same, and it’s the respective architecture that leads to the different levels of performance. While I appreciate your speculation, I don’t know where your assumption that the method that I outlined would be slower comes from.

Here are the benchmarks, the ::compile benchmarks are for compilation speed and the ::run benchmarks are for execution speed of the machine code output. https://gist.github.com/Vurich/8696e67180aa3c93b4548fb1f298c29e

The methodology is here, you can clone it and rerun the benchmarks to confirm the results for yourself but the PR will likely be incompatible with the latest version of wasmtime so it'll only show you the comparison of performance at the time I last updated the PR. https://github.com/bytecodealliance/wasmtime/pull/1660

That being said, my argument is _not_ that CFGs are a useful internal representation for performance in a streaming compiler. My argument is that CFGs don’t negatively affect performance in any compiler, and certainly not to the level that would justify entirely locking the GCC and Go teams out of producing WebAssembly at all. Almost no-one in this thread arguing against funclets or a similar extension to wasm has actually worked on the projects that they claim will be negatively affected by this proposal. That's not to say that you need first-hand experience to comment on this topic - I think everyone has some level of valuable input - but there is a line between having a different opinion on the colour of the bikeshed and making claims based on nothing more than idle speculation.

@Vurich

No, it’s a streaming compiler and emits code faster than either of those two engines (although there are degenerate cases that were never fixed because I left the project).

Sorry if I wasn't clear enough earlier. To be sure we're talking about the same thing, I meant the baseline compilers in those engines. And I am talking about compile time, which is the point of baseline compilers in the sense that V8 and SpiderMonkey use the term.

The reason I am skeptical you can beat V8 and SpiderMonkey baseline compile times is because, as in the links I gave earlier, those two baseline compilers are extraordinarily tuned for compile time. In particular they don't generate any internal IR, they just go straight from wasm to machine code. You said that your compiler does emit an internal IR (for a CFG) - I'd expect your compile times to be slower just because of that (due to more branching, memory bandwidth, etc.).

But please benchmark against those baseline compilers! I'd love to see data showing my guess is wrong, and I'm sure so would the V8 and SpiderMonkey engineers. It would mean you've found a better design that they should consider adopting.

To test against V8, you can run `d8 --liftoff --no-wasm-tier-up`, and for SpiderMonkey you can run `sm --wasm-compiler=baseline`.

(Thank you for the instructions for comparing to Cranelift, but Cranelift isn't a baseline compiler, so comparing compile times against it isn't relevant in this context. Very interesting otherwise though, I agree.)

My intuition is that baseline compilers would not have to significantly change their compilation strategy to support funclets/multiloop, since they don't attempt to do meaningful inter-block optimisation anyway. The relied-on "structure of control flow, including joins and splits" referenced by @kripken is satisfied by requiring all input types for a collection of mutually-recursive blocks to be forward-declared (which seems the natural choice for streaming validation anyway). Whether Lightbeam/Wasmtime can beat engine baseline compilers doesn't factor into this; the important point is whether engine baseline compilers can stay just as fast as they are now.

FWIW, I would be interested to see this feature brought up for discussion in a future CG meeting, and I broadly agree with @Vurich that engine reps can object for themselves if they aren't prepared to implement it. That being said, we should take any such objection seriously (I've previously opined at in-person meetings that in pursuing this feature we should try to avoid a WebAssembly version of the JavaScript Proper Tail Calls saga). I'd be happy to make the first move on such a CG discussion myself in the new year once I've finished my (currently very very late) thesis submission.

@kripken

Yes, I think that's a significant difference. Linear scan register allocation is better (but slower to do) than what wasm baseline compilers currently do, as they compile in a streaming manner, which is really fast. That is, there is no initial step to find the last mention of each variable - they compile in a single pass, emitting code as they go without even seeing code later on in the wasm function, helped by the structure, and also they make simple choices as they go ("stupid" is the word used in that post).

Wow, that really is very simple.

On the other hand… that particular algorithm is so simple that it doesn't depend on any deep properties of structured control flow. It barely even depends on shallow properties of structured control flow.

As the blog post mentions, SpiderMonkey's wasm baseline compiler doesn't preserve register allocator state through "control-flow joins" (i.e. basic blocks with multiple predecessors), instead using a fixed ABI, i.e. a fixed mapping from the wasm stack to the native stack and registers. I discovered through testing that it also uses a fixed ABI when entering blocks, even though that's not a control-flow join in most cases!

The fixed ABI is as follows (on x86):

  • If there is a nonzero number of parameters (when entering a block) or returns (when exiting a block), then the top of the wasm stack goes in rax, and the rest of the wasm stack corresponds to the x86 stack.
  • Otherwise, the entire wasm stack corresponds to the x86 stack.

Why does this matter?

Because this algorithm could work almost the same way with much less information. As a thought experiment, imagine an alternate-universe version of WebAssembly where there were no structured control flow instructions, just jump instructions, similar to native assembly. It would have to be augmented with just one piece of extra information: a way to tell which instructions are the targets of jumps.

Then the algorithm would just be: go through instructions linearly; before jumps and jump targets, flush registers to the fixed ABI.

The one difference is that there would have to be a single fixed ABI, not two. It couldn't distinguish between the top-of-stack value being semantically the 'result' of a jump, versus just being left on the stack from an outer block. So it would have to unconditionally put the top-of-stack in rax.

But I doubt this would have any measurable cost to performance; if anything, it might be an improvement.

(The verification would also be different but still single-pass.)
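A minimal sketch of this single-pass scheme, with an entirely invented instruction encoding: each instruction is an (opcode, argument) tuple, and `jump_targets` is the one extra piece of information the thought experiment assumes, namely the set of instruction indices that are targets of jumps.

```python
def compile_stream(instrs, jump_targets):
    """Single pass: emit pseudo machine ops, flushing register state to a
    fixed ABI before every jump and at every jump target."""
    out = []
    regs_dirty = False  # whether any values currently live outside the fixed ABI
    for i, (op, arg) in enumerate(instrs):
        if i in jump_targets and regs_dirty:
            # All incoming edges must agree on the register state here.
            out.append(("flush_to_fixed_abi", None))
            regs_dirty = False
        if op in ("jmp", "br_if"):
            if regs_dirty:
                # Conform to the fixed ABI before transferring control.
                out.append(("flush_to_fixed_abi", None))
                regs_dirty = False
            out.append((op, arg))
        else:
            # Straight-line code may keep values in scratch registers.
            out.append((op, arg))
            regs_dirty = True
    return out
```

On a stream with one jump and one jump target, this emits exactly one flush before each, mirroring how the baseline compiler spills at control-flow edges without ever needing the nesting structure.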

Okay, up-front caveats:

  1. This isn't an alternate universe; we're stuck with making backwards-compatible extensions to the existing WebAssembly.
  2. SpiderMonkey's baseline compiler is just one implementation, and it's possible that it's suboptimal with respect to register allocation: that if it were a tiny bit smarter, the runtime benefit would outweigh the compile-time cost.
  3. Even if baseline compilers don't need additional information, optimizing compilers may need it for fast SSA construction.

With those in mind, the above thought experiment strengthens my belief that baseline compilers do not need structured control flow. Regardless of how low-level a construct we add, as long as it includes basic information like which instructions are jump targets, baseline compilers can handle it with only minor changes. Or at least this one can.

@conrad-watt @comex

Those are very good points! My intuition about baseline compilers may well be wrong then.

And @comex - yes, as you said, this discussion is separate from optimizing compilers where SSA may benefit from the structure. Maybe worth quoting a bit from one of the links from before:

By design, transforming WebAssembly code into TurboFan’s IR (including SSA-construction) in a straightforward single pass is very efficient, partially due to WebAssembly’s structured control flow.

@conrad-watt I definitely agree we just need to get direct feedback from VM people, and keep an open mind. To be clear, my goal here isn't to stop anything. I commented here at length because several comments seemed to think that wasm's structured control flow was an obvious mistake or one that should obviously be remedied with funclets/multiloop - I just wanted to present the history of the thinking here, and that there were strong reasons for the current model, so it may not be easy to improve on.

I've really enjoyed reading this conversation. I've wondered a bunch of these questions myself (coming from both directions), and shared many of these thoughts (again from both directions), and the discussion has offered a lot of useful insights and experiences. I'm not sure I have a strong opinion yet, but I have a thought to contribute in each direction.

On the "for" side, it's useful to know up front which blocks have backedges. A streaming compiler can track properties that are not apparent in WebAssembly's type system (e.g. the index in local i is within the bounds of the array in local arr). When jumping forward, it can be useful to annotate the target with what properties hold at that point. That way when a label is reached, its block can be compiled using the properties that hold across all the in-edges, say to eliminate array-bounds checks. But if a label can potentially have an unknown backedge, then its block can't be compiled with this knowledge. Of course, a non-streaming compiler can do some more significant loop-invariant analysis, but for a streaming compiler it's useful to not have to worry about what might be ahead.

(Side thought: @Vurich mentions that WebAssembly is not a stack machine due to its use of locals. In #1381 I laid out some reasons for relying less on locals and adding more stack operations. Making register allocation easier seems to be another reason in that direction.)
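To make the backedge point concrete with an invented example: the toy streaming pass below elides redundant bounds checks, but must discard its accumulated facts at any label that might receive a backedge. The instruction names and the "bounds_ok" fact are assumptions for illustration, not real Wasm semantics.

```python
def stream_compile(instrs, labels_with_backedges):
    """Toy streaming pass: count how many bounds checks must actually be
    emitted, eliding those whose fact already holds on every path here."""
    checks_emitted = 0
    facts = set()
    for op in instrs:
        if op[0] == "label":
            if op[1] in labels_with_backedges:
                # A backward branch could arrive here from code not yet
                # seen, so nothing established so far is guaranteed.
                facts.clear()
        elif op[0] == "check_bounds":
            if "bounds_ok" not in facts:
                checks_emitted += 1  # must emit a real check
                facts.add("bounds_ok")
            # else: the check is elided
    return checks_emitted
```

If the compiler knows up front that a label has no backedges (as structured control flow guarantees, and as a forward-declared multiloop could), the second check on a straight-line path can be elided; with a potentially unknown backedge, it cannot.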

On the "against" side, so far the discussion has focused on only local control. That's fine for C, but what about for C++ or various other languages with similar exceptions? What about languages with other forms of non-local control? Things with dynamic scope are often inherently structured (or at least I don't know of any examples of mutually recursive dynamic scopes). I think these considerations are addressable, but you'd have to design something with them in mind for the result to be usable in these settings. This is something I've been pondering, and I'm happy to share my thoughts-in-progress (looking roughly like an extension of @conrad-watt's multi-loop) with anyone who's interested (although here seems off topic), but I wanted to at least give a heads up that there's more than just local control flow to keep in mind.

(I'd also like to throw in another +1 for hearing more from VM people, even though I think @kripken has been doing a great job representing the considerations.)

When I say Lightbeam produces an internal IR, that’s really quite misleading and I should have clarified. I was working on the project for a while and sometimes you can get tunnel vision. Basically, Lightbeam consumes the input instruction by instruction (it actually has a maximum of one-instruction lookahead but that’s not particularly important), and for each instruction it produces, lazily and in constant space, a number of internal IR instructions. The maximum number of instructions per Wasm instruction is constant and small, something like 6. It’s not creating a buffer of IR instructions for the whole function and working on that. Then, it reads those IR instructions one by one. You can really just think of it as having a library of more-generic helper functions that it implements each Wasm instruction in terms of, I just refer to it as an IR because that helps explain how it has a different model for control flow etc. It probably doesn’t produce code as fast as V8 or SpiderMonkey's baseline compilers but that’s because it’s not fully optimised and not because it’s architecturally deficient. My point is that I internally model Wasm's hierarchical control flow as if it were a CFG, rather than actually producing a buffer of IR in memory in the way that LLVM or Cranelift does.

Another option is to compile the wasm to something that you believe can handle control flow optimally, that is, "undo" the structuring. LLVM should be able to do that, so running wasm in a VM that uses LLVM (like WAVM or wasmer) or through WasmBoxC could be interesting.

@kripken Unfortunately, LLVM does not seem to be able to undo the structuring yet. The jump threading optimization pass should be able to do this, but does not recognize this pattern yet. Here's an example showing some C++ code that mimics how the relooper algorithm would convert a CFG to a loop+switch. GCC manages to "dereloop" it, but clang does not: https://godbolt.org/z/GGM9rP
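For concreteness, here is roughly the shape of that pattern in a hypothetical Python rendering (the actual godbolt example is C++, and the block names and computation here are invented): a CFG flattened into a dispatch loop over a label variable, where each assignment to `label` stands in for what would be a direct goto.

```python
def run_cfg(n):
    """Relooper-style fallback: blocks A, B, C dispatched via a label
    variable inside a loop, instead of direct jumps between them."""
    label = "A"
    acc = 0
    while True:
        if label == "A":
            acc += 1
            label = "B" if n > 0 else "C"
        elif label == "B":
            acc += n
            n -= 1
            label = "A"  # in goto form this would be a direct backedge to A
        elif label == "C":
            return acc
```

"Dereloopering" means turning those label assignments back into direct branches; per the godbolt link, GCC manages this but clang's jump threading currently does not recognise the pattern.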

@AndrewScheidecker Interesting, thanks. Yeah, this stuff can be pretty unpredictable, so there may be no better option than investigating the emitted code (as the "Not So Fast" paper linked earlier does), and avoiding attempted shortcuts like relying on LLVM's optimizer.

@comex

SpiderMonkey's baseline compiler is just one implementation, and it's possible that it's suboptimal with respect to register allocation: that if it were a tiny bit smarter, the runtime benefit would outweigh the compile-time cost.

It could clearly be smarter about register allocation. It spills indiscriminately at control flow forks, joins, and before calls, and could maintain more information about the register state and try to keep values in registers longer / until they are dead. It could pick a better register than rax for value results from blocks, or better, not use a fixed register. It could statically dedicate a couple of registers to hold local variables; a corpus analysis I did suggested that just a few integer and FP registers would be enough for most functions. It could be smarter about spilling in general; as it is, it panic-spills everything when it runs out of registers.

The compile-time cost of this is chiefly that each control flow edge will have a non-constant amount of information associated with it (the register state) and this may lead to more pervasive use of dynamic storage allocation, which the baseline compiler has so far avoided. And of course there will be a cost associated with processing that variable-size information at each join (and other places). But there's already some non-constant cost since the register state has to be traversed to generate spill code, and by and large there may be few values live, so this may be OK (or not). Of course, being smarter with the regalloc may or may not pay off on modern chips, with their fast caches and out-of-order execution...

A more subtle cost is maintainability of the compiler... it is already quite complex, and since it is one-pass and does not build an IR graph or use dynamic memory at all, it is resistant to layering and abstraction.

@RossTate

Re funclets / gotos, I skimmed the funclet spec the other day and at first glance it did not look like a one-pass compiler should have any real problems with it, certainly not with a simplistic regalloc scheme. But even with a better scheme it might be OK: The first edge to reach a join point would get to decide what the register assignment is, and other edges would have to conform.

@conrad-watt as you just mentioned in the CG meeting, I think we'd be very interested in seeing details on what your multi-loop would look like.

@aardappel yeah, life has come at me fast, but I should do this in the next meeting. Just to emphasise that the idea isn't mine since @rossberg originally sketched it in response to the first draft of funclets.

One reference that might be instructive is a bit dated, but it generalizes familiar notions of loops to handle irreducible ones using DJ graphs.

We've had a couple of discussion sessions about this in the CG, and I've written up a summary and follow-up document. Because of the length I've made it a separate gist.

https://gist.github.com/conrad-watt/6a620cb8b7d8f0191296e3eb24dffdef

I think the two immediate actionable questions (see the follow-up section for more details) are:

  • Can we find "wild" programs which are currently suffering and would benefit performance-wise from multiloop? These may be programs for which LLVM transformations introduce irreducible control flow even if it does not exist in the source program.
  • Is there a world where multiloop is implemented producer-side first, with some linking/translation deployment layer for "Web" Wasm?

There is probably also a more free-wheeling discussion to be had on the consequences of the exception handling issues I discuss in the follow-up document, and of course standard bikeshedding about semantic details if we move forward with anything concrete.

Because these discussions may branch somewhat, it may be appropriate to spin some of them into issues in the funclets repository.

I am very happy to see progress on this issue. A huge "Thank you" to all people involved!

Can we find "wild" programs which are currently suffering and would benefit performance-wise from multiloop? These may be programs for which LLVM transformations introduce irreducible control flow even if it does not exist in the source program.

I'd like to caution a bit against circular reasoning: Programs that currently have bad performance are less likely to occur "in the wild" for exactly this reason.

I think most Go programs should benefit a lot. The Go compiler either needs WebAssembly coroutines or multiloop to be able to emit efficient code that supports Go's goroutines.

Precompiled regular-expression matchers, along with other precompiled state machines, often result in irreducible control flow. It's hard to say whether or not the "fusion" algorithm for Interface Types will result in irreducible control flow.
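To illustrate the state-machine point with an invented example: the matcher below simulates a two-state loop that can be entered at either state depending on the first character. A precompiled, goto-based version is a loop with two entry points, the textbook shape of irreducible control flow; this Python rendering has to fall back to a state variable, which is exactly the kind of flattening under discussion.

```python
def match_ab_alternation(s):
    """Accept strings that strictly alternate 'a' and 'b' ('abab...',
    'baba...'). A goto-based matcher would jump directly into the
    S1 <-> S2 loop at either state."""
    i = 0
    # Two distinct entry points into the loop, chosen by the first character:
    state = "S1" if s[:1] == "a" else "S2"
    while i < len(s):
        if state == "S1":   # S1 expects 'a'
            if s[i] != "a":
                return False
            i += 1
            state = "S2"
        else:               # S2 expects 'b'
            if s[i] != "b":
                return False
            i += 1
            state = "S1"
    return True
```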

  • Agree this discussion should be moved to issues on the funclets (or a new) repo.
  • Agree that finding programs that would benefit from it is hard to quantify without having LLVM (and Go, and others) actually emit the most optimal control flow (which may be irreducible). The inefficiency caused by FixIrreducibleControlFlow and friends may be a "death by a thousand cuts" problem across a large binary.
  • While I would welcome a tools-only implementation as the absolute minimum progress coming out of this discussion, it would still not be optimal, as producers would then have the tough choice of making use of this functionality for convenience (but then facing unpredictable performance regressions/cliffs), or doing the hard work to wrangle their output into standard wasm so that things stay predictable.
  • If it were decided that "gotos" are at best a tools-only feature, I'd argue that you could probably get away with an even simpler feature than multiloop, since all you care about is producer convenience. At the absolute minimum, a goto <function_byte_offset> would be the only thing that needs to be inserted into regular Wasm function bodies to allow either WABT or Binaryen to transform it into legal Wasm. Things like type signatures are useful if engines need to verify a multiloop quickly, but if it's a convenience tool, we might as well make it maximally convenient to emit.

Agree that finding programs that would benefit from it is hard to quantify without having LLVM (and Go, and others) actually emit the most optimal control flow (which may be irreducible).

I agree that testing on modified toolchains + VMs would be optimal. But we can compare current wasm builds to native builds which do have optimal control flow. Not So Fast and others have looked at this in various ways (performance counters, direct investigation) and have not found irreducible control flow to be a significant factor.

More specifically, they didn't find it to be a significant factor for C/C++. That might have more to do with C/C++ than with the performance of irreducible control flow. (I honestly don't know.) It sounds like @neelance has reason to believe the same would not be true for Go.

My sense is that there are multiple facets to this problem, and its worthwhile tackling it through multiple directions.

First, it sounds like there's a general issue with the generatability of WebAssembly. Much of that is caused by WebAssembly's constraint to have a compact binary with efficient type-checking and streaming compilation. We could address this issue at least partly by developing a standardized "pre"-WebAssembly that is easier to generate but which is guaranteed to be translatable to "true" WebAssembly, ideally through just code duplication and insertion of "erasable" instructions/annotations, with at least some tool providing such translation.

Second, we can consider what features of "pre"-WebAssembly are worth directly incorporating into "true" WebAssembly. We can do this in an informed manner because we will have "pre"-WebAssembly modules that we can analyze before they have been contorted into "true" WebAssembly modules.

Some years ago I tried compiling a particular bytecode emulator for a dynamic language (https://github.com/ciao-lang/ciao) to WebAssembly, and the performance was far from optimal (sometimes 10 times slower than the native version). The main execution loop contained a large bytecode dispatch switch, the engine had been finely tuned for decades to run on actual hardware, and we make heavy use of labels and gotos. I wonder if this kind of software would benefit from support for irreducible control flow, or if the problem was another one. I didn't have time to do further investigation, but I'd be happy to try again if things are known to have improved. Of course I understand that compiling other languages' VMs to wasm is not the main use case, but it'd be good to know if this will eventually be feasible, especially since universal binaries that run efficiently, everywhere, are one of the promised advantages of wasm. (Thanks, and apologies if this particular topic has been discussed in some other issue.)

@jfmc My understanding is that, if the program is realistic (i.e. not contrived in order to be pathological) and you care about its performance, then it is a perfectly valid use case. WebAssembly aims to be a good general-purpose target. So I think it would be great to gain an understanding of why you saw such significant slowdown. If that happens to be due to restrictions on control flow, then that would be very useful to know in this discussion. If it happens to be due to something else, then that would still be useful to know for how to improve WebAssembly in general.
