Design: Can WebAssembly.Instance recompile an already-compiled Module?

Created on 27 Oct 2016  ·  94 Comments  ·  Source: WebAssembly/design

Say I have this code:

let descriptor = /* ... */;
let memories = Array.apply(null, {length: 1024}).map(() => new WebAssembly.Memory(descriptor));
let instance = fetch('foo.wasm')
  .then(response => response.arrayBuffer())
  .then(buffer => WebAssembly.compile(buffer))
  .then(module => new WebAssembly.Instance(module, { memory: memories[1023] }));

Is WebAssembly.Instance allowed to block for a substantial amount of time? Could it, for example, recompile the WebAssembly.Module?

In most cases I'd say no, but what if the already-compiled code doesn't particularly like the memory it receives? Say, because that memory is a slow-mode memory and the code was compiled assuming fast-mode. Maybe memories[0] was a fast-mode memory, but memories[1023] sure won't be.

What about this code instead:

let instances = [0,1,2,3,4,5,6,7].map(v => fetch(`foo${v}.wasm`)
  .then(response => response.arrayBuffer())
  .then(buffer => WebAssembly.compile(buffer))
  .then(module => new WebAssembly.Instance(module)));

Are those calls to WebAssembly.Instance allowed to cause recompilation?

Assuming the above makes sense, here are a few related questions:

  • Do we want a promise-returning async function which can compile _and_ instantiate? I'm not saying we should drop any of the synchronous and asynchronous APIs we already have; I'm proposing a new asynchronous API.
  • How does a browser expose that compiled code in a WebAssembly.Module is fast, and that a WebAssembly.Memory instance is suitable for such fast code? Right now the answer seems to be "try it and see if you can notice".
  • How does a user know how many WebAssembly.Memory instances they're allowed before they get slow code (counting the implicit ones, e.g. as created by the second example)?
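For concreteness, here is a sketch of what such a combined compile-and-instantiate call could look like, using the WebAssembly.instantiate name this thread later converges on. The bytes below are just the minimal valid (empty) module, so the snippet is runnable; real imports would carry the Memory:

```javascript
// Minimal valid module: the wasm magic number ("\0asm") plus version 1.
const bytes = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

// One async step: the engine sees the import object (including any
// Memory) before it generates code, so it can specialize up front.
WebAssembly.instantiate(bytes, { /* imports, e.g. a Memory */ })
  .then(({ module, instance }) => {
    console.log(instance instanceof WebAssembly.Instance); // true
  });
```

Because the engine gets the bytecode and the instantiation parameters together, it never has to guess which kind of memory the code will run against.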
JS embedding

Most helpful comment

@kgryte I should have clarified that my comment pertained primarily to the browser as an execution context. We've landed on an API surface that still exposes the synchronous APIs. Browsers may impose a size limit on modules passed to the synchronous APIs (Chrome already does, for example), but that limit is configurable by the embedder and should not need to apply to Node.

All 94 comments

Would be nice for WebAssembly.Instance to sometimes cause recompilation; this way immutable global vars could be constant folded in the generated code. For instance, Emscripten generates relocatable code by offsetting all pointers to static data. The offset is passed in as an immutable global var when the module is instantiated. If WebAssembly.Instance can recompile, it could specialize the generated code.

The specification doesn't define what "compilation" is, nor would it make sense for it to do so, because implementation approaches may differ wildly (including interpreters). So it cannot have any normative say about this either way. The best we could do is to add a note that WebAssembly.Instance is expected to be "fast".

Agreed this would be a non-normative note at most.

In SM, we are also currently intending for instantiation to never recompile so that there is a predictable compilation cost model for devs (in particular, so that devs can use WebAssembly.compile and IDB to control when they take the compilation hit). Instantiation-time recompilation from within the synchronous Instance constructor would certainly break that cost model and could lead to major jank.
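The cost model described here can be sketched as: pay the compilation hit once, asynchronously, then instantiate cheaply and synchronously as often as needed. The bytes below are the minimal empty module, used only to make the snippet self-contained:

```javascript
const bytes = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

// Pay the compilation hit once, off the critical path.
WebAssembly.compile(bytes).then(compiledModule => {
  // A real app could structured-clone compiledModule into IndexedDB here,
  // so later page loads skip compilation entirely.
  const a = new WebAssembly.Instance(compiledModule, {});
  const b = new WebAssembly.Instance(compiledModule, {}); // cheap: no recompile
  console.log(a !== b); // true: two instances share one compiled Module
});
```

Instantiation-time recompilation would break exactly this model: the second new Instance call could suddenly cost as much as the compile step.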

But I do appreciate that separate compilation is fundamentally at odds with a variety of optimizations one might want to do to specialize generated code to ambient parameters. Fusing compilation and instantiation into one async op makes sense and is something we've considered in the past. The downside, of course, is that this inhibits explicit caching (there is no Module), so the developer has to make an unpleasant tradeoff. Some options:

  • The impl could do implicit content-addressed caching (which could include ambient parameters in the key), like we do with asm.js currently in FF. This would be kindof a pain and has all the predictability/heuristic problems of any implicit cache.
  • We could create a new way (e.g., a new WebAssembly.Cache API) where you pass in bytecode and instantiation parameters and get back a Promise<Instance>.

The latter intrigues me and could provide a much nicer developer experience than using IDB and perhaps a chance to even further optimize caching (since the cache is specialized to purpose), but it's certainly a big feature and something we'd want to take some time to consider.

@rossberg-chromium I seem to have explained my purpose badly: I don't want to quibble over what the spec says. I'm trying to point out what seems like a serious surprise for developers, hiding under the API. A developer won't expect .compile's result to be re-compiled. That seems like a design flaw to me.

@lukewagner even with implicit or explicit caching we may have the same issue: how many WebAssembly.Memory objects can be created in the same address space / origin is a browser limitation. I like what you're suggesting, but I think it's orthogonal to the issue. Let me know if I've misunderstood what you suggest.

Maybe .compile and Module could be given a Memory, and Instance has a .memory property which can be passed to other compilations / instantiations?

I'm not trying to eliminate the possibility of re-compilation, I think we rather want a common idiomatic API usage which has perfect information w.r.t. Memory at first-compile-time (or at cache-retrieval time) so that the compilation emits bounds checks or not if needed.

@jfbastien With implicit/explicit caching that was provided the particular instantiation parameters (so Memory), I don't see how there would be any need for recompilation.

@jfbastien With implicit/explicit caching that was provided the particular instantiation parameters (so Memory), I don't see how there would be any need for recompilation.

There may be:

  1. Create many Memorys.
  2. Compile code, with explicit (slow) bounds checks because there were too many Memorys.
  3. Cache that code.
  4. Leave page.
  5. Load page again.
  6. Allocate only one Memory, which gets the fast version.
  7. Get from the cache.
  8. Receive slow code Instance.

At this point I agree you don't _need_ recompilation, but we're being a bit silly if we do slow bounds checks when we don't need to.

As I said: I like this Cache API you propose, I think it makes WebAssembly more usable, but I think the problem is still there. 😢

Well that's my point about having an enhanced cache that accepts instantiation parameters and bytecode: the cache is free to recompile if what it has cached doesn't match the instantiation parameters. So the steps would just be:

  1. create many Memorys
  2. request an Instance from the cache, passing one of those (slow) Memorys
  3. slow-code is compiled, cached and returned as an Instance
  4. leave page
  5. load page again
  6. allocate only one Memory
  7. request an Instance from the cache, passing the fast Memory
  8. fast-code is compiled, cached and returned as an Instance

and after step 8, all future page loads will get cached fast or slow code.
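A toy in-memory mock may make the keying idea concrete. WebAssembly.Cache was never specified, so the class name, the memoryMode parameter, and the keying scheme below are all illustrative assumptions:

```javascript
// Illustrative only: no WebAssembly.Cache API exists. The idea is that
// the cache keys on both the bytecode and the instantiation parameters,
// so a memory-mode mismatch triggers a fresh specialized compile instead
// of handing back mismatched code.
class ToyModuleCache {
  constructor() {
    this.entries = new Map();
  }

  // `memoryMode` stands in for whatever the engine would derive from the
  // instantiation parameters, e.g. "fast" (4GB reservation) vs "slow".
  instantiate(bytes, imports, memoryMode) {
    const key = memoryMode + ':' + Array.from(bytes).join(',');
    let compiled = this.entries.get(key);
    if (!compiled) {
      // First visit (or mode mismatch): compile code for this mode.
      compiled = WebAssembly.compile(bytes);
      this.entries.set(key, compiled);
    }
    // Given a Module, WebAssembly.instantiate resolves to an Instance.
    return compiled.then(m => WebAssembly.instantiate(m, imports));
  }
}
```

After the page reload in the scenario above, the "fast" key simply misses, a fast-mode compile gets cached alongside the slow one, and later loads hit whichever mode they ask for.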

@lukewagner First of all, you're proposing a mitigation that flies in the face of the stated goal of WebAssembly providing deterministic performance. The difference between slow and fast was last quoted as around 20%, so it would really stink if a spec that painstakingly aims for deterministic perf drops it on the floor because of an API quirk. I don't buy that the browser having a content-addressed cache is the right solution, because the spec already goes to a lot of trouble elsewhere to obviate the need for profile-recompile-cache optimizations. For example, we promisify compilation precisely so that the app can get reasonable behavior even if the code is not cached. If the way this is spec'd necessitates that all of us implement caches or other mitigations, then we will have failed our goal of giving people a reasonably portable cost model.

To me the issue is just this: one of the optimizations that we will all effectively have to do for competitive reasons (the 4GB virtual memory bounds checking, which I'll just call the 4GB hack) cannot be done in the current spec without sacrificing one of these things:

  • You can get away with it if you always allocate 4GB of virtual memory for any wasm memory. This will discourage people from using WebAssembly for small modules, because you will hit virtual memory allocation limits, virtual memory fragmentation, or other problems if you allocate many of these. I also fear that if you allow allocating a lot of them then you will reduce the efficacy of security mitigations like ASLR. Note that existing APIs don't share this danger, since they commit the memory they allocate and they will either OOM or crash before letting you allocate much more than what physical memory allows.
  • You can get away with it if you allow for a recompile when you find a mismatch during instantiation (compiled code wants 4GB hack but the memory doesn't have that virtual memory allocation). You could also get away with it if instantiation moved the memory into a 4GB region, but see the previous point. So, probably anytime this happens, it'll be a P1 bug for the browser that encountered it.

I think that this means that the spec will encourage vendors to converge to just allowing 4GB reservations anytime wasm memory is allocated, or to have cache/lazy-compile/profile optimizations to detect this.

Finally, I don't understand the point about making any of this non-normative. This can be normative, because we could make the API preclude the possibility of the browser having to compile something without knowing what kind of memory it will have. I imagine that there are many ways to do this. For example, instantiating could return a promise and we could remove the separate compile step. This would make it clear that instantiation is the step that could take a while, which strongly implies to the client that this is the step that does the compilation. In such an API, the compiler always knows if the memory it's compiling for has the 4GB hack or not.

It's sad that we're only noticing this now, but I'm surprised that you guys don't see this is a bigger issue. Is there some mitigation other than caching that I'm overlooking?

@jfbastien in your motivating scenario, you pointed out that the module was authored to prefer fast memory. I'm assuming you're primarily chasing enabling the fast memory optimization when a particular module wants it, and might be OK with not doing it when the module doesn't want it (nothing bad with opportunistically stumbling upon it in that case, too, just trying to tease apart priorities).

If so, how would these alternatives to caching or to async Instantiate feel like:

  1. Module author must require 4GB as min/max memory
  2. A variant of compile (async at least, maybe also sync) that produces an instance accepting only fast memory.

For the issue of the "4GB hack" and mismatches between memory using it and code expecting it, would it make sense for compilation to internally emit two versions of the code? (Obviously this would use more memory, which is sad, but hopefully compile time wouldn't be much worse; the writer could generate both at once?)

@mtrofin I don't think it makes sense to ask for 4GiB if you don't intend to use it. The virtual allocation is separate from the use intent, so I think we'd need to separate both.

On 2.: it still isn't super helpful to the developer: if they use that variant and it fails, then what?

@kripken I don't think double compilation is a good idea.

@kripken I think that's what we would do without any other resolution to this issue.

I want WebAssembly to be great in the case of casual browsing: you tell me about a cool thingy, send me the URL, I click it, and I amuse myself for a few minutes. That's what makes the web cool. But that means that many compiles will be of code that isn't cached, so compile time will play a big part in a user's battery life. So, double-compile makes me sad.

@mtrofin

Module author must require 4GB as min/max memory

That's not really practical, since many devices don't have 4GB of physical memory. Also, that's hard to spec.

A variant of compile (async at least, maybe also sync) that produces an instance accepting only fast memory.

I don't think we want double compiles.

@pizlonator Thus far, we haven't considered designs which required different modes of codegen: we've just always allocated 4gb regions on 64-bit and observed this to succeed for many many thousands of memories on Linux, OSX and Windows. We have a conservative upper bound to prevent trivial total exhaustion of available address space which I expect will be sufficient to support the many-small-libraries use case. So I think the new constraint we're addressing here is that iOS has some virtual address space limitations which could reduce the number of 4gb allocations.

So one observation is that a large portion of the bounds-check-elimination allowed by the 4gb hack can be avoided by just having a small guard region at the end of wasm memory. Our initial experiments show that basic analyses (nothing to do with loops, just eliminating checks on loads/stores w/ the same base pointer) can already eliminate roughly half of bounds checks. And probably this could get better. So the 4gb hack would be a more modest, and less necessary, speedup.

Another idea I had earlier would be to pessimistically compile code with bounds checks (using elimination based on the guard page) and then nop them out when instantiating with a fast-mode-memory. Combined, the overhead could be pretty small compared to idealized fast-mode code.

@lukewagner

Thus far, we haven't considered designs which required different modes of codegen: we've just always allocated 4gb regions on 64-bit and observed this to succeed for many many thousands of memories on Linux, OSX and Windows. We have a conservative total number to prevent trivial total exhaustion of available address space which I expect will be sufficient to support the many-small-libraries use case. So I think the new constraint we're addressing here is that iOS has some virtual address space limitations which could reduce the number of 4gb allocations.

This isn't an iOS-specific problem. The issue is that if you allow a lot of such allocations then it poses a security risk because each such allocation reduces the efficacy of ASLR. So, I think that the VM should have the option of setting a very low limit for the number of 4GB spaces it allocates, but that implies that the fall-back path should not be too expensive (i.e. it shouldn't require recompile).

What limit do you have on the number of 4GB memories that you would allocate? What do you do when you hit this limit - give up entirely, or recompile on instantiation?

So one observation is that a large portion of the bounds-check-elimination allowed by the 4gb hack can be avoided by just having a small guard region at the end of wasm memory. Our initial experiments show that basic analyses (nothing to do with loops, just eliminating checks on loads/stores w/ the same base pointer) can already eliminate roughly half of bounds checks. And probably this could get better. So the 4gb hack would be a more modest, and less necessary, speedup.

I agree that analysis allows us to eliminate more checks, but the 4GB hack is the way to go if you want peak perf. Everyone wants peak perf, and I think it would be great to make it possible to get peak perf without also causing security problems, resource problems, and unexpected recompiles.

Another idea I had earlier would be to pessimistically compile code with bounds checks (using elimination based on the guard page) and then nop them out when instantiating with a fast-mode-memory. Combined, the overhead could be pretty small compared to idealized fast-mode code.

Code that has bounds checks is best off pinning a register for the memory size and pinning a register for memory base.

Code that uses the 4GB hack only needs to pin a register for memory base.

So, this isn't a great solution.

Besides the annoyance of having to wrangle the spec and implementations, what are the downsides of combining compilation and instantiation into one promisified action?

The issue is that if you allow a lot of such allocations then it poses a security risk because each such
allocation reduces the efficacy of ASLR.

I'm not an expert on ASLR but, iiuc, even if we didn't have a conservative bound (that is, if we allowed you to keep allocating until mmap failed because the kernel hit its number-of-address-ranges max), only a small fraction of the entire 47-bit addressable space would be consumed, so code placement would continue to be highly random over this 47-bit space. IIUC, ASLR code placement isn't completely random either; just enough to make it hard to predict where anything will be.

What limit do you have on the number of 4GB memories that you would allocate? What do you do
when you hit this limit - give up entirely, or recompile on instantiation?

Well, since it's from asm.js days, only 1000. Then the memory allocation just throws. Maybe we'll need to bump this, but even with many super-modularized apps (with many separate wasm modules each) sharing the same process, I can't imagine we'd need too much more. I think Memory is different than plain old ArrayBuffers in that apps won't naturally want to create thousands.

Besides the annoyance of having to wrangle the spec and implementations, what are the downsides
of combining compilation and instantiation into one promisified action?

As I mentioned above, adding a Promise<Instance> eval(bytecode, importObj) API is fine, but now it places the developer in a tough spot because now they have to choose between a perf boost on some platforms vs. being able to cache their compiled code on all platforms. It seems we need a solution that integrates with caching and that's what I was brainstorming above with the explicit Cache API.

New idea: what if we added an async version of new Instance, say WebAssembly.instantiate and, as with WebAssembly.compile, we say that everyone is supposed to use the async version? This is something I've been considering _anyway_ since instantiation can take a few ms if patching is used. Then we say in the spec that the engine can do expensive work in either compile or instantiate (or neither, if an engine does lazy validation/compilation!).

That still leaves the question of what to do when a compiled Module is stored in IDB, but that's just a hard question when there are multiple codegen modes _anyway_. One idea is that Modules that are stored-to or retrieved-from IDB hold onto a handle to their IDB entry and they add new compiled code to this entry. In that way, the IDB entry would lazily accumulate one or more compiled versions of its module and be able to provide whichever was needed during instantiation.

The IDB part is a bit more work, but that seems pretty close to ideal, performance-wise. WDYT?

I think adding async instantiate makes sense, but I'd also add a Memory parameter to compile. If you pass a different memory to instantiate then you can get recompilation; otherwise you've already "bound" the memory when compiling.

I haven't thought about the caching enough to have a fully-formed opinion yet.

@lukewagner

I'm not an expert on ASLR but, iiuc, even if we didn't have a conservative bound (that is, if we allowed you to keep allocating until mmap failed because the kernel hit its number-of-address-ranges max), only small fraction of the entire 47-bit addressable space would be consumed so code placement would continue to be highly random over this 47-bit space. IIUC, ASLR code placement isn't completely random either; just enough to make it hard to predict where anything will be.

ASLR affects both code and data. The point is to make it more expensive for an attacker to weasel his way into a data structure without chasing a pointer to it. If the attacker can exhaust memory, he definitely has more leverage.

Well, since it's from asm.js days, only 1000. Then the memory allocation just throws. Maybe we'll need to bump this, but even with many super-modularized apps (with many seprate wasm modules each) sharing the same process, I can't imagine we'd need too much more. I think Memory is different than plain old ArrayBuffers in that apps won't naturally want to create thousands.

1000 seems like a sensible limit. I'll ask around with security folks.

As I mentioned above, adding a Promise<Instance> eval(bytecode, importObj) API is fine, but now it places the developer in a tough spot because now they have to choose between a perf boost on some platforms vs. being able to cache their compiled code on all platforms. It seems we need a solution that integrates with caching and that's what I was brainstorming above with the explicit Cache API.

Right. I can see a few ways that such an API could be made to work. A cheesy but practical API would be to overload eval:

  1. instancePromise = eval(bytecode, importObj)
  2. instancePromise = eval(module, importObj)

and then Instance has a getter:

module = instance.module

where module is structured-cloneable.

What do you think of this?

New idea: what if we added an async version of new Instance, say WebAssembly.instantiate and, as with WebAssembly.compile, we say that everyone is supposed to use the async version? This is something I've been considering anyway since instantiation can take a few ms if patching is used. Then we say in the spec that the engine can do expensive work in either compile or instantiate (or neither, if an engine does lazy validation/compilation!).

That still leaves the question of what to do when a compiled Module is stored in IDB, but that's just a hard question when there are multiple codegen modes anyway. One idea is that Modules that are stored-to or retrieved-from IDB hold onto a handle to their IDB entry and they add new compiled code to this entry. In that way, the IDB entry would lazily accumulate one or more compiled versions of its module and be able to provide whichever was needed during instantiation.

The IDB part is a bit more work, but that seems pretty close to ideal, performance-wise. WDYT?

Intriguing. Relative to my idea above:

Pro: yours is an easy to understand abstraction that is conceptually similar to what we say now.
Con: yours does not lead to as much synergy between what the user does and what the engine does as my proposal allows.

There are three areas where your proposal doesn't give the user as much control as mine:

  1. The expensive work could happen in one of two places, so the user has to plan for either of them being expensive. We will probably have web content that behaves badly if one of them is expensive, because it was tuned for cases where it happened to be cheap. My proposal has one place where expensive things happen, leading to more uniformity between implementations.
  2. There's no clearly guaranteed path for all versions of the compiled code to be cached. On the other hand, my use of threading the module through the API means that the VM can build up the module with more stuff each time, while still allowing the user to manage the cache. So, if the first time around we do 4GB then this is what we will cache, but if we fail to do 4GB the second time, we will be able to potentially cache both (if the user caches instance.module after every compile).
  3. Unusual corner cases in the browser or other issues could sometimes lead to a double compile in your scheme, because we'd compile one thing in the compile step but then realize we need another thing in the instantiation step. My version never requires a double compile.

So, I like mine better. That said, I think your proposal is a progression, so it definitely sounds good to me.

This issue rests upon how often fragmentation makes allocation of fast memory (btw, you'll need 4GB + maximum supported offset, or 8GB) fail. If the probability is way less than 1%, then it might not be entirely unreasonable to have that be an OOM situation.

In the scenario where the user is browsing around the web and using lots of little WASM modules in quick succession, presumably they aren't all live at once. In that case, a small cache of reserved 4GB chunks would mitigate the issue.

Another possible strategy is to generate one version of the code with bounds checks, and if fast memory is available, just overwrite the bounds checks with nops. That's ugly, but it's a heck of a lot faster than a recompile and takes less space than two compiles.

It's not just ASLR: it's also pagetable / allocator / etc pollution. We all need to talk to our security folks _as well as_ our kernel / systems folks. Or we can be up-front to developers about the limits each engine imposes on "fast" Memory, and make it idiomatic in the API so it's hard to use it wrong.

There's all these crutches we can use, such as nop or double compilation, but why even have crutches?

@jfbastien I don't think PROT_NONE memory costs page table entries; I think there is a separate data structure that holds mappings from which the page table is lazily populated.

@pizlonator I like that idea, and I can see that being what we encourage everyone to use by default in tutorials, toolchain, etc. It's also more succinct and easier to teach if you can simply ignore Module. This could also address @s3ththompson's concern about discouraging use of the sync APIs by making the nicest API the async one.

However, I think we shouldn't take away WebAssembly.compile and the Module constructor: I'm imagining scenarios where you have a "code server" (providing cross-origin code caching via IDB + postMessage; this has been specifically discussed with some users already) that wants to compile and cache code without having to "fake up" instantiation parameters. (There could also be some needless overhead (garbage, patching, etc) for an unnecessary instantiation.) And, for the same corner cases that want synchronous compilation (via new Module), we would need to keep new Instance.

So if agreed on that, then what this boils down to is a purely additive proposal of the two WebAssembly.eval overloads you mention. Yes?

One tweak, though: I think we shouldn't have a module getter since this would require the Instance to keep some internal data around (viz., bytecode) for the lifetime of the Instance; right now Module can usually be GCed immediately after instantiation. This would suggest either a data property (that the user can remove, although they will probably forget to), or maybe a third version of eval that returns an {instance, module} pair...

Having an async one step API as the recommended case for the typical monolithic app makes sense as the recommended pattern.

Agreed with @lukewagner that the all-sync (inline compile) case covered by new Module + new Instance is useful.
The background-compile (async) server with sync instantiate also seems useful.

Adding the two eval variants proposed seem an ok way to introduce this.

However I don't like the name, because it will be conflated in (security) folks' minds with JS eval (which it resembles in one way, but not in terms of scope capture).
How about WebAssembly.instantiate?

Hah, good point, eval does have a bit of a rep. +1 to WebAssembly.instantiate.

What would the guideline to the developer be wrt when to use the async instantiate?

@mtrofin To use WebAssembly.instantiate by default unless they had some special code-sharing/loading scheme that required compiling Modules independently of any particular use.

@lukewagner This seems reasonable.

Hah, good point, eval does have a bit of a rep. +1 to WebAssembly.instantiate.

Agreed.

So if agreed on that, then what this boils down to is a purely additive proposal of the two WebAssembly.eval overloads you mention. Yes?

That's what it sounds like.

I think we shouldn't have a module getter since this would require the Instance to keep some internal data around (viz., bytecode) for the lifetime of the Instance; right now Module can usually be GCed immediately after instantiation. This would suggest either a data property (that the user can remove, although they will probably forget to), or maybe a third version of eval that returns an {instance, module} pair...

Sure feels like a data property is better. Or having WebAssembly.instantiate always return an {instance, module} pair.

Is this correct: suppose you WebAssembly.instantiate with the goal of getting a fast-memory module variant. You now get the module and structure-clone it. Now this module is bound to needing to be instantiated with Memorys that support fast memory.

@pizlonator Yeah, I can bikeshed it in my head multiple ways. I think I like returning the pair a little better since it'll probably lead to less people accidentally entraining an unused Module.

@mtrofin Recompilation can still be necessary when you pluck the Module off one instantiate call and instantiate with new imports; I think the point of this API addition is that it won't be the common case and it will only happen when it's fundamentally necessary (i.e., you have 1 module accessing two kinds of memories).

This thread is getting long and looks like it's converging, but to be 100% sure we need to write up the code that we expect different users to write:

  1. Async instantiation of a single module.
  2. Async instantiation of a module, with memory sharing with other modules.
  3. Synchronous instantiation of a single module (I don't think synchronous multi-module is useful?).
  4. Caching for all of these (both putting into the cache, as well as retrieving and instantiating, with memory).
  5. Update of a single .wasm module, and cached loads of the other modules.

Anything else? It sounds like @lukewagner has ideas around imports which I don't fully grok.

That means that subsequent uses of this module must instantiate asynchronously, or risk blocking the UI thread with a surprisingly lengthy synchronous instantiate.

@jfbastien I'd want to understand for each snippet we expect developers to write, what would motivate them to go that particular path, and what information the developer must have available to make a decision.

@mtrofin Right, given a Module m, you'd call WebAssembly.instantiate(m) which is async. You _could_ call new Instance(m) and it might be expensive, but that's no different than new Module(m).

@jfbastien Assuming when you say "async instantiation" you mean "async compilation and instantiation", here's the short version:

  1. WebAssembly.instantiate(bytecode, imports)
  2. WebAssembly.instantiate(bytecode, imports), where imports includes the shared memory
  3. new Instance(new Module(bytecode), imports)
  4. In all cases you can get a Module, then you put that in an IDBObjectStore. Later, you get a Module m back and call WebAssembly.instantiate(m, imports).
  5. Nothing really special here: you WebAssembly.instantiate one module from bytecode and instantiate the rest from the Modules pulled from IDB.
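
The caching flow in (4) could be sketched roughly as below. The `cache` parameter stands in for a thin IndexedDB wrapper (its `get`/`set` helper names are made up for illustration); `fetchBytes` stands in for the XHR/fetch step.

```javascript
// Sketch of the caching flow in (4). `cache` is any object with
// promise-returning get(url) / set(url, module) methods -- e.g. a thin
// IndexedDB wrapper in a browser; the helper names here are hypothetical.
function instantiateCached(url, imports, cache, fetchBytes) {
  return cache.get(url).then(cachedModule => {
    if (cachedModule) {
      // Cache hit: instantiate the stored, already-compiled Module.
      // No recompile is expected when the imports match what it was
      // compiled for.
      return WebAssembly.instantiate(cachedModule, imports)
        .then(instance => ({module: cachedModule, instance}));
    }
    // Cache miss: compile + instantiate from bytes in one async call,
    // then store the Module (it structured-clones into IndexedDB).
    return fetchBytes(url)
      .then(bytes => WebAssembly.instantiate(bytes, imports))
      .then(result => cache.set(url, result.module).then(() => result));
  });
}
```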

Should we recommend using the sync instantiate if you feel you can use the sync compile, and async instantiate if you feel you should use the async compile?

Aside from that, I am concerned that the developer would now face a more complex system: more choices that expose optimizations we're planning to make, and I'm not sure the developer has the right information available to make the tradeoffs. Thinking from the developer's perspective, is there a smaller set of concerns they care about and would feel comfortable expressing? We talked at one point about developers having an "optimize at the expense of precise failure points" option (this was re. hoisting bounds checks). Would an alternative be an "optimize" flag?

@mtrofin 99% of what developers would write (or have generated for them by the toolchain) would be WebAssembly.instantiate. You'd only use sync APIs for special "I'm writing a JIT in wasm" and WebAssembly.compile if you're writing some code sharing system so I would think the "Getting Started" tutorials would exclusively cover WebAssembly.instantiate.

@lukewagner I notice you added imports to #3 new Module() above. I think that plus adding it to WebAssembly.compile is a good idea and rounds out the possibilities.
That way if you want to hint about the memory at compile time you can.
If you later instantiate again with different imports, especially synchronously, you may get a hiccup.

So summary of changes (just so I'm clear):

  • Add WebAssembly.instantiate(bytes, imports) returns promise of {instance:, module:}
  • Add WebAssembly.instantiate(module, imports) returns promise of {instance:, module:}
  • Change to new Module(bytes, imports) returns module
  • Change to WebAssembly.compile(bytes, imports) returns promise of module

State somewhere the expectation that instantiate will be fast if the imports passed to compile match those passed to instantiate.

WDYT?

Oh oops, I meant to put the imports as an arg to Instance. I'm not convinced it's necessary for Module or compile. [Edit: because if you had them, you'd just call instantiate]

So that would mean that for the end-to-end async case, you can know that you'll be binding to a 4GB hack memory, but not for a JITed filter kernel or a background compiled item (unless you also create a throw-away instance)?

+1 on focusing the guidance on the async pair of compile & instantiate - makes the message simple and hides the complexities of the decision problem from the developer.

Yeah I think we're all in agreement that we'd point folks at:
First time:
WebAssembly.instantiate(bytes, imports) -> promise of {module, instance} (cache module to indexeddb)
Second time:
WebAssembly.instantiate(module, imports) -> promise of {module, instance}

Any objections to that being the main pattern?

I'm torn on imports for compile / new Module. It seems like that could be a useful hint.
Though, I'd be open to mentioning it as a possibility and deferring adding that arg (it could be optional) to Post-MVP.

Thoughts?

@mtrofin (Well, technically, just instantiate.)

@lukewagner (I think that's what @mtrofin meant)

@lukewagner, @flagxor OK, but we're keeping the async compile API, right?

How about this scenario: you get an application like PhotoShop with tons of plugins. Each plugin is a wasm module. You start the main app and you manage to allocate the magic memory size that triggers fastmemory (seems reasonable for this scenario - one app, memory hungry).

You want to compile a number of the plugins in parallel so you fire off some workers to do that. You can't pass those compilations the actual memory you'll use (correct?). So, depending on defaults, you get slowmemory compilation for the plugins, which will then be followed by a costly slew of async recompilations for the fastmemory when the plugins get connected to the app.

If we buy this scenario, then it feels there may be value in passing some memory descriptor (to be clear, without actual backing memory) to the compile API.
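
The plugin scenario might look roughly like this (a sketch, not an endorsement of the current semantics: file names, the `env.memory` import name, and the worker protocol are all illustrative, and the worker source is shown inline as a string for brevity):

```javascript
// Worker side: compile off the main thread, then send the compiled
// WebAssembly.Module back via postMessage (structured clone).
const compileWorkerSource = `
  onmessage = e => {
    fetch(e.data.url)
      .then(r => r.arrayBuffer())
      .then(bytes => WebAssembly.compile(bytes))
      .then(module => postMessage(module));
  };
`;

// Main-thread side: instantiate each plugin Module against the app's
// single big Memory.
function loadPlugin(worker, url, memory) {
  return new Promise(resolve => {
    worker.onmessage = e =>
      // e.data is a compiled WebAssembly.Module. If it was compiled
      // assuming a different memory mode than `memory` actually uses,
      // this is where the costly recompile would happen.
      resolve(WebAssembly.instantiate(e.data, {env: {memory}}));
    worker.postMessage({url});
  });
}
```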

Yes, it should be possible (even encouraged) to pass Memory to the compilation.

@mtrofin Right, compile for advanced uses. I suppose that plugin example is a valid case where you'd want to _compile_, _and_ you have a Memory, but you don't want instantiate (yet).

@pizlonator Btw, I meant to ask earlier, assuming the "throw if more than 1000 4gb maps per process" hack is sufficient to address the ASLR/security concerns, is there _still_ a need for slow-mode/fast-mode due to platform virtual address quota restrictions? Because if there wasn't, it'd certainly be nice if this wasn't a performance consideration even for advanced users. (The instantiate APIs seem useful to add for the other reasons we mentioned, of course.)

There are also applications that might benefit from a memory size that is a power of two plus a spill area, where the application already masks pointers to remove tagging and so can mask off the high bits to avoid bounds checking. These applications need to reason about the memory allocation size they will receive before compilation, either by modifying the global constants used for masking or by baking suitable constants in while decompressing to wasm.
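
The masking trick described above can be illustrated with assumed sizes (a hypothetical 16 MiB region; real applications would pick their own power of two):

```javascript
// Illustration of mask-based bounds checking with a power-of-two memory:
// one AND replaces a compare-and-branch, because addr & MASK always lands
// inside the region. Sizes here are assumptions for the example.
const MEM_SIZE = 16 * 1024 * 1024; // must be a power of two
const MASK = MEM_SIZE - 1;
const heap = new Uint8Array(MEM_SIZE);

function maskedLoad(addr) {
  // Tag bits and out-of-range bits are stripped by the mask, so a wild
  // address wraps back into the region instead of trapping; no explicit
  // bounds check is required.
  return heap[addr & MASK];
}
```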

There is also the buffer-at-zero optimization that some runtimes might want to take advantage of, and there is only going to be one such buffer per process.

It would also make the platform more user friendly if it could reason about the required memory and available memory before compiling the application. For example, to allow the browser or app to inform the user that they need to close some tabs in order to run the application, or to run it without degraded capability.

A web browser might want to have a dedicated-app mode that is an option for users, where they might be running on a limited device and need all the memory and performance they can get just to run the one application well. For this it needs to be able to reason about the requirements early.

The memory should not need to be allocated before compilation, rather a reasoned reservation made. On a limited device, compilation alone might use a lot of memory, so even a large VM allocation might be a show stopper.

These are not new issues; they have been under discussion for years now. Memory resource management is needed, and it needs to be coordinated with code generation.

@lukewagner I think so, because if we limited to 1000 modules, then I'd worry that the tolerances aren't big enough.

  • I'd worry about an attack surfacing that needed this ceiling to go down.
  • I'd worry about optimizations elsewhere in the stack reducing the amount of virtual address space that is available to us, which then would require us to reevaluate whether the ceiling is low enough.
  • I'd worry about preventing programming styles that deliberately create 1000s of modules. For example, I know that most JavaScriptCore framework clients create a VM, do a tiny bit of work, and then destroy it. If WebAssembly is used the same way from JS as JSC is from Objective-C, then to make it work on 64-bit systems the GC would have to know that if you allocate 1000 memories - even if each was small - then you have to GC in case the next allocation should succeed on the grounds that those 1000 memories are now unreachable. The ability to allocate non-4GB-hack memories after there are already, say, 10 4GB-hack memories live would mean that the GC wouldn't have to change its heuristics very much. It wouldn't have to do a GC when you allocate the 1001st module in your instantiate->run->die loop. This would be a benefit to patterns that use a tiny memory. Anything below 1MB, and it starts to make sense to have 1000 of them. I can imagine people doing useful things in 64KB.
  • I'd worry about this being less useful to other JavaScript contexts. I want to leave the door open for JSC C API and Objective-C API clients to have access to WebAssembly API from their JS code. Those clients would probably prefer a small limit on the number of 4GB memories we allocate. It's tempting to even make that quota configurable in such a context.

I like that the improved API removes the need to have an artificial ceiling on the number of memories, or the need to recompile, or other undesirable things. I don't like artificial ceilings unless the tolerances are very large, and I don't think that's the case here.

@pizlonator Fair enough, and it's a cleaner/simpler API, so I think it's fine to add.

As for why I'm not concerned with those items you've mentioned (at this time):

  • The limit may well need to be raised; that's easy.
  • At whatever reasonable limit, only a small fraction of the total 64-bit address space will be used, so I'm not aware of what the attack vector here would be; determined content has many ways to OOM itself.
  • We bump the GC heuristics commensurate with the reservation size, and thus churning through Memorys just leads to more aggressive GC. More GC isn't great, but I'm not sure this will be a common pattern.

But who knows what we'll see in the future, so I suppose it's useful to have the flexibility built in now.

I think it's a bad idea to try to surface too much implementation detail in the js API (especially architecture/platform specific detail). I think it should be sufficient for implementations to have their own internal limits for fast memory.

Having an async instantiate function seems reasonable, but I think we should use something else if we want to give compiler hints. For example, we could extend modules to have a flags section that requests optimizing for a singleton instance, for many instances, for load time, for predictable perf, etc. Of course, what (if anything) engines do is totally implementation dependent, but it gives devs a knob to turn and competition will keep browser vendors honest.

On Fri, Oct 28, 2016 at 2:15 AM, JF Bastien wrote:

Yes, it should be possible (even encouraged) to pass Memory to the compilation.

I think that should rather be discouraged in favor of not importing/exporting memories at all. After all, if a module does not import or export memory, memory can be reserved at compilation time. I know we want to be able to efficiently handle the sophisticated module-fu that some applications will want to do, but I expect that monolithic WASM apps will be more common than we're anticipating. Maybe I'm in the minority here, but I'd rather see fewer modules with less dynamic binding.


I totally agree with you, I think monolithic apps are going to be the primary use case, at least for the immediate couple years after mvp. Worrying about what happens when you have 10s of thousands of modules loaded is assuming a lot about a wasm ecosystem that doesn't yet exist.

So what is left to do to resolve this issue? One thing would be to update http://webassembly.org/getting-started/js-api, which I can do. Another would be for binaryen to emit this by default (sound good @kripken?). @titzer does Canary implement WebAssembly.instantiate?

Anything else?

@lukewagner: not sure what you're asking, the discussion in this issue is very long. Do you mean to change binaryen from the current sync WebAssembly.Instance API it uses to use one of the new promise-based APIs proposed here? Is that necessary - are we removing the old way?

@kripken Right, switching to use WebAssembly.instantiate. We're not removing the old way, but the new way is more efficient and is meant to be used by default.

We could use the promise-based API when generating HTML, but many users generate JS and automatically adding async steps there isn't trivial. We can document this though for people. But I guess we should do all that only once this lands in all browsers.

@kripken I'm not sure I understand: the current approach is to use no async API at all?

Yes. See this issue for adding async stuff.

@kripken Node would have working promises, I assume. Even shells have a way to explicitly drain the promise queue (and thus execute all the resolutions synchronously); in SM it's drainJobQueue() and in V8 I hear there's a %RunMicroTasks(). Seems like you could simply feature test WebAssembly.instantiate and use it when present by default.
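
A feature test along those lines might look like the following sketch, falling back to the synchronous API on engines that shipped wasm before `WebAssembly.instantiate` existed (both calls are real API; only the wrapper name is made up):

```javascript
// Feature-test WebAssembly.instantiate and use it when present; otherwise
// fall back to the synchronous compile + instantiate, wrapped in a promise
// so callers see the same {module, instance} shape either way.
function instantiatePortable(bytes, imports) {
  if (typeof WebAssembly.instantiate === 'function') {
    return WebAssembly.instantiate(bytes, imports); // resolves to {module, instance}
  }
  // Older engine: synchronous path (may jank the main thread).
  const module = new WebAssembly.Module(bytes);
  return Promise.resolve({
    module,
    instance: new WebAssembly.Instance(module, imports),
  });
}
```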

Sure, but first, Promise support might be in the latest Node.js, but it isn't in commonly used versions (the default version in Linux distros, for example). And second, the bigger issue is that when we emit HTML, we have control over how the page is loaded (emcc emits the loading code for the user), while when we emit JS, the JS is assumed to just execute linearly, and users depend on that, e.g. they can have another script tag right after that JS. In that case the user writes their own loading code.

As a result of both of those, as mentioned earlier we can use Promise APIs when emitting HTML (then you are definitely not in the shell, and have control over loading), but not when emitting JS. There we can only document it.

Does Node have versions which support WebAssembly but not Promise? Do Node people care about these versions?

I don't understand the straightline JS thing. If you always return a Promise, won't that Just Work (users of the code need to consume the promise)?

I don't know the answer to the first question, but a polyfill might let wasm run on a node version that lacks promises. Although, it might be possible to polyfill promises too as node has had a version of setTimeout for a while now, I think, but not sure about that either.

About the straightline issue: emcc emits JS that sets up the runtime and connects up to the compiled code. Some JS in a script tag right after it might call into that compiled code, e.g. using ccall. In other words, emcc's JS output isn't a promise, so I'm not sure what you mean by "return a promise" - who would return it, and who would receive it? But in any case, as mentioned earlier, the recommended path is for emcc to emit HTML, in which case we can create async loading code. It's just that some users prefer to control loading more directly. We'll need to encourage them to use the async wasm stuff if it's better.

IMO Node with WebAssembly but without Promises isn't a design constraint we should worry about. A polyfill to WebAssembly is pretty silly in that context.

You're describing what the code does today. I don't fully understand it, but I'd like to back up: I want the code on web pages to use the Promise API. Emscripten is a producer of such code. I don't understand what's blocking it from emitting code which uses promises. I'm totally fine if you say "that's significant work because it's not how it works today, but we'll get to it". But from our discussion I'm not sure if that's what you're saying.

Is the issue you point out just about using the async APIs for caching? That's a start I guess, but the endstate I'd like to get to is where the async API is used even on first load.

Why is a polyfill to wasm silly on node? Still seems useful in that context, even if less than in other cases :)

Again: Emscripten will use the promise API when it emits HTML. And that is the recommended path. So the answer to your question is "yes". It is not significant work. It is sketched out in that issue, which yes, focuses on caching, but I added notes from the (old) offline discussion that while doing that we should also do a bunch of other async optimizations there, since we can and it's trivial.

Does that address your concerns?

All I am saying is that when Emscripten emits JS - the less-recommended path - then the guarantees about that output are not consistent with the code doing some async magic internally. We would be breaking people's code, which we don't want to do. As I said, someone can write JS that runs synchronously right after that JS that assumes it is ready. So we won't be able to use promises in that case. Imagine this:

```html
<script src="output.js"></script>
<script>
  doSomething(Module.my_func);
</script>
```

doSomething needs Module.my_func to exist. If output.js just returned a promise, then it doesn't exist yet. So this would be a breaking change.

Does that make sense now?

Why is a polyfill to wasm silly on node? Still seems useful in that context, even if less than in other cases :)

A polyfill to wasm isn't silly. Catering to Node installs that don't have wasm, polyfill it, and don't have promise but don't polyfill that, is silly. It's half-assed. They should get the other half of the ass 😁

Again: Emscripten will use the promise API when it emits HTML. And that is the recommended path. So the answer to your question is "yes". It is not significant work. It is sketched out in that issue, which yes, focuses on caching, but I added notes from the (old) offline discussion that while doing that we should also do a bunch of other async optimizations there, since we can and it's trivial.

Does that address your concerns?

OK that's good! As long as most web pages use the promise API, I'm happy.

[snip]

Does that make sense now?

Yes. Thanks for explaining.

From my POV though, Emscripten isn't just in the business of emitting JS anymore! Your example makes sense for Ye Olden Codes, but new stuff should assume promises IMO.

Btw, I looked at switching webassembly.org/demo to use instantiate and it's a bit tricky b/c the current synchronous new Instance occurs in a context that wants a synchronous result. So once we get Binaryen updated to emit instantiate by default, it'd be nice to rebuild the AngryBots demo from scratch.

Yeah, but note that a rebuild from scratch might not be enough - I believe Unity uses its own loading and HTML code. So we will need to document and communicate this issue as mentioned earlier so that they can do the necessary things (or maybe we can get them to let emcc emit the html, but I don't know if that's feasible for them).

Yeah, but note that a rebuild from scratch might not be enough - I believe Unity uses its own loading and HTML code. So we will need to document and communicate this issue as mentioned earlier so that they can do the necessary things (or maybe we can get them to let emcc emit the html, but I don't know if that's feasible for them).

Given the potential downsides of not using the WebAssembly.instantiate API, I think it's worth asking them to consider using it.

Are those downsides documented? Clear guidance on this on either the main wasm website or the emscripten wasm wiki page would be a convenient thing to point people to.

I just skimmed this long issue myself, and I'm not clear on the downsides yet, so I want to read that too :)

@kripken The challenge with the current code is that binaryen/emscripten only provide the necessary imports (needed as arguments to instantiate) at the same time as the exports are synchronously required. If the imports can be made available "up front", then it's quite trivial to tack on a WebAssembly.instantiate to the tail of the async XHR (as I've done in the current demo with the async compile). So I don't think this will require much work on Unity's side. Also, from what I've seen, our current AngryBots wasm build is suboptimal and needs to be freshened up anyway.

Oh, I didn't understand that having the imports available before the wasm XHR even begins is key here. That's a lot trickier then. So most of what I said before is wrong.

For us to have the imports, we need to have downloaded, parsed and run all the JS glue. If we do all that before the wasm XHR even begins then it's a very different loading scheme and set of tradeoffs than we have now. In particular, for small to medium projects, maybe this wouldn't even be a speedup, I guess we'd have to measure this - if we haven't already?

This would not be something simple for Unity to do. It would require significant changes to the code emcc emits, so that the JS glue can be run before the compiled code is ready.

Perhaps we'd want to consider a new model of emitted JS, one JS file for the imports, one JS file for the rest? And it would be opt-in, so we wouldn't break anyone. Anyhow, there's a lot of consider, and without measurements it's hard to guess what is optimal.

@kripken Not before the XHR but after it completes, and some time before we start running the script that wants to synchronous access to the exports object. I'd expect it could be as simple as putting that exports-using code in some callback called when the instantiation resolves.
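
A minimal sketch of "put the exports-using code in a callback", reusing the `Module`/`doSomething` names from the earlier breaking-change example (the wrapper function and its parameters are illustrative, not emcc's actual output):

```javascript
// Everything that needs synchronous access to the exports runs inside the
// resolution callback, instead of straight-line after the script tag.
function startApp(wasmBytes, imports, Module, doSomething) {
  return WebAssembly.instantiate(wasmBytes, imports).then(({instance}) => {
    Object.assign(Module, instance.exports); // e.g. Module.my_func now exists
    doSomething();                           // safe: exports are attached
  });
}
```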

Hmm, I guess I still don't fully understand this then, sorry (in particular, I don't understand why the asyncness interacts with getting the imports - couldn't the benefit of having the imports also apply to synchronous compilation?). But it seems like the issues I mentioned above are still relevant even if this happens after the XHR completes - that is, we have a single script now that generates the imports and also receives the exports. Splitting that up may lead to tradeoffs - we'll just need to measure when we have time.

About putting the exports-using code in a callback, sadly it won't just work for the reasons mentioned earlier, but we could investigate as part of those measurements adding some new compilation mode.

The signature of instantiate is WebAssembly.instantiate(bytes, importObj), so to fire off the async compile+instantiate, you need to pass importObj.

Talking offline, @lukewagner impressed upon me that the main issue here is having the Memory while compiling the module. Overall then, it seems that there are three issues that relate to how easy it is to use this new API in practice:

  1. Providing the Memory when compiling.
  2. Providing all the imports when compiling (superset of the previous).
  3. Doing all this asynchronously.

Given how the toolchain currently works, doing 2 + 3 is hard, as described above, because it will break existing users. We can do it, but we may not want it on by default, at least not initially - we'd need to consult community members etc. And doing it really well - with no new overhead - will require time and work (doing it quickly can be done by adding a layer of indirection on either the imports or exports).

But some other options are trivially easy to use:

  • 1 + 3 would require a new API like instantiate(bytes, Memory) -> Promise. The reason this is easy to use is that the Memory can be created early anyhow (while almost all the other imports are JS functions, which we can't have early on).
  • 2 by itself, without 3, i.e., new Instance(bytes, imports). That is, synchronous compilation on binary data + imports. The reason this is easy to use is that our current code does this: instance = new WebAssembly.Instance(new WebAssembly.Module(getBinary()), imports) so we would just fold that into 1 API call.

I think the last option makes sense. It basically means adding a sync version of the new API, making things more symmetrical with our existing synchronous compilation option. And I don't see a reason to tie the new optimization of knowing the Memory at compile time to asynchronous compilation? (async is great, but a separate issue?)
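
For reference, the current synchronous two-step path that the last option would fold into a single call (the wrapper name is made up; the two constructors inside are today's real API):

```javascript
// The synchronous two-step path the toolchain emits today; the proposal
// above would fold these into one call taking (bytes, imports) directly.
function instantiateSync(bytes, imports) {
  const module = new WebAssembly.Module(bytes);     // sync compile (may jank)
  return new WebAssembly.Instance(module, imports); // sync instantiate
}
```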

I suggest meta objects that describe the imports (the memory) to break the constraint that the imports need to be allocated before compiling. The instantiate method could then throw an error if the supplied memory did not match what was expected, and would not be delayed by recompiling.

It would also be very useful to be able to compile before allocating the linear memory on a memory-constrained device, to be able to optimize compilation for memory allocated at zero in the linear address space, and perhaps for memory with guard zones, and to optimize for memory with a fixed maximum size that can be a power of two plus a spill zone. This meta object could also be used by a user-level wasm decoder/rewriter to emit code optimized for the negotiated memory characteristics.

This is something that was suggested as necessary around the start of this project, many years ago now!

@kripken While the sync APIs are, I believe, technically necessary, they definitely should be the not-recommended path or else we'll be introducing tons of unnecessary main-thread jank, a regression compared to asm.js +
