Design: Discussion: WebAssembly, Unicode and the Web Platform

Created on 22 May 2021  ·  19 Comments  ·  Source: WebAssembly/design

This issue is for accompanying discussion of "WebAssembly, Unicode and the Web Platform". The presentation is pre-recorded, which is what we decided to try out in https://github.com/WebAssembly/meetings/pull/775, with discussion time scheduled for June 22nd's CG video meeting.


[Embedded pre-recorded video presentation: "WebAssembly, Unicode and the Web Platform"]

Please note that I mention some concepts that I would expect to be well known among CG members, but I decided to include them nonetheless to also make the presentation approachable for those unfamiliar with the topic. Feedback welcome!

All 19 comments

Here are a few potential solutions I have collected from offline feedback so far, for consideration:

Separate WTF-16

In Interface Types, define:

string   := list char
string16 := list u16

Define a coercion applied during linking, with the following cases:

| From       | To         | Expectation                                         |
|------------|------------|-----------------------------------------------------|
| string     | string16   | Re-encode from UTF-8 to UTF-16                      |
| string16   | string     | Re-encode from WTF-16 to UTF-8 (replacement option) |

The coercion ensures that a string16 module works on a WASI host, and that a string module and a string16 module can interface with each other, even when both a string and a string16 module or host call the same string or string16 export, which would otherwise be ambiguous.
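To make the second row concrete, here is a minimal JS sketch (not part of the proposal; the function name is made up) of re-encoding WTF-16 code units to UTF-8 with the replacement option, i.e. lone surrogates become U+FFFD. The opposite direction in the first row is lossless by construction, so only this direction needs a replacement policy.

/** Re-encodes a WTF-16 code unit sequence as UTF-8, replacing lone surrogates. */
function wtf16ToUtf8WithReplacement(units) { // units: Uint16Array
  let str = "";
  for (let i = 0; i < units.length; ++i) {
    const u = units[i];
    const isHi = u >= 0xD800 && u <= 0xDBFF;
    const isLo = u >= 0xDC00 && u <= 0xDFFF;
    if (isHi && i + 1 < units.length && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
      str += String.fromCharCode(u, units[++i]); // well-formed pair, kept as-is
    } else if (isHi || isLo) {
      str += "\uFFFD"; // lone surrogate: the lossy replacement step
    } else {
      str += String.fromCharCode(u);
    }
  }
  // TextEncoder's USVString conversion would perform the same replacement on its
  // own; the explicit scan above just makes the lossy step visible.
  return new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
}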

This one also introduces an ambiguity in the Web embedding, in that passing a list u16 to JS could become either a Uint16Array or a DOMString. A JS-wide coercion from Uint16Array to DOMString seems undesirable, but the JS type could be hinted at by explicitly using the alias string16 (with its own binary id, string16 :> list u16 being purely semantic where required) instead of list u16 in an adapter module. Hence the alias. In this case, string16 would become a DOMString while list u16 would become a Uint16Array.

I am not particularly attached to the name string16 and would be fine with any other name, or any alternative that does not require a name / id to solve the ambiguity.

An optimization akin to list.is_canon is not necessary here, since list.count can be used. Also, the door towards UTF-any and a potential Latin1 optimization, as described below, can be kept open by reserving space for a future immediate in list.*_canon adapter instructions.

UTF-any

In Interface Types, define:

list.lift_canon $unit [...]
list.is_canon $unit [...]
list.lower_canon $unit [...]

where the $unit immediate is
  0: 8-bit (UTF-8, ASCII-compatible)
  1: 16-bit (UTF-16)
  2: 32-bit (UTF-32)
  3: 8-bit (Latin1, narrow UTF-16)

This potential solution can be considered where well-formedness is required. It would avoid double re-encoding overhead and indirect effects on code size, but leaves the surrogate problem unaddressed. Note that $unit 1-3 may be added post-MVP as further optimizations, or we may start with some of them right away.

WTF-any

In Interface Types, define:

list.lift_canon $unit [...]
list.is_canon $unit [...]
list.lower_canon $unit [...]

where the $unit immediate is
  0: 8-bit (WTF-8, ASCII-compatible)
  1: 16-bit (WTF-16)
  2: 32-bit (Code points except surrogate pairs)
  3: 8-bit (Latin1, narrow WTF-16)

This potential solution would also require redefining char from Unicode Scalar Values to Unicode Code Points, while restricting lists of char to not contain surrogate pairs (but allowing isolated surrogates), potentially enforced when lifting. Again, which concrete $units to include in an MVP is open for discussion.

This one does not introduce lossiness on its own, so everything else indeed becomes just a post-MVP optimization.

Integrated W/UTF-any

In Interface types, define:

  • Lift "list of Unicode Code Points amending surrogate pairs" but lower "list of Unicode Scalar Values". This is a non-functional design-only change when only applied during 16-bit lifting.
  • Add an optional passthrough option when lowering to obtain a "list of Unicode Code Points". This is a functional addition enabling lossless passthrough.

By doing so, we would achieve a well-formed default while keeping an opt-in lossless passthrough available.

IIUC, the root issue is that IT wants strings to be sequences of Unicode code points but some languages consider strings to be sequences of i8 or i16 values that may or may not correspond to well-formed Unicode strings. One simple solution would be to have languages/APIs that accept or produce invalid Unicode strings use (list u8) or (list u16) (possibly with some nice alias like byte_string to communicate intent) rather than the IT string type, which IIRC is an alias for (list char). Have the trade-offs of doing that been discussed anywhere yet?

I think the issue is a little more nuanced, in that IT wants to define char as "Unicode Scalar Values", which have a hole in them where the surrogate code points would be, and as such cannot represent isolated surrogates. WTF, on the other hand, uses "Unicode Code Points" without this restriction, but with sequences restricted to not contain surrogate code point pairs (these would instead be the supplementary code points > U+FFFF they encode, while isolated surrogates are OK). Is this what you meant?

Other than that, I think C-like byte strings in the style of const char* that can contain anything have not been discussed yet. I may have missed it, though.

One simple solution would be to have languages/APIs that accept or produce invalid unicode strings use (list u8) or (list u16) (possibly with some nice alias like byte_string to communicate intent) rather than the IT string type, which IIRC is an alias for (list char).

This is currently my preferred solution too - a wtf16string type would be an alias for (list u16) in the same way that string is currently defined as an alias for (list char). The value of the alias, IIUC, is that the result of a function returning (list u16) called by (e.g.) JS would appear as a JS list (of numbers), whereas the result of a function returning wtf16string could be specified as appearing in JS as a JS string.

Adding an additional wtf16string alias to the draft canonical ABI seems to be relatively unintrusive.

WTF, on the other hand, is "Unicode Code Points" without this restriction, but sequences being restricted to not contain surrogate code point pairs (these would be substituted with supplementary code points > U+FFFF, while isolated surrogates are OK).

Ah, does that mean that WTF-8 is not the same as a plain (list u16) because it has this additional restriction? I hadn't appreciated that nuance. My intuition is that it would be overkill to have both a string type representing sequences of well-formed Unicode scalar values as well as a wtf16string type that is almost a (list u16) but has additional restrictions. Would using an alias for an unrestricted (list u16) work well enough for systems that don't enforce Unicode well-formedness? This note in the WTF-8 spec suggests that it would.

Ah, does that mean that WTF-8 is not the same as a plain (list u16) because it has this additional restriction?

It states "Like UTF-8 is artificially restricted to Unicode text in order to match UTF-16, WTF-8 is artificially restricted to exclude surrogate code point pairs in order to match potentially ill-formed UTF-16." Iiuc, it treats these similarly to how UTF-8 treats overlong or truncated byte sequences. WTF-8 can represent any (list u16), but not every (list u8) is valid WTF-8.

Would using an alias for an unrestricted (list u16) work well enough for systems that don't enforce unicode well-formedness?

WTF-16 maps 1:1 to arbitrary u16 values, and it just depends on how these values are interpreted, so yes, (list u16) would work.

IIUC, WTF-8 is not quite the same as arbitrary list u8. For example, it forbids "surrogate pair byte sequences" (see here).

However, WTF-16 _is_ the same as list u16. It's a little weird that they share a naming theme.

EDIT: should have refreshed :)

I posted a first question/answer focused just on the question of surrogates in interface-types/#135. I think that's the high order bit and, if we can agree on that, then a subsequent discussion on supporting one or more encoding formats will be simpler.

Thank you, Luke.

If you'd be willing to support "Separate WTF-16" as described above (the coercion is crucial to enable accessing WASI APIs and to interface with JavaScript without glue code), I would feel comfortable with the suggested char value range. WTF-16 languages would then have the escape hatch they need to integrate as well as possible with modules written in the same language, with JavaScript, and, by means of replacement, with UTF-* languages. I'd also feel much better about WASI, by the way, as the major pain point introduced by mismatched string encodings would be resolved with the coercion in place.

Having a separate string16 type like you're suggesting with surrogates would still have all the problems with surrogates outlined in interface-types/#135, so I think it wouldn't be any better to have two string types rather than one (especially if they're implicitly inter-convertible; then they're not meaningfully separate types). Having two string types would also make things concretely worse by introducing a mental burden on every interface designer and consumer ("why are there two types? what's the difference? when should I use one or the other?"). Lastly, adding support for WTF-16 would generally go against the future standards evolution guidance from the Web/IETF also mentioned in interface-types/#135. Thus, I don't think we should consider adding surrogate-bearing types unless we have actual concrete evidence that Interface Types isn't viable without them.

For Web-exclusive use cases, I think it would make sense to solve the problem in JS or the Web APIs. E.g., it's easy to imagine JS APIs for "binding" wasm imports and exports. This approach is already being taken in other emerging JS APIs, like stack-switching, and I've been wondering whether where we're going is general "bind import"/"bind export" JS APIs that are able to handle the Web-specific cases of Promises, JS strings and typed array views.

Having a separate string16 type like you're suggesting with surrogates would still have all the problems with surrogates outlined in interface-types/#135

Technically true, but this misses that strings would at least always work between separately compiled modules in the same language, any compatible language, and JavaScript, even without upfront knowledge of what kind of module one is interfacing with. That's typically the majority of cases, I think. As such it seems like a reasonable compromise to me, also because it allows dedicating the desired char value range to well-formed (USV) strings.

Having two string types would also make things concretely worse by introducing a mental burden on every interface designer and consumer ("why are there two types? what's the difference? when should I use one or the other?")

The alternative of occasional breakage seems way worse to me, so if that's what it takes I think that most people will be fine with it. Perhaps a good name for the second string type (domstring?) is sufficient to mitigate this minor problem.

Lastly, adding support for WTF-16 would generally go against the future standards evolution guidance from the Web/IETF

Unfortunately, in the absence of an escape hatch for affected languages it doesn't matter much to me how sound anyone's reasoning about a purported trend is: as long as the IT MVP is going to break something somewhere for someone, and is largely useless for the JavaScript-like language I am working on, I can only oppose it.

Hence I am trying to find a reasonable solution or compromise everyone can live with, and it would make me happy if we could cooperate.

I don't see how what you're saying addresses the problems raised in interface-types/#135 or provides counterevidence that IT wouldn't be viable in general without the inclusion of a new domstring type. The existing JS API already provides a general-purpose escape hatch for doing arbitrary value conversions at boundaries, so I don't see how a second escape hatch is needed at this early point in time. I think we simply need more experience-based evidence to counteract the strong guidance we've been given against further propagating strings containing surrogates.

(FWIW, If we can agree on the absence of surrogates, I think it would make sense to talk about supporting UTF-16 as an additional encoding in the canonical ABI of string. But that's a whole separate topic with a few options, so I don't want to mix that up with the abstract string semantics which need to be understood first.)

I appreciate your second paragraph, in that it would already solve some very annoying problems. I agree that supporting UTF-16 is useful separately, and I would appreciate it being added to the explainer / MVP. Count me in!

I am having a hard time following your arguments in the first paragraph, however. Perhaps if you don't believe me, here is Linus Torvalds explaining a very important rule that I think extends beyond the Linux kernel: Don't break userspace. And here he is in the same talk, upholding the programmer wisdom of "If it's a bug that people rely on, it's not a bug, it's a feature", only to continue with:

It's really sad when the most core library in the whole system is OK with breaking stuff as long as things "improve" and they "fix" the ABI.

And not having to worry about surrogates is indeed a sort of feature, in that users can do a careless substring(0, 1) here or there and call an imported function with it, or can split(""), pass along and join() again, or create a StringBuilder as a module that won't occasionally yield double replacement characters as if by magic. I mean, there is a reason why a bunch of very popular languages opted against enforcing well-formedness, and when Wasm wants to support these languages and their users well, then the more modular Wasm becomes, the more boundaries there'll be, the harder it will become to tell in which module a function lives, and the more apparent the problem will become.
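For illustration, a small JS snippet (not from the original comment) of the kind of innocent code meant here; splitting a non-BMP character is lossless inside JS, but crossing a USV-enforcing boundary silently substitutes U+FFFD:

const s = "🙂";                    // one code point (U+1F642), two UTF-16 code units
const first = s.substring(0, 1);   // "\uD83D", a lone high surrogate
const parts = s.split("");         // ["\uD83D", "\uDE42"]
console.log(parts.join("") === s); // true: lossless within JS
const roundTripped = new TextDecoder().decode(new TextEncoder().encode(first));
console.log(roundTripped);         // "\uFFFD": replaced at the USV boundary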

I really do not know how much more evidence I need to prove that designing something in a way that ignores the current reality is a bad idea. In fact, this seems to be acceptable only in Interface Types, while we hold every other proposal to very high standards. And while I'm not an expert on this, I think the Unicode standard itself made this exact same mistake with regard to the needs of UCS-2 languages by insisting on USVs, leading to about a decade's worth of similarly desperate discussions (I can recommend the entire thread, but especially the last comment before it went silent), culminating in 2014 in the description of the commonly applied practical solution that is the WTF-8 encoding.

Note that emitting a replacement character when encountering character encoding errors in bitstreams is a well-known form of hazardous silent data corruption and systems that require integrity forbid doing that.

Relatedly, if codePointAt would throw an exception when hitting a lone surrogate, you may very well end up with a bug that breaks your whole application because someone accidentally put an emoji character at the wrong position in a string in a database

Unfortunately ecmascript makes it very difficult to ensure you do not generate strings with unpaired surrogate code points somewhere in them, it's as easy as taking the first 157 .length units from a string and perhaps appending "..." to abbreviate it. And it's a freak accident if that actually happens in practise because non-BMP characters are rare. We should be very reluctant to introduce hazards hoping to improve our Unicode hygiene.

The reason JS, Java and C# have the strings they do is that, by the time Unicode realized that 2 bytes wasn't enough and thus UCS-2 wasn't viable, a bunch of code was already written, so these languages simply didn't have a choice. Similarly, for Linux syscalls exposed to userspace. In contrast, no code exists today that is using APIs defined in IT, so we don't have the same backwards-compatibility requirements. For many reasons, wasm and Interface Types are intentionally not seeking to perfectly emulate an existing single language or syscall ABI. It may be a valid goal, but that would be a separate project/standard/layer than the component model. This is the benefit of layering and scoping: we don't need one thing that achieves all possible goals.

I want to reemphasize that of course inside a component, strings can be represented in whatever way is appropriate to the language, so we're really only talking about the semantics of APIs. As for the already-defined Web APIs:

  1. we have plenty of reasons to believe that it's not necessary (and often not meaningful) to pass surrogates
  2. there's always the escape hatch of using custom JS API bindings (it's not a requirement that 100% of Web APIs have a JS-glue-less binding)

Thus, I still don't think we have any evidence suggesting that IT won't be viable without carrying forward these WTF-16 string semantics, which is I think the appropriate question for an MVP.

A couple points I disagree with:

I don't see how what you're saying addresses the problems raised in interface-types/#135

This is a separate issue now, and in the previous post I was talking about what I believe to be a reasonable compromise to solve lossiness. In particular, I would be OK with your reasoning in the separate issue, but only if there is a lossless fallback available. This is not an either/or in my opinion. If not, I would remain of the opinion that WTF-8/16 is the more inclusive, less restricted choice, and as such is preferable, also because one of Wasm's high-level goals is to integrate seamlessly with the Web platform and maintain the backwards-compatible nature of the Web, and that also applies to Interface Types.

The existing JS API already provides a general-purpose escape hatch for doing arbitrary value conversions at boundaries, so I don't see how a second escape hatch is needed at this early point in time.

there's always the escape hatch of using custom JS API bindings

This is sadly not sufficient in our case, where we currently have glue code like:

const STRING_SMALLSIZE = 192; // break-even point in V8
const STRING_CHUNKSIZE = 1024; // mitigate stack overflow
const utf16 = new TextDecoder("utf-16le", { fatal: true }); // != wtf16

/** Gets a string from memory. */
function getStringImpl(buffer, ptr) {
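  // SIZE_OFFSET is defined elsewhere in the loader; it locates the string's byte
  // length relative to `ptr`, and the final `>>> 1` converts bytes to u16 units.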
  let len = new Uint32Array(buffer)[ptr + SIZE_OFFSET >>> 2] >>> 1;
  const wtf16 = new Uint16Array(buffer, ptr, len);
  if (len <= STRING_SMALLSIZE) return String.fromCharCode(...wtf16);
  try {
    return utf16.decode(wtf16);
  } catch {
    let str = "", off = 0;
    while (len - off > STRING_CHUNKSIZE) {
      str += String.fromCharCode(...wtf16.subarray(off, off += STRING_CHUNKSIZE));
    }
    return str + String.fromCharCode(...wtf16.subarray(off));
  }
}

First, since we care a lot about Chrome and Node.js, we found that V8's TextDecoder for UTF-16LE is much slower than in other engines (SM's is really fast), so String.fromCharCode is actually faster in V8 up to a certain break-even point. So we decided to optimize around that for now. Next, there exists no TextDecoder for WTF-16 (which is separately annoying), so we first try to decode well-formed UTF-16, and if that fails, we let it throw and fall back to chunking through the much slower String.fromCharCode. The chunking is necessary because one cannot simply apply String.fromCharCode to a long string, as that is likely to overflow the stack.

On the other hand, Rust for example would not need this, which is one of the reasons why I think that IT, right now, is not as neutral as it should be. In general I think that the point of IT strings is indeed to be able to interface with JS well, which is still our primary interop target.

no code exists today that is using APIs defined in IT, so we don't have the same backwards-compatibility requirements

The first half is technically true, since IT does not exist yet, but IIUC our requirements do include improving existing use cases, for example accounting for the clumsy chunk of glue code above. Ideally for as many languages as possible, so that post-MVP work indeed becomes "just an optimization", as you said in your presentation. On the contrary, right now IT basically starts with what is already an optimization for languages that can make use of a UTF-8 encoder/decoder, which I think is not neutral.

wasm and Interface Types are intentionally not seeking to perfectly emulate an existing single language or syscall ABI

I read this as if I were of this opinion, which I totally am not. I am willing to give you the benefit of the doubt here, but would like to add that in my opinion, IT is currently unnecessarily restricted and as such serves only a very specific set of languages well. On the contrary, WTF-8/16 is the more inclusive encoding that I would have expected to be the logical default, also because it round-trips to JS strings. We disagree here, but only in the absence of a proper escape hatch. If a viable lossless alternative existed, so that nobody is broken or unnecessarily disadvantaged, I would be fine with your reasoning about the default string type.

we have plenty of reasons to believe that it's not necessary (and often not meaningful) to pass surrogates

We disagree here. In particular, I think my presentation and comments present reasonable doubt that it may, in some cases, even though rare, be very meaningful (say, where integrity is required), and I am of the opinion that "We should be very reluctant to introduce hazards hoping to improve our Unicode hygiene." That is, if we can, I believe that we should design the canonical ABI in a way that is guaranteed to work in the following important cases as well: Java/C#/AS<->JS and Java/C#/AS<->Java/C#/AS. Replacement on other paths is probably unavoidable, but at least languages and users then have a choice, and the default is not already broken in rare cases.

I still don't think we have any evidence suggesting that IT won't be viable without carrying forward these WTF-16 string semantics

In the presence of reasonable doubt and the absence of willingness to explore what I believe to be a reasonable compromise, I would expect that the burden of proof is now on you. Again, I am willing to leave the default string to you and a well-formed future, but not at the expense of not accounting for what may be rare, but still real, hazards. Many popular languages can be affected by this, and that may become really hard to justify in the future once they realize it.

I agree that the JS glue code isn't ideal, but I think the right fix for that is in the JS API or in JS, not by adding the concept of wtf-16-string to the whole future component ecosystem. Beyond that, I don't see new information to respond to that hasn't already been responded to; it seems like we're mostly disagreeing on questions of goals/scope.

I would expect the TextDecoder anomaly to be even harder to fix in JS, since it has apparently already been decided that this is out of scope there. And JS can kind of get away with that, because TextDecoder in JS is not something that is invoked between two function calls, but mostly used to retrieve data over the network or from storage.

The even more interesting anomaly, however, is that there isn't even a TextEncoder for UTF-16LE, so one has to do:

/** Allocates a new string in the module's memory and returns its pointer. */
function __newString(str) {
  if (str == null) return 0;
  const length = str.length;
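  // __new, STRING_ID and memory are provided by the surrounding loader / module instance.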
  const ptr = __new(length << 1, STRING_ID);
  const U16 = new Uint16Array(memory.buffer);
  for (var i = 0, p = ptr >>> 1; i < length; ++i) U16[p + i] = str.charCodeAt(i);
  return ptr;
}

As you can see, this is a major pain point for languages like Java, C#, AS and others, and both of these functions would still be necessary when a list u16 is passed. And in the context of this issue it isn't exclusive to the JS API, in that double re-encoding plus lossiness between two modules of the same language isn't so different :(

There's a whole space of options beyond TextEncoder/TextDecoder for how to address that use case on the Web. Another is extending new WebAssembly.Function() (which is already implemented in some browsers) to perform string conversions by adding additional optional parameters to the constructor. Such an approach would also make the functionality available to non-component uses of wasm (and potentially much sooner), reinforcing the point that the JS API would be the right place to address this use case.
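Purely as a hypothetical illustration of that direction (the options object below and the "wtf-16-string" value are invented, `instance` stands for an instantiated module, and the current constructor from the type reflection proposal only takes a type descriptor and a function), such an extension might look roughly like:

// Hypothetical only: the third argument is made up for illustration and is not
// part of any proposal or implementation.
const getString = new WebAssembly.Function(
  { parameters: ["i32"], results: ["externref"] },
  instance.exports.getString,
  { resultConversions: ["wtf-16-string"] }
);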

FYI: Added the "Integrated W/UTF-any" option that came up in https://github.com/WebAssembly/interface-types/issues/135#issuecomment-863493832 to the list of suggestions above :)
