Design: UTF-8 for all string encodings

Created on 15 Feb 2017  ·  80 Comments  ·  Source: WebAssembly/design

Currently:

  • We use var[u]int for most of WebAssembly's binary integer encoding. Consistency is good.
  • We use length + bytes for all "strings" such as import / export, and we let the embedder apply extra restrictions as they see fit (and JS.md does). Separation of concerns, and leeway for embedders, are good.

#984 opens a can of worms w.r.t. using UTF-8 for strings. We could either:

  • Do varuint for length + UTF-8 for each byte; or
  • Do varuint for number of codepoints + UTF-8 for each codepoint.
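
For concreteness, here's a rough sketch of how the two options would lay out the two-codepoint string "Ök" (assuming the LEB128 varuint encoding wasm already uses for lengths):

```typescript
// Illustrative layout only. "Ök" is 2 codepoints and 3 UTF-8 bytes (0xC3 0x96 0x6B).
//
// Option 1: varuint byte length, then the UTF-8 bytes
//   0x03  0xC3 0x96 0x6B          // length = 3 bytes
//
// Option 2: varuint codepoint count, then the UTF-8 bytes
//   0x02  0xC3 0x96 0x6B          // count = 2 codepoints
//
// With option 1 a consumer can skip the field without decoding it; with
// option 2 it has to scan the UTF-8 to find where the field ends.
```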

I'm not opposed to it—UTF-8 is super simple and doesn't imply Unicode—but I want the discussion to be a stand-alone thing. This issue is that discussion.

Let's discuss arguments for / against UTF-8 for all strings (not Unicode) in this issue, and vote 👍 or 👎 on the issue for general sentiment.

All 80 comments

Argument for UTF-8: it's very simple; an encoder and a decoder can each be written in a few lines of JavaScript. Again, UTF-8 is not Unicode.
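
To illustrate that simplicity, here's a minimal sketch of the codepoint-to-bytes half (the decoder is the same branches in reverse); nothing below is Unicode-aware beyond the byte layout, and the function name is just for illustration:

```typescript
// Minimal sketch: encode a single code point (an integer in 0..0x10FFFF) as UTF-8 bytes.
// Codepoints are treated as opaque integers here; no surrogate or range checks.
function encodeCodePoint(cp: number): number[] {
  if (cp < 0x80) return [cp];
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  if (cp < 0x10000) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Example: U+00D6 ("Ö") -> [0xc3, 0x96]; U+1F600 -> [0xf0, 0x9f, 0x98, 0x80]
```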

Argument against UTF-8: it's ever so slightly more complicated than length + bytes, leading to potential implementation divergences.

Again, UTF-8 is not Unicode.

What are you even saying? This is a nonsense sentence.

I think you're trying to say that there's no need to pull in an internationalization library. This is true - mandating that strings are encoded in UTF-8 has nothing to do with all the more complicated parts of Unicode, like canonicalization. Those are useful tools when you're doing string work that interfaces with humans, but in the same way that a trig library is useful to people doing math, and not relevant when deciding how to encode integers.

But UTF-8 is literally a Unicode encoding; your statement is meaningless as written. ^_^

But UTF-8 is literally a Unicode encoding; your statement is meaningless as written. ^_^

Yes, I'm specifically referring to the codepoint encoding that UTF-8 describes, not the treatment of codepoints proper (for the purpose of this proposal, a codepoint is an opaque integer). Put in wasm-isms, UTF-8 is similar to var[u]int, but more appropriate to characters. Further, UTF-8 isn't the only Unicode encoding, and it can be used to encode non-Unicode integers. So, UTF-8 isn't Unicode.

A further proposal would look at individual codepoints and do something with them. This is not that proposal.

And there would be no reason to. No Web API has found the need to introspect on the codepoints beyond strict equality comparison and sorting, unless it's literally an i18n API.

Another option is byte length + UTF-8 for each code point ( @jfbastien unless this is what you meant when you said UTF-8 for each byte, which I admit didn't make sense to me). I don't think this would make things any more difficult for a primitive parser that doesn't really care, while allowing a sophisticated Unicode library to take a byte array, offset, and length as input and return a string.

I agree with the definition as "UTF-8 code points", which are just integers. The binary spec should leave it at that. Individual embedders can define rules around allowed code points, normalization and other nuances. Analysis tools could provide warnings for potential compatibility issues.

I think error handling decisions should also be left to the embedders. A system that accesses WASM functions by index rather than by name has no need for the names to be valid (and they'd be easy to skip over with a byte-length prefix).
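
As a sketch of that last point (assuming the field layout is a LEB128 varuint byte length followed by that many bytes, and a hypothetical helper name):

```typescript
// Sketch: skip over a length-prefixed name without decoding or validating it.
function skipName(bytes: Uint8Array, offset: number): number {
  // Read the LEB128 varuint byte length.
  let len = 0, shift = 0, b: number;
  do {
    b = bytes[offset++];
    len |= (b & 0x7f) << shift;
    shift += 7;
  } while (b & 0x80);
  // Jump over the name bytes without looking at them.
  return offset + len;
}
```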

Here's an attempt at summarizing the underlying issues and their reasons. Corrections and additions are most welcome.

Should wasm require module import/export identifiers be valid UTF-8?

My understanding of the reasons against is:

  • Processing imports and exports is on the critical path for application startup, and there's a desire to avoid anything which would slow it down.
  • The broad invariant "the core wasm spec does not interpret strings". String interpretation is complex in general, and there's a desire to encapsulate it and have broad invariants and boundaries that one can reason about at a high level.
  • WebAssembly decoders are often security-sensitive, so there's a general desire to minimize the amount of code involved.
  • Some WebAssembly producers may want to embed arbitrary data in these identifiers, and it's more convenient for them to encode the data however they want instead of mangling it into string form.

Should wasm recommend UTF-8 in areas where it doesn't require it?

The reason for would be that even if we can't require it, mentioning UTF-8 may discourage needless incompatibilities among the ecosystem.

My understanding of the reason against is that even mentioning UTF-8 would compromise the conceptual encapsulation of string interpretation concerns.

Should wasm specify UTF-8 for name-section names?

The reason for is: The entire purpose of these names is to be converted into strings for display, which is not possible without an encoding, so we should just specify UTF-8 so that tools don't have to guess.

My understanding of the reason against is: If wasm has other string-like things in other areas that don't have a designated encoding (i.e. imports/exports as discussed above), then for consistency's sake it shouldn't designate encodings for any strings.

@sunfishcode provides a good summary, but I want to add three crucial points.

@jfbastien, it would be the most pointless of all alternatives to restrict binary _syntax_ (an encoding) but not _semantics_ (a character set) for strings. So for all practical purposes, UTF-8 implies Unicode. And again, this is not just about engines. If you define names to be Unicode, then you are forcing that on all Wasm ecosystems in all environments. And that pretty much means that all environments would be required to have some Unicode support.

@tabatkins, I think there is a domain error underlying your argument. None of the strings we are talking about are _user-facing_. They are _dev-facing_ names. Many/most programming languages do not support Unicode identifiers, nor do tools. Can e.g. gdb handle Unicode source identifiers? I don't think so. So it is quite optimistic (or rather, unrealistic) to assume that all consumers have converged on Unicode _in this space_.

And finally, the disagreement is not _whether_ Wasm on the Web should assume UTF-8, but _where_ we specify that.

I think there is a domain error underlying your argument. None of the strings we are talking about are user-facing. They are dev-facing names. Many/most programming languages do not support Unicode identifiers, nor do tools. Can e.g. gdb handle Unicode source identifiers? I don't think so. So it is quite optimistic (or rather, unrealistic) to assume that all consumers have converged on Unicode in this space.

"dev-facing" means "arbitrary toolchain-facing", which means you need to agree on encoding up-front, or else the tools will have to do encoding "detection" (that is to say, guessing, which is especially bad when applied to short values) or have out-of-band information. Devs are still users. ^_^

If you think a lot of toolchains aren't going to understand Unicode, then I'm unsure why you think they'd understand any other arbitrary binary encoding. If that's your limitation, then just specify and require ASCII, which is 100% supported everywhere. If you're not willing to limit yourself to ASCII, tho, then you need to accept that there's a single accepted non-ASCII encoding scheme - UTF-8.

Saying "eh, most things probably only support ASCII, but we'll let devs put whatever they want in there just in case" is the worst of both worlds.

Saying "eh, most things probably only support ASCII, but we'll let devs put whatever they want in there just in case" is the worst of both worlds.

@tabatkins, nobody is proposing the above. As I said, the question isn't _whether_ but _where_ to define such platform/environment-specific matters. Wasm is supposed to be embeddable in the broadest and most heterogeneous range of environments, some much richer than others (for example, JS _does_ support Unicode identifiers). Consequently, you want to allow choosing on a per-platform basis. Hence it belongs into platform API specs not the core spec.

There's no choice to make, tho! If your embedding environment doesn't support non-ASCII, you just don't use non-ASCII in your strings. (And if this is the case, you still need encoding assurance - UTF-16 isn't ASCII-compatible, for example!)

If your environment does support non-ASCII, you need to know what encoding to use, and the correct choice in all situations is UTF-8.

What environment are you imagining where it's a benefit to not know the encoding of your strings?

it would be the most pointless of all alternatives to restrict binary syntax (an encoding) but not semantics (a character set) for strings. So for all practical purposes, UTF-8 implies Unicode.

No, it absolutely doesn't. For example, it's perfectly reasonable to simultaneously (a) restrict a string to the ASCII characters, and (b) dictate that it's encoded in UTF-8. Using ASCII characters doesn't imply an encoding, or else all encodings would be ASCII-compatible! (For example, UTF-16 is not.) So you still have to specify something; UTF-8, being "ASCII-compatible", is fine for this.
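
A small sketch of that point, comparing the bytes of the same ASCII-only string under two Unicode encodings (TextEncoder only emits UTF-8, so the UTF-16LE bytes are written out by hand):

```typescript
// The same ASCII-only string under two encodings.
const utf8 = new TextEncoder().encode("abc");
// utf8 is Uint8Array [0x61, 0x62, 0x63] -- byte-identical to ASCII.
const utf16le = Uint8Array.of(0x61, 0x00, 0x62, 0x00, 0x63, 0x00);
// A consumer that guesses the wrong encoding mis-reads even pure-ASCII names,
// which is why "ASCII characters" alone doesn't settle the encoding question.
```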

Again, if you are okay with restricting these names to ASCII-only, then it's reasonable to mandate the encoding be US-ASCII. If you want it to be possible to go beyond ASCII, then it's reasonable to mandate the encoding be UTF-8. Mandating anything else, or not mandating anything at all (and forcing all consumers to guess or use out-of-band information), are the only unreasonable possibilities.

And again, this is not just about engines. If you define names to be Unicode, then you are forcing that on all Wasm ecosystems in all environments. And that pretty much means that all environments would be required to have some Unicode support.

Again, this looks like you're talking about internationalization libraries. What we're discussing is solely how to decode byte sequences back into strings; that requires just knowledge of how to decode UTF-8, which is extremely trivial and extremely fast.

Unless you're doing human-friendly string manipulation, all you need is the ability to compare strings by codepoint, and possibly sort strings by codepoint, neither of which require any "Unicode support". This is all that existing Web tech uses, for example, and I don't see any reason Wasm environments would, in general, need to do anything more complicated than this.

I'm in favor of mandating utf8 for All The Strings. Pure utf8 decoding/encoding seems like a pretty low impl burden (compared to everything else) for non-Web environments. Also, from what I've seen, time spent validating utf8 for imports/names will be insignificant compared to time spent on everything else, so I don't think there's a performance argument here.

Practically speaking, even if we didn't mandate utf8 in the core wasm spec, you'd have a Bad Time interoperating with anything if your custom toolchain didn't also use utf8 unless you're a total island and then maybe you just say "screw it" and do your own non-utf8 thing anyway... because then who cares.

What I'd realllly like to do, though, is resolve #984, which seems to block on this...

@lukewagner I don't think #984 is blocked on this. 😄

I guess you're right.

What environment are you imagining where it's a benefit to not know the encoding of your strings?

@tabatkins, it seems I've still not been clear enough. I don't imagine such an environment. However, I imagine a wide spectrum of environments with incompatible requirements. Not everything is a subset of UTF-8, e.g. Latin1 is still in fairly widespread use. You might not care, but it is not the job of the core Wasm spec to put needless stones in the way of environment diversity.

you'd have a Bad Time interoperating with anything if your custom toolchain didn't also use utf8 unless you're a total island

@lukewagner, I indeed expect that Wasm will be used across a variety of "continents" that potentially have little overlap. And where they do you can specify interop (in practice, name encodings are likely gonna be the least problem for sharing modules between different platforms -- it's host libraries). Even total islands are not unrealistic, especially wrt embedded systems (which also tend to have little use for Unicode).

One of the most difficult parts of implementing a non-browser-based WebAssembly engine is making things work the way they do in the browser (mainly the JS parts). I expect that if the encoding doesn't get standardized, we will end up with a de facto standard where everyone copies what is done for the web target. This will just result in it being harder to find information on how to decode these strings.

There may be value in allowing some environments to further restrict the allowed content, but not requiring UTF-8 will just result in more difficulty.

@MI3Guy, the counter proposal is to specify UTF-8 encoding as part of the JS API. So if you are building a JS embedding then it's defined to be UTF-8 either way and makes no difference for you. (However, we also want to allow for other embedder APIs that are neither Web nor JavaScript.)

Right. My point is if you are not doing a JS embedding, you are forced to emulate a lot of what the JS embedder does in order to use the WebAssembly toolchain.

Do varuint for number of codepoints + UTF-8 for each codepoint.

I'd just like to speak out against this option. It complicates things, doesn't and cannot apply to user-specific sections, and provides no benefit that I can see—in order to know the number of codepoints in a UTF-8 string, in practice you always end up scanning the string for invalid encodings, so you might as well count codepoints while you're at it.

Not everything is a subset of UTF-8, e.g. Latin1 is still in fairly widespread use. You might not care, but it is not the job of the core Wasm spec to put needless stones in the way of environment diversity.

Correct; UTF-8 differs from virtually every encoding once you leave the ASCII range. I'm unsure what your point is with this, tho. Actually using the Latin-1 encoding is bad precisely because there are lots of other encodings that look the same but encode different letters. If you tried to use the name "æther" in your Wasm code and encoded it in Latin-1, then when someone else (justifiably) tries to read the name with a UTF-8 toolchain, they'll get a decoding error. Or maybe the other person was making a similar mistake, but used the Windows-1250 encoding instead (intended for Central/Eastern European languages) - they'd get the nonsense word "ćther".

I'm really not sure what kind of "diversity" you're trying to protect here. There is literally no benefit to using any other encoding, and tons of downside. Every character you can encode in another encoding is present in Unicode and can be encoded in UTF-8, but the reverse is almost never true. There are no relevant tools today that can't handle UTF-8; the technology is literally two decades old.

I keep telling you that web standards settled this question years ago, not because Wasm is a web spec that needs to follow web rules, but because text encoding is an ecosystem problem that pretty much everyone has the same problems with, and the web already dealt with the pain of getting this wrong, and has learned how to do it right. There's no virtue in getting it wrong again in Wasm; every environment that has to encode text either goes straight to UTF-8 from the beginning, or makes the same mistakes and suffers the same pain that everyone else does, and then eventually settles on UTF-8. (Or, in rare cases, develops a sufficiently isolated environment that they can standardize on a different encoding, and only rarely pays the price of communicating with the outside environment. But they standardize on an encoding, which is the point of all this.)

So if you are building a JS embedding then it's defined to be UTF-8 either way and makes no difference for you. (However, we also want to allow for other embedder APIs that are neither Web nor JavaScript.)

This issue has nothing to do with the Web or JS. Every part of the ecosystem wants a known, consistent text encoding, and there's a single one that is widely agreed upon across programming environments, countries, and languages: UTF-8.

I vote for 'Do varuint for length (in bytes) + UTF-8 for each byte'. Assuming that's not a controversial choice - pretty much every string implementation stores strings as "number of code units" rather than "number of code points", because it's simpler - then isn't the real question "should validation fail if a string is not valid UTF-8"?

As I pointed out in #970, invalid UTF-8 can be round-tripped to UTF-16, so if invalid UTF-8 is allowed, software that doesn't want to store the original bytes doesn't have to. On the other hand, checking if UTF-8 is valid isn't hard (though we must answer - should overlong sequences be accepted? surrogate characters?)
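
For what the strict check could look like in a JS embedding, here's a minimal sketch leaning on the platform decoder (the WHATWG TextDecoder in fatal mode rejects overlong sequences and encoded surrogates; a hand-rolled validator is also only a few dozen lines):

```typescript
// Sketch: strict UTF-8 validity check via the platform decoder.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    // { fatal: true } makes decode() throw on malformed input, which per the
    // WHATWG Encoding Standard includes overlong sequences and encoded surrogates.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// isValidUtf8(Uint8Array.of(0xc0, 0xaf));       // false: overlong encoding of "/"
// isValidUtf8(Uint8Array.of(0xed, 0xa0, 0x80)); // false: encoded surrogate U+D800
```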

On the whole I'm inclined to say let's mandate UTF-8. In the weird case that someone has bytes they can't translate to UTF-8 (perhaps because the encoding is unknown), arbitrary bytes can be transliterated to UTF-8.

I'm really not sure what kind of "diversity" you're trying to protect here.

@tabatkins, yes, that seems to be the core of the misunderstanding.

It is important to realise that WebAssembly, despite its name, is not limited to the web. We are very cautious to define it in suitable layers, such that each layer is as widely usable as possible.

Most notably, its _core_ is not actually a web technology _at all_. Instead, try to think of it as a _virtual ISA_. Such an abstraction is useful in a broad spectrum of different environments, from very rich (the web) to very rudimentary (embedded systems), that do not necessarily have anything to do with each other, may be largely incompatible, and have conflicting constraints (that Wasm is in no position to change).

As such, it makes no more sense to impose Unicode on _core_ Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

There will, however, be additional spec layers on top of this core spec that define its embedding and API in _concrete_ environments (such as JavaScript). It makes perfect sense to fix string encodings on that level, and by all means, we should.

PS: A slogan that defines the scope of Wasm is that it's an abstraction over common hardware, not an abstraction over common programming languages. And hardware is agnostic to software concerns like string encodings. That's what ABIs are for.

@rossberg-chromium

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

I agree 100%. This issue isn't about Unicode though, it's purely about UTF-8, an encoding for integers, without mandating that the integers be interpreted as Unicode.

I don't understand if we agree on that. Could you clarify: are you OK with UTF-8, and if not why?

@jfbastien, would it be any more productive to require UTF-8 conformance for all C string literals?

As I noted earlier, it makes no sense to me to restrict the encoding but not the character set. That's like defining syntax without semantics. Why would you possibly do that? You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

@jfbastien, would it be any more productive to require UTF-8 conformance for all C string literals?

I don't understand, can you clarify?

As I noted earlier, it makes no sense to me to restrict the encoding but not the character set. That's like defining syntax without semantics. Why would you possibly do that? You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

I think that's the crux of the discussion.

@tabatkins touched on precedents to exactly this:

Again, this looks like you're talking about internationalization libraries. What we're discussing is solely how to decode byte sequences back into strings; that requires just knowledge of how to decode UTF-8, which is extremely trivial and extremely fast.

Unless you're doing human-friendly string manipulation, all you need is the ability to compare strings by codepoint, and possibly sort strings by codepoint, neither of which require any "Unicode support". This is all that existing Web tech uses, for example, and I don't see any reason Wasm environments would, in general, need to do anything more complicated than this.

So I agree: this proposal is, in your words, "defining syntax without semantics". That's a very common thing to do. In fact, WebAssembly's current length + bytes specification already does this!

I'd like to understand what the hurdle is. I don't really see one.

It is important to realise that WebAssembly, despite its name, is not limited to the web.

I just stated in the immediately preceding comment that this has nothing to do with the web. You keep trying to use this argument, and it's really confusing me. What I'm saying has nothing to do with the web; I'm merely pointing to the web's experience as an important example of lessons learned.

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

You're not making the point you think you're making - C does have a built-in encoding, as string literals use the ASCII encoding. (If you want anything else you have to do it by hand by escaping the appropriate byte sequences.) In more current C++ you can have UTF-16 and UTF-8 string literals, and while you can still put arbitrary bytes into the string with \x escapes, the \u escapes at least verify that the value is a valid codepoint.

All of this is required, because there is no inherent mapping from characters to bytes. That's what an encoding does. Again, not having a specified encoding just means that users of the language, when they receive byte sequences from other parties, have to guess at the encoding to turn them back into text.

You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

Can you please point to an environment in existence that uses characters that aren't included in Unicode? You keep trying to defend this position from a theoretical purity / environment diversity standpoint, but literally the entire point of Unicode is to include all of the characters. It's the only character set that can make a remotely credible argument for doing so, and when you're using the Unicode character set, UTF-8 is the preferred universal encoding.

What diversity are you attempting to protect? It would be great to see even a single example. :/

@tabatkins:

It is important to realise that WebAssembly, despite its name, is not limited to the web.

I just stated in the immediately preceding comment that this has nothing to do with the web. You keep trying to use this argument, and it's really confusing me. What I'm saying has nothing to do with the web; I'm merely pointing to the web's experience as an important example of lessons learned.

What I am trying to emphasise is that Wasm should be applicable to as many platforms as possible, modern or not. You keep arguing from the happy end of the spectrum where everything is Unicode and/or UTF-8, and everything else is just deprecated.

You're not making the point you think you're making - C does have a built-in encoding, as string literals use the ASCII encoding. (If you want anything else you have to do it by hand by escaping the appropriate byte sequences.) In more current C++ you can have UTF-16 and UTF-8 string literals, and while you can still put arbitrary bytes into the string with \x escapes, the \u escapes at least verify that the value is a valid codepoint.

No, that is incorrect. The C spec does not require ASCII. It does not even require compatibility with ASCII. It allows almost arbitrary "source character sets", and string literals can contain any character from the full set. There are no constraints regarding encoding; it is entirely implementation-defined. There have been implementations of C running on EBCDIC platforms, and that is still supported by the current standard. GCC can process sources in any iconv encoding (of which there are about 140 besides UTF-8), e.g. UTF-16, which is popular in Asia. C++ is no different.

(That should also answer @jfbastien's question.)

All of this is required, because there is no inherent mapping from characters to bytes. That's what an encoding does. Again, not having a specified encoding just means that users of the language, when they receive byte sequences from other parties, have to guess at the encoding to turn them back into text.

Again: this _will_ be suitably specified per environment. When somebody receives a Wasm module from somebody else operating in the same ecosystem then there is no problem. No JS dev will ever need to care.

If, however, somebody is receiving a module from _another ecosystem_ then there are plenty of other sources of incompatibility to worry about, e.g. expectations about API, built-in libraries, etc. Both parties will need to be explicit about their interop assumptions anyway. Agreeing on a name encoding is gonna be the least of their problems.

You gain zero in terms of interop but still erect artificial hurdles for environments that do not use UTF-8 (which only Unicode environments do anyway).

Can you please point to an environment in existence that uses characters that aren't included in Unicode? You keep trying to defend this position from a theoretical purity / environment diversity standpoint, but literally the entire point of Unicode is to include all of the characters. It's the only character set that can make a remotely credible argument for doing so, and when you're using the Unicode character set, UTF-8 is the preferred universal encoding.

What diversity are you attempting to protect? It would be great to see even a single example. :/

For example, here is a list of embedded OSes: https://en.wikipedia.org/wiki/Category:Embedded_operating_systems
Some of them likely use UTF-8, some won't. Some may find a use for Wasm, most probably won't. But there is no benefit for us in making it less convenient for them.

One entry from that list that you're probably still familiar with is DOS. As much as we all like it to die, DOS systems are still lively, and they use OEM code pages.

@jfbastien:

So I agree: this proposal is, in your words, "defining syntax without semantics". That's a very common thing to do. In fact, WebAssembly's current length + bytes specification already does this!

The rare occurrences of such a thing that I am aware of all have to do with providing an escape hatch for implementation-specific behaviour. That's also the only reasonable use case. That makes no sense here, though. If you want to provide such an escape hatch for strings, then why bother requiring UTF-8, instead of allowing any byte string "syntax"? That's syntax without semantics as a disabler, not an enabler.

I'd like to understand what the hurdle is. I don't really see one.

That some clients cannot simply use all byte values but have to go through redundant UTF encodings that have no use in their ecosystem. That all tools in their tool chains will have to bother with it as well. That it creates additional error cases (out-of-range values) that wouldn't otherwise exist for them.

Let me ask the other way round: what is the benefit (in their ecosystems)? I don't really see one.

@tabatkins
Want to make sure I understand where the dividing line lies.
To be clear, you're suggesting ONLY utf-8 encoding of code points, regardless of whether they're invalid in combination (that can be done in 10 lines of code).
Bold caps could for instance be used in the spec to indicate: You're doing something wrong if you think you need an internationalization library to implement Wasm?

Goals of this would be:

  • Ensure any valid wasm that ends up on the web can at least display tofu characters for invalid stuff.
  • Encourage tools that generate wasm (even in contexts outside the web) to prefer unicode over other encodings when they need to go beyond ascii. (A soft bump in this direction as full validation doesn't happen).

Questions?

  • Is there any danger this becomes a creeping requirement for more validation? I think my core concern in this space would be that it will always be an unreasonable burden to swallow, say, ICU as a dependency.
  • I assume this implies the goal of actively encouraging encodings like Latin1 that clash with UTF-8? I.e. toolchains that emit it would be non-compliant, implementations that accept it similarly so.

  • I grok the web has historically had trouble unifying this space due to overlapping use of bits from regions that previously were encoding islands. On the other hand, my impression is that UTF-8 sets up things such that the costs of the transition are disproportionately born by non-ASCII folks, and that some regions have more bake in. I would imagine the unicode transition is a practical inevitability (and nearly complete). Is there some centralized doc / entity we can point at to that addresses how some of the political and regional issues around unicode have been resolved on the web?

@rossberg-chromium

  • I see the logical inconsistency in validating some aspects of an encoding but not others. On the other hand, my impression is utf8 is pervasive at this point (and that a small nudge in tools + validation has low cost). Is your main discomfort with adding bare utf-8 validation to the spec the inconsistency, or something else?

To be clear you're suggesting ONLY utf-8 encoding of code points regardless of if they're invalid in combination (that can be done in 10 lines of code).

Yes, tho I don't believe there are any invalid combinations; there are just some individual codepoints (the ones reserved for UTF-16 surrogates) that are technically invalid to encode as UTF-8. That said, if full byte control is desirable, the WTF-8 encoding does exist, but we should be very explicit about "yes, we want to allow these strings to actually contain arbitrary non-string data in them sometimes" as a goal if we go that way. The WTF-8 (and WTF-16) format is only intended to provide a formal spec for environments that have backwards-compat constraints on enforcing UTF-* well-formedness.

Bold caps could for instance be used in the spec to indicate: You're doing something wrong if you think you need an internationalization library to implement Wasm?

Yes, i18n isn't required in any way, shape, or form. CSS defaults to UTF-8, for example, and just does raw codepoint comparison/sorting when it allows things outside the ASCII range. No reason for Wasm to go any further than this, either.

Is there any danger this becomes a creeping requirement for more validation? I think my core concern in this space would be it will always be an unreasonable burden to swallow say ICU as a dependency.

The web platform has never needed to impose additional validation on bare names so far. My experience suggests it will never be necessary.

I assume this implies the goal of actively [dis -ed]couraging encodings like Latin1 that clash with UTF-8? I.e. toolchains that emit it would be non-compliant, implementations that accept it similarly so.

Yes, with the change to "discouraging" in your words. ^_^ The whole point is that producers and consumers can reliably encode and decode strings to/from byte sequences without having to guess at what the other endpoint is doing. This has been a horrible pain for every environment that has ever encountered it, and there's a widely-adopted solution for it now.

I grok the web has historically had trouble unifying this space due to overlapping use of bits from regions that previously were encoding islands. On the other hand, my impression is that UTF-8 sets up things such that the costs of the transition are disproportionately born by non-ASCII folks, and that some regions have more bake in. I would imagine the unicode transition is a practical inevitability (and nearly complete). Is there some centralized doc / entity we can point at to that addresses how some of the political and regional issues around unicode have been resolved on the web?

Yes, it definitely had issues in the transition; HTML is still required to default to Latin-1 due to back-compat, and there are still some small pockets of web content that prefer a language-specific encoding (mostly Shift-JIS, a Japanese-language encoding). But the vast majority of the world switched over the last two decades, and the transition is considered more or less complete now.

The "UTF-8 burdens non-ASCII folks" has been a pernicious, but almost entirely untrue, rumor for a long time. Most European languages include the majority of the ASCII alphabet in the first place, so most of their text is single-byte sequences and ends up smaller than UTF-16. The same applies to writing systems like Pinyin. CJK langs mostly occupy the 3-byte UTF-8 region, but they also include large amounts of ASCII characters, particularly in markup languages or programming languages, so also, in general, see either smaller or similar encoded sizes for UTF-8 as for UTF-16 or their specialized encodings.

It's only for large amounts of raw text in CJK or non-ASCII alphabets such as Cyrillic that we see UTF-8 actually take up more space than a specialized encoding. These were concerns, however, in the early 90s, when hard drive capacity was measured in megabytes and a slight blow-up in text file sizes was actually capable of being significant. This hasn't been a concern for nearly 20 years; the size difference is utterly inconsequential now.

Wrt to "the Unicode transition", that has already happened pretty universally. A text format that doesn't require itself to be encoded with UTF-8 these days is making a terrible, ahistoric mistake.

I'm not sure of any specific document that outlines this stuff, but I'll bet they exist somewhere. ^_^

If the goal is to keep the binary spec as pure as possible, let's remove names entirely. All its internal references are based on index, anyway.

Instead, add a mandatory custom section to the JavaScript specification that requires UTF-8. Other environments, such as the Soviet-era mainframe that @rossberg-chromium is alluding to, can define their own custom section. A single WASM file could support both platforms by providing both custom sections. It would be relatively straightforward for custom tooling to generate an obscure platform's missing section by converting a more popular one.

If the goal is to keep the binary spec as pure as possible, let's remove names entirely. All its internal references are based on index, anyway.

That's a rework of how import / export works. It's not on the table and should be suggested in a different issue than this one.

@bradnelson, AFAICS, prescribing a specific encoding but no character set combines the worst of both worlds: it imposes costs in terms of restrictions, complexity, and overhead with no actual benefit in terms of interop. I guess I'm still confused what the point would be.

@rossberg-chromium The primary benefit being sought here is to relieve tools and libraries from the burden of guessing.

Since that's the goal, any of the variants being discussed above (UTF-8 vs. WTF-8 etc.) would be better than nothing, because even in the worst case, "I'm positive I can't transcode these bytes literally" is better than "these bytes look like they might be windows-1252; maybe I'll try that". Guessing is known to be error-prone.

@sunfishcode, how? I'm still lost.

So here is a concrete scenario. Suppose we are on different platforms and I am trying to pass you a module. Suppose for the sake of argument that my platform uses EBCDIC and yours ASCII. Totally legit under the current proposal. Yet, my module will be completely useless to you and your tool chain.

Both these encodings are 7 bit, so UTF-8 doesn't even enter the picture.

So what would UTF-8 bring to the table? Well, I could "decode" any unknown string I get. But for all I know, the result is _just another opaque binary blob_ of 31 bit values. It doesn't provide any information. I have no idea how to relate it to my own strings.

So, then, why would I even bother to decode an unknown string? Well, _I wouldn't_! I could just as well work with the original binary blob of 8 bit values and save space and cycles. The spec would still require me to spend cycles to vacuously validate the encoding, though.

Considering all that, what would (core) Wasm or tools gain by adopting this particular proposal?

AFAICS, prescribing a specific encoding but no character set combines the worst of both worlds: it imposes costs in terms of restrictions, complexity, and overhead with no actual benefit in terms of interop. I guess I'm still confused what the point would be.

We're definitely imposing a character set - the Unicode character set. JF was phrasing things very confusingly earlier, pay no attention. That doesn't mean we need to add checks to Wasm to actually enforce this; decoders are typically robust enough to deal with invalid characters. (The web, for example, typically just replaces them with U+FFFD REPLACEMENT CHARACTER.)

So here is a concrete scenario. Suppose we are on different platforms and I am trying to pass you a module. Suppose for the sake of argument that my platform uses EBCDIC and yours ASCII. Totally legit under the current proposal. Yet, my module will be completely useless to you and your tool chain.

You need to stop pretending multi-decades old systems are not only relevant, but so relevant that they justify making decisions that go against everything we've learned about encoding pain over those same multiple decades. You're helping no one with this insistence that Web Assembly contort itself to maximize convenience when chattering with ancient mainframes, while ignoring the benefit from everyone else in the world being able to communicate textual data reliably. You're just going to hurt the language and make 99.9% (as a very conservative estimate) of users' lives harder.

Many different systems went thru all of this mess. The encoding wars were not fun; they wasted a lot of money and a lot of time and resulted in a lot of corrupted text. We finished those wars, tho. Unicode was created, and promulgated, and became the dominant character set across the entire world, to the point that all other character sets are literally nothing more than historical curiosities at this point. We still have low-level simmering fights over whether to use UTF-16 vs UTF-8, but at least those two are usually easy to tell apart (look at the BOM, or look for a preponderance of null bytes), and overall UTF-8 dominates handily.

Your insistence on encoding freedom ignores all of this history, all the lessons learned in the two decades since Unicode was introduced. It ignores all the experience and expertise that have gone into designing modern systems, which have had the effect of making encoding issues invisible to most users, because systems can count on everything being encoded in a particular way. You are going to create serious, pernicious, expensive problems if you persist in this, one mojibake at a time.

@rossberg-chromium

So here is a concrete scenario. Suppose we are on different platforms and I am trying to pass you a module. Suppose for the sake of argument that my platform uses EBCDIC and yours ASCII. Totally legit under the current proposal. Yet, my module will be completely useless to you and your tool chain.

So what would UTF-8 bring to the table? Well, I could "decode" any unknown string I get. But for all I know, the result is just another opaque binary blob of 31 bit values. It doesn't provide any information. I have no idea how to relate it to my own strings.

UTF-8 would tell you exactly how to relate it to your own strings. That's exactly the problem that it solves. (WTF-8 would too when it can, and it would tell you unambiguously when it can't.)

Do you mean an arbitrary data structure mangled into string form and then encoded as UTF-8? It's true that you wouldn't be able to demangle it, but you could at least unambiguously display the mangled name as a string, which is an improvement over not having anything for some use cases.

Do you mean the discussion above about using UTF-8 as an encoding of opaque integers and not Unicode? I think the discussion has gotten somewhat confused. It's tempting to call encoding "syntax" and internationalization "semantics", but that obscures a useful distinction: UTF-8 can still say that a certain byte sequence means "Ö" without saying what consumers have to do with that information. Used in this way, it is an encoding of Unicode, but it doesn't require the kind of cost that "Unicode Support" has been used to suggest above.

So, then, why would I even bother to decode an unknown string? Well, I wouldn't! I could just as well work with the original binary blob of 8 bit values and save space and cycles. The spec would still require me to spend cycles to vacuously validate the encoding, though.

I've now built a SpiderMonkey with full UTF-8 validation of wasm import/export identifiers, including overlong and surrogates. I was unable to detect a performance difference in WebAssembly.validate, either on AngryBots, or on a small emscripten-compiled testcase that nonetheless has 30 imports.

The spec is a compromise between multiple concerns. I appreciate the concern of startup time, so I've now conducted some experiments and measured it. I encourage others to do their own experiments.

Further, UTF-8 isn't the only Unicode encoding, and it can be used to encode non-Unicode integers. So, UTF-8 isn't Unicode.

Which integers can UTF-8 encode that are not part of Unicode (i.e., outside the range U+0000 to U+10FFFF)? That statement seems false.

If you don't validate your characters, you can encode any 21-bit integer.

Not quite sure why we wouldn't validate...

@flagxor https://encoding.spec.whatwg.org/ describes the various encodings exposed to the web. Note that none of them go outside the Unicode character set, but they're obviously not all byte-compatible with each other.

What would "validation" do? Make your wasm program invalid? I don't think there's any actual consequences that can be reasonably imposed.

Like, using an invalid escape in CSS just puts a U+FFFD into your stylesheet, it doesn't do anything weird.

@annevk:

Further, UTF-8 isn't the only Unicode encoding, and it can be used to encode non-Unicode integers. So, UTF-8 isn't Unicode.

Which integers can UTF-8 encode that are not part of Unicode (i.e., outside the range U+0000 to U+10FFFF)? That statement seems false.

At a minimum: U+FFFE and U+FFFF are noncharacters in Unicode. The codepoints (the integer values) will never be used by Unicode to encode characters, but they can be encoded in UTF-8.

They are still Unicode code points though. I wouldn't focus too much on "characters".

@tabatkins decoding to U+FFFD is reasonable, but that limits the number of integers you can get.

As such, it makes no more sense to impose Unicode on core Wasm than it would, say, to impose Unicode on all string literals in the C programming language. You'd only coerce some potential clients into violating this bit of the standard. What's the gain?

You might take note that C11 added char16_t and char32_t types as well as a u prefix for UTF-16-encoded string literals, a U prefix for UCS-4-encoded string literals, and a u8 prefix for UTF-8 encoded string literals. I didn't dig quite deep enough to find their rationale for adding them, but I assume "dealing with Unicode in standard C/C++ is a nightmare" is at least part of the motivation.

@tabatkins, @sunfishcode, okay, so you are not talking about the same thing. But AFAICT @jfbastien has been stating explicitly and repeatedly that his proposal is about specifying UTF-8 without the Unicode character set.

That also is the only interpretation under which the claim of low cost holds up.

Because if we actually _do_ assume that UTF-8 implies Unicode then this requirement certainly is much more expensive than just UTF-8 encoding/decoding for any tool on any system that does not yet happen to talk (a subset of) Unicode -- they'd need to include a full transcoding layer.

@tabatkins, core Wasm will be embedded in pre-existing systems -- sometimes for other reasons than portability -- that it has no power to change or impose anything on. If they face the problems you describe then those exist independent of Wasm. _We_ cannot fix _their_ problems.

The likely outcome of _trying_ to impose Unicode on all of them would be that some potential ones will simply violate that part of the specification, rendering it entirely moot (or worse, they'll disregard Wasm altogether).

If OTOH we specify it at an adequate layer then we don't run that risk -- without losing anything in practice.

Because if we actually do assume that UTF-8 implies Unicode then this requirement certainly is much more expensive than just UTF-8 encoding/decoding for any tool on any system that does not yet happen to talk (a subset of) Unicode -- they'd need to include a full transcoding layer.

What platforms exist that use a native character set that's not Unicode, not ASCII, have no facilities for converting those characters to/from Unicode, and would need to use non-ASCII identifiers in Wasm? (I mean really exist, not some hypothetical Russian organization that decides to use Wasm in DOS.)

@rocallahan I believe @rossberg-chromium is concerned (or at least I would be) with devices like embedded systems, which would not want the added cost of a full ICU library. They would either be forced to accept bloat, not do full validation, or not accept wasm files containing non-ascii characters (which they might not have control over).

Also, strictly speaking, such devices often include hardware that do have non-standard character sets like:
https://www.crystalfontz.com/product/cfah1602dyyhet-16x2-character-lcd?kw=&origin=pla#datasheets
https://www.crystalfontz.com/products/document/1078/CFAH1602DYYHET_v2.1.pdf
(Which has a goofy mixed ascii + latin1 + japanese character set)
But the concern is what are you obliged to validate, which is relevant regardless.

@tabatkins, though, has indicated (I thought) that the intent is:

  • Mandate UTF-8 + Unicode as the only "correct" interpretation of the bytes
  • Explicitly state that the Unicode does not have to be validated for the module to validate (to save the cost)

I believe @rossberg-chromium is concerned (or at least I would be) with devices like embedded systems, which would not want the added cost of a full ICU library. They would either be forced to accept bloat, not do full validation, or not accept wasm files containing non-ascii characters (which they might not have control over).

As repeatedly stated, this is a red herring. There is no need to do anything remotely ICU related; the web definitely doesn't do so. Please stop spreading this incorrect information.

"Full validation" is an extremely trivial operation, done automatically as part of a conforming UTF-8 decode operation.

In chatting with @tabatkins, one thing that I think is crucial to be clear on here: a conforming Unicode decoder is REQUIRED to allow arbitrary combinations of modifiers, unallocated code points, etc. So a stray mix of modifiers, even though it doesn't render to something sensible, is required to be permitted by Unicode. A decoder that rejected nonsense combinations would be non-compliant.

So the requirement to properly UTF-8 decode is crisply scoped to be something you can do in a handful of lines of code, is an exact operation, and is essentially equivalent to specifying a unicode + utf-8 interpretation of the bytes.

Yes. Parsing UTF-8 is extremely trivial; the only complications are the handful of codepoints that you aren't allowed to encode in UTF-8, which a compliant decoder will instead parse as one or more U+FFFD characters.

But that's an operation for the endpoint to do. Wasm doesn't have to concern itself with any of this; compliant decoders can handle any arbitrary bit-pattern you throw at them. (They'll just decide most of a garbage bit-pattern is U+FFFD characters.) All I've been asking for, this whole time, is for an author-level conformance requirement that these strings be encoded with UTF-8. If you violate that, your toolchain can flag it as an error, but there's nothing that Wasm itself needs to do.

This is similar to, for example, CSS defining a grammar for what constitutes a valid stylesheet, but still technically accepting any arbitrary pattern of bits.

Also, strictly speaking, such devices often include hardware that do have non-standard character sets like:

The existence of such character sets is irrelevant to Wasm unless you expect people to write Wasm identifiers in the (non-ASCII ranges of) them.

Right, all "use UTF-8" means is https://encoding.spec.whatwg.org/#utf-8-decoder. ICU is not even close to a requirement.
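
That decoder is what JS already exposes as TextDecoder, so for a JS embedding the whole "use UTF-8" story for display purposes is roughly this sketch (the function name is just illustrative):

```typescript
// Sketch: lenient decode of a name for display. The default (non-fatal) mode of
// the WHATWG decoder replaces malformed sequences with U+FFFD rather than failing.
function nameForDisplay(bytes: Uint8Array): string {
  return new TextDecoder("utf-8").decode(bytes);
}

// nameForDisplay(Uint8Array.of(0x68, 0x69, 0xff)) === "hi\uFFFD"
```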

In chatting with @tabatkins, one thing that I think is crucial to be clear on here: a conforming Unicode decoder is REQUIRED to allow arbitrary combinations of modifiers, unallocated code points, etc. So a stray mix of modifiers, even though it doesn't render to something sensible, is required to be permitted by Unicode. A decoder that rejected nonsense combinations would be non-compliant.

So the requirement to properly UTF-8 decode is crisply scoped to be something you can do in a handful of lines of code, is an exact operation, and is essentially equivalent to specifying a unicode + utf-8 interpretation of the bytes.

To clarify what I said: I don't dispute that full ICU probably wouldn't be necessary (although e.g. sorting names by code points sounds like bad usability).

However, the claim that only trivial decoding remains is not correct either, because it doesn't stop with validation. Non-Unicode platforms would be forced to perform transcoding to actually handle their strings. Moreover, they would have to deal with the problem of characters that cannot be mapped (in either direction), so you'd still have compatibility issues in general, just kicked the can down the road.

Also, strictly speaking, such devices often include hardware that do have non-standard character sets like:

The existence of such character sets is irrelevant to Wasm unless you expect people to write Wasm identifiers in the (non-ASCII ranges of) them.

@rocallahan, they still must be able to take in arbitrary Unicode. But what would they do with it? If a Wasm implementation on such a platform restricted to ASCII then it would be violating the proposed spec. (I'd also consider that implying that somebody's non-ASCII characters are irrelevant a priori may be culturally questionable. That should be theirs to decide.)

Moreover, they would have to deal with the problem of characters that cannot be mapped (in either direction), so you'd still have compatibility issues in general, just kicked the can down the road.

Is this a theoretical concern?

And if it's a reasonable concern, we must once again weigh the (occurrence * cost) of dealing with that against the cost of virtually every other user of Wasm in the world not being able to depend on an encoding, and having to deal with the same encoding-hell the web platform had to go thru, and eventually fixed as well as it could.

Non-Unicode platforms would be forced to perform transcoding to actually handle their strings.

In what cases do Wasm strings need to interoperate with platform strings, though? As far as I can tell we're only talking about the encoding of strings in the Wasm metadata, not the encoding of strings manipulated by actual module code. (If that's wrong, I apologize...) Then I can only think of a few possible cases where interop/transcoding might be required:

  • A Wasm module imports a platform identifier
  • The platform imports a Wasm identifier
  • You want to extract Wasm names and print them or save them using platform strings, e.g. to dump a stack trace.

Right?

For hypothetical non-Unicode embedded systems, for the first two cases, the advice is simple: limit identifiers imported across the platform boundary to ASCII, then the required transcoding is trivial. Wasm modules could still use full Unicode names internally and for linking to each other.

For the third issue --- if you have a closed world of Wasm modules, you can limit their identifiers to ASCII. If not, then in practice you'll encounter UTF8 identifiers and you'd better be able to transcode them, and you'll be glad the spec mandated UTF8!

implying that somebody's non-ASCII characters are irrelevant a priori

That is a straw-man argument. The position here is "if you want non-ASCII identifiers, use Unicode or implement transcoding to/from Unicode", and it has not attracted criticism as "culturally questionable" in other specs, AFAIK.

And if it's a reasonable concern, we must once again weigh the (occurrence * cost) of dealing with that against the cost of virtually every other user of Wasm in the world not being able to depend on an encoding, and having to deal with the same encoding-hell the web platform had to go thru, and eventually fixed as well as it could.

@tabatkins, no, again (and somehow I feel like I have repeated this 100 times already): every embedding spec _will_ specify an encoding and character set. On every platform you can rely on this. You'd only ever run into encoding questions if you tried to interoperate between two unrelated ecosystems -- which will already be incompatible for deeper reasons than strings. And this would only affect interop with platforms you'd otherwise exclude entirely. So you _do not lose anything_ but win the ability to use Wasm on more diverse platforms.

You are software engineers. As such I assume you understand and appreciate the value of modularisation and layering, to separate concerns and maximise reuse. That applies to specs as well.

Non-Unicode platforms would be forced to perform transcoding to actually handle their strings.

In what cases do Wasm strings need to interoperate with platform strings, though? As far as I can tell we're only talking about the encoding of strings in the Wasm metadata, not the encoding of strings manipulated by actual module code. (If that's wrong, I apologize...) Then I can only think of a few possible cases where interop/transcoding might be required:

  • A Wasm module imports a platform identifier
  • The platform imports a Wasm identifier
  • You want to extract Wasm names and print them or save them using platform strings, e.g. to dump a stack trace.

Right?

Yes. In other words, every time you actually need to _use_ a string.

For hypothetical non-Unicode embedded systems, for the first two cases, the advice is simple: limit identifiers imported across the platform boundary to ASCII, then the required transcoding is trivial. Wasm modules could still use full Unicode names internally and for linking to each other.

For the third issue --- if you have a closed world of Wasm modules, you can limit their identifiers to ASCII. If not, then in practice you'll encounter UTF8 identifiers and you'd better be able to transcode them, and you'll be glad the spec mandated UTF8!

Under the proposal you would not be allowed to limit anything to ASCII! To allow that, the core spec would need to be more permissive. So you are making my point.

every embedding spec _will_ specify an encoding and character set. On every platform you can rely on this. You'd only ever run into encoding questions if you tried to interoperate between two unrelated ecosystems -- which will already be incompatible for deeper reasons than strings.

What about Wasm processing tools such as disassemblers? Wouldn't it be valuable to be able to write a disassembler that works with any Wasm module regardless of "embedding spec" variants?

Under the proposal you would not be allowed to limit anything to ASCII!

Under the proposal, Wasm modules would not be limited to ASCII, but if an implementer chose to make all the identifiers they define outside Wasm modules ASCII-only (e.g. as pretty much all system libraries actually do!), that would be outside the scope of the Wasm spec.

If an implementer chose to print only ASCII characters in a stack trace and replace all non-ASCII Unicode characters with ? or similar, that has to be allowed by the spec, since in practice there always exist Unicode characters you don't have a font for anyway.

Having said all that, defining a subset of Wasm in which all Wasm names are ASCII would be fairly harmless since such Wasm modules would be processed correctly by tools that treat Wasm names as UTF8.

You are software engineers. As such I assume you understand and appreciate the value of modularisation and layering, to separate concerns and maximise reuse. That applies to specs as well.

Yes, I'm a software engineer. I'm also a spec engineer, so I understand the value of consistency and establishing norms that make the ecosystem work better. Character sets and encodings are one of the subjects where the value of allowing modularization and choice is vastly outweighed by the value of consistency and predictability. We have literal decades of evidence of this. This is why I keep repeating myself: you're ignoring history and the recommendation of many experts, several of whom have shown up in this very thread and many more whose opinions I'm representing, when you insist that we need to allow freedom in this regard.

After reading this whole (long) thread, I think the only way to resolve this discussion is to explicitly specify that the names section we are describing in the binary format, and are enhancing in https://github.com/WebAssembly/design/pull/984, uses UTF-8 encoding, and I would propose that we simply call that section "utf8-names". That makes the encoding explicit, and almost certainly all tools that want to manipulate WASM binaries on all relevant platforms today want to speak UTF-8 anyway. They could be forgiven for speaking only UTF-8.

I am sensitive to @rossberg-chromium's concerns for other platforms, and to some extent, I agree. However, this is easily fixable. As someone suggested earlier in the thread, those systems are more than welcome to add a non-standard "ascii-names" section or any other encoding that their ecosystem uses. With explicit names, it becomes obvious which tools work with which sections. For modules that only work on DOS, this would become obvious from the presence of DOS-specific sections. IMO it would be a disaster to interpret these binaries' names as having a different encoding.

(By the way, this is informed from war stories about a system that accidentally lost the encodings of the strings for user-uploaded content, and could never recover them. The system died a horrific, spasmic death. Literally, millions of dollars were lost.)

We could even adopt a naming standard for names sections (heh), so that they are all "\

@titzer Yeah, custom sections are the solution here for exotic or specialized platforms that want nothing to do with UTF8. I'd be hesitant to prescribe it in the spec, though: if a platform is so specific in its mode of operation that it can't even be bothered to map UTF-8 code points to their native preference, they may want to do a lot more with custom sections than just supply names in their preferred encoding.

I recommend putting a greater emphasis on using custom sections for platform-specific details in the spec, and let the platform's own specifications define those details. Common WASM toolchains could support them via some kind of plug-in architecture.
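(A rough sketch of what that kind of name-based dispatch could look like, assuming the MVP binary layout in which a custom section has id 0 and its payload begins with a LEB128-length-prefixed name; the helper names are hypothetical and error handling is omitted.)

```typescript
// Sketch only: route each custom section to a handler keyed by its name.
// Assumes a well-formed module: 8-byte header (magic + version), then sections
// of the form `id: byte, payload_len: varuint32, payload`; a custom section
// (id 0) starts its payload with `name_len: varuint32` followed by UTF-8 name bytes.
type SectionHandler = (payload: Uint8Array) => void;

function readVarU32(bytes: Uint8Array, pos: number): [value: number, next: number] {
  let result = 0;
  let shift = 0;
  for (;;) {
    const b = bytes[pos++];
    result |= (b & 0x7f) << shift;
    if ((b & 0x80) === 0) return [result >>> 0, pos];
    shift += 7;
  }
}

function dispatchCustomSections(module: Uint8Array, handlers: Map<string, SectionHandler>): void {
  const utf8 = new TextDecoder("utf-8", { fatal: true });
  let pos = 8; // skip the magic number and version
  while (pos < module.length) {
    const id = module[pos];
    const [payloadLen, payloadStart] = readVarU32(module, pos + 1);
    const end = payloadStart + payloadLen;
    if (id === 0) {
      const [nameLen, nameStart] = readVarU32(module, payloadStart);
      const name = utf8.decode(module.subarray(nameStart, nameStart + nameLen));
      handlers.get(name)?.(module.subarray(nameStart + nameLen, end)); // e.g. "utf8-names"
    }
    pos = end; // skip to the next section either way
  }
}
```

A UTF-8-aware tool would register a handler for "utf8-names", while an exotic embedding could register its own platform-specific section names without the two interfering.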

@titzer Switching to utf8-names sounds fine. As a bonus, it would smooth the transition since browsers could easily support both "names" (in the old format) and "utf8-names" (in the #984 format) for a release or two before dropping "names" which in turn removes a lot of urgency to get this deployed.

Sorry if this was already decided on above but, to be clear: is there any proposed change to the import/export names from what's in BinaryEncoding.md now?

utf8-names sounds fine.

Same question as @lukewagner on import/export.

@lukewagner @jfbastien Good question. I didn't see a decision above. I think above all we don't want to change the binary format from what we have now. So it's really just whatever mental contortions we have to go through to convince ourselves what we did is rational :-)

AFAICT we currently assume that strings in import/exports are uninterpreted sequences of bytes. That's fine. I think it's reasonable to consider the encoding of strings used for import/export to be solely defined by the embedder in a way that the names section is not; e.g. the JS embedding always uses UTF-8. The names section comes with an explicit encoding in the name of the names section.

Short version: the encoding of names in import/export declarations is a property of the embedding environment, while the encoding of names in the names section is made explicit by the string used to identify the user section (e.g. "utf8-names").

WDYT?

That's fine with me and matches what we had before #984 merged (modulo names=>utf8-names).

I think the names section isn't as important as import/export, which are where the true compatibility issues occur:

  • Load a mojibaked names section and you get funky Error.stack and debugging.
  • Load a mojibaked import/export and nothing works.

I don't think this is truly a binary format change since the embeddings we all implement already assume this.

I'd lean on the recommendation of people who know better than I do about this topic before closing.

You'll need to decide on how you decode UTF-8. Do you replace erroneous sequences with U+FFFD or halt on the first error? That is, you either want https://encoding.spec.whatwg.org/#utf-8-decode-without-bom or https://encoding.spec.whatwg.org/#utf-8-decode-without-bom-or-fail. Either way loading will likely fail, unless the resource happened to use U+FFFD in its name.

The way it's currently described, we throw an exception if the import/export name byte array fails to decode as UTF-8 into a JS string. After that, you have a JS string, and import lookup is defined in terms of Get.
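(For concreteness, those two options map directly onto the WHATWG TextDecoder API; the byte sequence below is invented for illustration.)

```typescript
const nameBytes = new Uint8Array([0x66, 0x6f, 0x6f, 0xff]); // "foo" plus one invalid byte

// utf-8-decode-without-bom-or-fail: reject the name (and hence the import/module).
// ignoreBOM: true keeps a leading BOM as content rather than stripping it, matching
// the "without BOM" variants; fatal: true makes malformed input throw.
const strict = new TextDecoder("utf-8", { fatal: true, ignoreBOM: true });
try {
  const name = strict.decode(nameBytes); // throws TypeError on malformed UTF-8
  console.log("import/export name:", name);
} catch {
  console.log("malformed UTF-8 name: reject at validation time");
}

// utf-8-decode-without-bom: malformed sequences silently become U+FFFD, so a
// later lookup of the intended name would just fail to match.
const lossy = new TextDecoder("utf-8", { ignoreBOM: true }).decode(nameBytes);
console.log(lossy === "foo\uFFFD"); // true
```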

To check my understanding, if we did https://encoding.spec.whatwg.org/#utf-8-decode-without-bom-or-fail, would that mean that, after successful validation, checking for codepoint-sequence equality would be equivalent to checking for byte-sequence equality?

Yes.
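(In other words, once two names are known to be valid UTF-8, an engine can implement codepoint-sequence equality as a raw byte comparison; a minimal sketch with a hypothetical helper name.)

```typescript
// For names that have already passed UTF-8 validation, byte equality and
// code point equality coincide: every code point has exactly one well-formed
// UTF-8 encoding (no overlong forms), so decoding is injective.
function namesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}
```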

After the discussion above, I support validating UTF-8 for import/export names in the core spec.

Specifically, this would be utf-8-decode-without-bom-or-fail, and codepoint-sequence equality (so engines can do byte-sequence equality), so engines would avoid the scary and expensive parts of Unicode and internationalization. And, this is consistent with the Web embedding. I've experimented with this and found the main overhead negligible.
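(To illustrate how small that obligation is, here is a validation-only sketch in the spirit of utf-8-decode-without-bom-or-fail; the function name is made up and this is not the spec algorithm, just an equivalent check. It rejects overlong forms, surrogate code points, and values above U+10FFFF without building a string or touching any other part of Unicode.)

```typescript
// Returns true iff `bytes` is well-formed UTF-8: no overlong encodings,
// no surrogate code points (U+D800..U+DFFF), nothing above U+10FFFF.
function isWellFormedUtf8(bytes: Uint8Array): boolean {
  let i = 0;
  while (i < bytes.length) {
    const b0 = bytes[i];
    if (b0 < 0x80) { i += 1; continue; }              // 1-byte (ASCII)
    let extra = 0;                                    // number of continuation bytes
    let lo = 0x80, hi = 0xbf;                         // allowed range of the first continuation
    if (b0 >= 0xc2 && b0 <= 0xdf) { extra = 1; }      // 2-byte sequence
    else if (b0 === 0xe0) { extra = 2; lo = 0xa0; }   // 3-byte, exclude overlongs
    else if (b0 >= 0xe1 && b0 <= 0xec) { extra = 2; }
    else if (b0 === 0xed) { extra = 2; hi = 0x9f; }   // exclude surrogates
    else if (b0 === 0xee || b0 === 0xef) { extra = 2; }
    else if (b0 === 0xf0) { extra = 3; lo = 0x90; }   // 4-byte, exclude overlongs
    else if (b0 >= 0xf1 && b0 <= 0xf3) { extra = 3; }
    else if (b0 === 0xf4) { extra = 3; hi = 0x8f; }   // cap at U+10FFFF
    else { return false; }                            // 0x80..0xc1, 0xf5..0xff: invalid lead
    if (i + extra >= bytes.length) return false;      // truncated sequence
    if (bytes[i + 1] < lo || bytes[i + 1] > hi) return false;
    for (let k = 2; k <= extra; k++) {
      if (bytes[i + k] < 0x80 || bytes[i + k] > 0xbf) return false;
    }
    i += 1 + extra;
  }
  return true;
}
```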

  • Re: Hardware ISAs are agnostic to encoding: The hardware we're talking about here doesn't have imports/exports as such, so the analogy doesn't directly apply. The one place I'm aware of where such hardware uses byte-sequence identifiers of any kind, x86's cpuid, does specify a specific character encoding: UTF-8.

  • Re: Layering: As software engineers, we also know that layering and modularisation are means, not ends in themselves. For example, we could cleanly factor out LEB128 from the core spec. That would provide greater layering and modularisation. LEB128 is arguably biased toward Web use cases.

  • Re: "Embedded systems": An example given is DOS, but what would be an example of something that a UTF-8 requirement for import/export names would require a DOS system to do that would be expensive or impractical for it to do?

  • Re: Islands: WebAssembly also specifies a specific endianness, requires floating-point support, 8-bit address units, and makes other choices, even though there are real settings where those would be needless burdens. WebAssembly makes choices like those when it expects they'll strengthen the common platform that many people can share.

  • Re: Arbitrary data structures in import/export names: this is theoretically useful, but it can also be done via mangling data into strings (a sketch of one such scheme follows after this list). Mangling is less convenient, but not difficult. So there's a tradeoff there, but not a big one (and arguably, if there's a general need for attaching metadata to imports/exports, it'd be nicer to have an explicit mechanism than saddling identifiers with additional purposes).

  • Re: Binary compatibility: I also agree with JF that this change is still feasible. utf-8-decode-without-bom-or-fail would mean no silent behavior changes, and at this time, all known wasm producers keep their output compatible with the Web embedding (even if they also support other embeddings), so they're already staying within UTF-8.
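(To make the mangling point concrete, here is one hypothetical scheme, invented purely for this sketch and not any existing ABI: metadata is appended to the plain name behind a reserved separator, and a consumer that doesn't care can ignore everything after the first separator.)

```typescript
// Hypothetical mangling scheme for illustration only: "name$key=value$key=value".
// The '$' separator and the key/value syntax are invented for this sketch.
function mangle(name: string, meta: Record<string, string>): string {
  const parts = Object.entries(meta).map(([k, v]) => `${k}=${v}`);
  return [name, ...parts].join("$");
}

function demangle(mangled: string): { name: string; meta: Record<string, string> } {
  const [name, ...parts] = mangled.split("$");
  const meta: Record<string, string> = {};
  for (const part of parts) {
    const eq = part.indexOf("=");
    if (eq >= 0) meta[part.slice(0, eq)] = part.slice(eq + 1);
  }
  return { name, meta };
}

// mangle("memcpy", { abi: "c", version: "2" }) === "memcpy$abi=c$version=2"
```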

A PR making a specific proposal for UTF-8 names is now posted as https://github.com/WebAssembly/design/issues/1016.

With #1016, this is now fixed.
