Pegjs: Full Unicode support, namely for codepoints outside the BMP

Created on 22 Oct 2018  ·  15 Comments  ·  Source: pegjs/pegjs

Issue type

  • Bug Report: yes
  • Feature Request: kinda
  • Question: no
  • Not an issue: no

Prerequisites

  • Can you reproduce the issue?: yes
  • Did you search the repository issues?: yes
  • Did you check the forums?: yes
  • Did you perform a web search (google, yahoo, etc)?: yes

Description

JavaScript is, without some custom boilerplate, unable to properly deal with Unicode characters/codepoints outside the BMP, i.e., ones whose encoding requires more than 16 bits.

This limitation seems to carry over to PEG.js, as shown in the example below.

In particular, I'd like to be able to specify ranges such as [\u1D400-\u1D419] (which presently turns into [ᵀ0-ᵁ9]) or, equivalently, [𝐀-𝐙] (which throws an "Invalid character range" error). (Using the newer ES6 notation [\u{1D400}-\u{1D419}] results in the following error: SyntaxError: Expected "!", "$", "&", "(", ".", character class, comment, end of line, identifier, literal, or whitespace but "[" found.)
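
For illustration (plain JavaScript, not PEG.js): a \uXXXX escape takes exactly four hex digits, so \u1D400 is read as U+1D40 followed by a literal "0", which is where the [ᵀ0-ᵁ9] range above comes from.

console.log('\u1D400');                      // 'ᵀ0' - U+1D40 plus the character '0'
console.log('\u{1D400}');                    // '𝐀'  - the ES6 code point escape
console.log('\u{1D400}' === '\uD835\uDC00'); // true - stored as a UTF-16 surrogate pair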

Might there be a way to make this work that does not require changes to PEG.js?

Steps to Reproduce

  1. Generate a parser from the grammar given below.
  2. Use it to try to parse something ostensibly-conforming.

Example code:

This grammar:

//MathUpper = [𝐀-𝐙]+
MathUpperEscaped = [\u1D400-\u1D419]+

Expected behavior:

The parser generated from the given grammar successfully parses, for example, "𝐀𝐁𝐂".

Actual behavior:

A parse error: Line 1, column 1: Expected [ᵀ0-ᵁ9] but " (Or, when uncommenting the other rule, an "Invalid character range" error.)

Software

  • PEG.js: 0.10.0
  • Node.js: Not applicable.
  • NPM or Yarn: Not applicable.
  • Browser: All browsers I've tested.
  • OS: macOS Mojave.
  • Editor: All editors I've tested.
feature need-help task

All 15 comments

To be entirely honest, aside from updating Unicode support for the PEG.js grammar parser and the JavaScript example, I have little to no knowledge about Unicode, so I am currently unable to fix this problem (it's clearly stated in both of the grammars: _Non-BMP characters are completely ignored_).

For now, while working on more urgent personal and work-related projects (including _PEG.js 0.x_), I'll continue waiting for someone who understands Unicode better to offer some PRs 😆, or eventually get around to it after _PEG.js v1_. Sorry, buddy.

FYI, surrogate pairs appear to work. The grammar
start = result:[\uD83D\uDCA9]+ {return result.join('')}
parses 💩, which is U+1F4A9. Note that the result.join('') puts the surrogate pair back together; otherwise you get ['\uD83D','\uDCA9'] instead of hankey. Ranges would be problematic.
More on surrogate pairs: https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

This by no means replaces what the OP asked for.

@drewnolan Thanks for the heads up 👍

Unfortunately that grammar also parses \uD83D\uD83D.
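
A hedged sketch of a narrower workaround, using stock PEG.js 0.10.0: a string literal matches the exact two-code-unit sequence, so a lone or doubled high surrogate no longer slips through.

const peg = require('pegjs');
const parser = peg.generate("start = pile:'\\uD83D\\uDCA9'+ { return pile.join(''); }");
console.log(parser.parse('💩💩')); // '💩💩'
// parser.parse('\uD83D\uD83D');   // now throws a SyntaxError instead of matching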

For others who have run into this issue: I'm lucky in that I only need to handle a small subset of codepoints outside the BMP, so I ended up mapping them into the BMP's private use area prior to parsing and inverting this mapping right afterwards.

This solution is obviously fraught with problems in the general case, but it works well in my problem domain.
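
For anyone wanting to try the same trick, here is a minimal sketch of such a mapping; the code point choices (U+1D400-U+1D419 into the private use area at U+E000) are only an example, and it is of course unsafe if the input may already contain PUA characters.

// Map the astral range into single-code-unit PUA characters before parsing...
const BASE = 0x1D400, LAST = 0x1D419, PUA = 0xE000;

function toPUA(text) {
  return [...text].map(ch => {
    const cp = ch.codePointAt(0);
    return cp >= BASE && cp <= LAST ? String.fromCharCode(PUA + (cp - BASE)) : ch;
  }).join('');
}

// ...and invert the mapping on the parser's output afterwards.
function fromPUA(text) {
  return [...text].map(ch => {
    const cu = ch.charCodeAt(0);
    return cu >= PUA && cu <= PUA + (LAST - BASE) ? String.fromCodePoint(BASE + (cu - PUA)) : ch;
  }).join('');
}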

@futagoza - I'll do my best to explain. You are facing several problems here.

  1. Unicode has encodings, notations, and ranges.

    1. Here, the things that matter are "ranges" - which characters are supported - and "notations" - how they are written.

    2. These change over time; Unicode periodically adds new characters, as when it added emoji or musical notes.

  2. Unicode has "notations" (encoding forms). These are things like UTF-16, UCS-4, et cetera: the ways that the code points, which are the intended data, are encoded as bytes. UTF-16, for example, encodes most letters as single two-byte code units, but uses pairs of code units (surrogate pairs) to express higher-value characters, up to 0x10FFFF.

    1. Key understanding: a single 16-bit code unit isn't high enough. Lots of interesting stuff - emoji, big chunks of historical Chinese, and this post's base question (the mathematical alphabet characters) - sits above that line.

      1. ISO and the Unicode Consortium have been clear that this is not going to change: if you want to handle higher characters as single units, use a wider encoding than UTF-16.

    2. Key understanding #2: Fucking JavaScript is formally UTF-16.

      1. This means that there exist Unicode characters (lots of them) that a single JavaScript string code unit cannot represent.

      2. OP is asking you to fix this.

      3. This is possible but not easy - you'd have to implement the Unicode parsing algorithm, which is famously a rat's nest.

I want this too, but realistically, this is not going to happen.
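
To see the UTF-16 point from item 2 above in plain JavaScript (nothing PEG.js-specific here):

const s = '𝐀';                               // U+1D400
console.log(s.length);                        // 2 - two UTF-16 code units
console.log(s.charCodeAt(0).toString(16));    // 'd835' - high surrogate
console.log(s.charCodeAt(1).toString(16));    // 'dc00' - low surrogate
console.log(s.codePointAt(0).toString(16));   // '1d400' - the actual code point (ES2015)
console.log([...s].length);                   // 1 - string iteration walks code points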

holy shit, someone committed a full string parser replacement almost a year ago, and they recognized the overhead, so they let us use standard JS strings usually

WHY ISN'T THAT MERGED

@StoneCypher I love the fire in your heart! But why give the current maintainer a hard time? No one is owed anything. Why not maintain your own fork?

There is no current maintainer. The person who took PEG over has never released anything. He worked on the next minor for three years, then said he didn't like how it looked, is throwing all of peg.js away, and starting over from something he wrote from scratch in a different language, with an incompatible AST.

The tool has lost half its userbase waiting three years for this man to commit one-line fixes that other people wrote, like ES6 module support, TypeScript support, arrow support, extended Unicode, et cetera.

There are a dozen people asking him to merge and he keeps saying "no, this is my hobby project now and I don't like what it is"

Lots of people have companies based on this parser. They're completely screwed.

This man promised to be a maintainer for an extremely important tool, and hasn't done any maintenance. It's time to let someone else keep this library in good standing now.

Why not maintain your own fork?

I have for three years now. My peg has almost a third of the issue tracker fixed.

I had to clone it, rename it, and make a new fork to fix the size problem to try to commit it, because mine has drifted too much

It's time for everyone else to receive these fixes, as well as the ones that have been sitting in the tracker since 2017.

This guy isn't maintaining peg; he's letting it die.

It's time for change.

@drewnolan - so, I'm not sure whether this is interesting or not, but surrogate pairs don't actually work. It's just that, by coincidence, they usually do.

In order to understand the underlying problem you have to think about the encoding level bit pattern, not the representation level one.

That is, if you have a unicode value of 240, most people will think "Oh, he means 0b1111 0000." But actually, that's not how Unicode represents 240; over 127 is represented by two bytes, because the top bit is a flag, not a value bit. So 240 in Unicode is actually 0b0000 0001 0111 0000 in storage (except in utf-7, which is real and not a typo, where things get super weird. And yes, I know Wikipedia says it's not in use. Wikipedia is wrong. It's what SMS is sent over; it might be the most common character encoding by total traffic.)

So here's the problem.

If you write a byte STUV WXYZ, then in utf16, from ucs4 data, if your thing gets cut in half, pretty often you can just staple it back together.

One time in 128 you cannot, for characters natively over two bytes' encoding. (Sounds like an awfully specific number, doesn't it?)

Because when that top bit in the second byte position is in use, cutting it in half will add a zero where that should have been a one. Stapling them back together side by side as binary data doesn't remove the value concept again. The decoded value is not equivalent to the encoded value, and you're appending decodings, not encodings.

It so happens that most emoji are outside this range. However, big chunks of several languages aren't, including rare chinese, most math, and musical symbols.

Granted, your approach is good enough for almost all emoji, and for every human language common enough to get in by Unicode 6, which is a huge improvement over the status quo.

But this PR really should be merged, once it's sufficiently tested both for correctness and against unexpected performance problems (remember, Unicode performance problems are why PHP died).

It seems the . (dot character) expression also needs Unicode mode. Compare:

const string = '-🐎-👱-';

// With the `u` flag, the regex dot matches whole code points:
const symbols = string.match(/./gu);
console.log(JSON.stringify(symbols, null, '  '));

// PEG.js's dot rule, by contrast, consumes a single UTF-16 code unit,
// so each astral character comes back as two surrogate halves:
const pegResult = require('pegjs')
                 .generate('root = .+')
                 .parse(string);
console.log(JSON.stringify(pegResult, null, '  '));

Output:

[
  "-",
  "🐎",
  "-",
  "👱",
  "-"
]
[
  "-",
  "\ud83d",
  "\udc0e",
  "-",
  "\ud83d",
  "\udc71",
  "-"
]

I recently worked on this, using #616 as a basis and modifying it to use the ES6 syntax \u{hhhhhh}; I will create a PR in a few hours.

Computing the UTF-16 regex ranges split by surrogates is a bit complicated, so I used https://github.com/mathiasbynens/regenerate for this; it would be the first dependency of the pegjs package, which I hope is acceptable (there are also polyfills for Unicode properties which could be added as dependencies, see #648). See Wikipedia if you don't know about UTF-16 surrogates.
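
As a sketch of what regenerate does here (using its documented addRange/toString API; the actual integration in my branch is more involved):

const regenerate = require('regenerate');

// Split the astral range U+1D400..U+1D419 into surrogate-pair alternatives
// usable in a plain (non-`u`) JavaScript regex:
const set = regenerate().addRange(0x1D400, 0x1D419);
console.log(set.toString());
// => something like '\uD835[\uDC00-\uDC19]' - a high surrogate followed by a low-surrogate range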

To make PEG.js compatible with the whole of Unicode, there are various levels:

  1. Add a syntax to encode Unicode characters above the BMP, fixed by #616 or my ES6-syntax version,
  2. Recognise constant strings containing them, which follows directly from the previous point,
  3. Fix the SyntaxError reporting so that it displays the real Unicode character, which may be 1 or 2 code units,
  4. Compute accurately the regex classes for BMP and/or astral code points - this alone is not enough, see the next point,
  5. Manage the cursor increment, because a regex class can now match 1 code unit, 2 code units, or 1-or-2 depending on the input (see details below),
  6. Implement the dot rule . so that it can capture 1 or 2 code units.

For most points we can manage to stay backward-compatible and generate parsers very similar to the older ones, except for point 5, because the result of a parse can depend on whether the dot rule captures one or two code units. For this I propose adding a runtime option to let the user choose between two or three behaviours:

  1. The dot rule only captures a BMP code unit,
  2. The dot rule captures a Unicode code point (1 or 2 code units),
  3. The dot rule captures a Unicode code point or a lone surrogate.

Regex classes can be analysed statically during parser generation to check whether they have a fixed length (in number of code units). There are 3 cases: 1. only BMP/single-code-unit characters, 2. only two-code-unit characters, or 3. one-or-two code units depending on the input. For now the bytecode assumes a regex class is always one code unit (see here). With the static analysis, we could set this parameter of the bytecode instruction to 1 or 2 for the first two cases. But for the third case, I guess a new bytecode instruction should be added to obtain, at runtime, the number of code units matched and advance the cursor accordingly. Other options without a new bytecode instruction would be: 1. to always compute the number of code units matched, but this is a performance penalty during parsing for BMP-only parsers, so I don't like this option; or 2. to check whether the current code unit is a high surrogate followed by a low surrogate and increment by 1 or 2 accordingly, but this assumes the grammar always contains well-formed UTF-16 surrogates, with no possibility of writing grammars with lone surrogates (see the next point), and it is also a performance penalty for BMP-only parsers.
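
To make the third case concrete, here is a minimal sketch of the runtime check such a bytecode instruction would perform (illustrative only, not the actual generated code):

// How many code units does the character at currPos occupy?
function matchedCodeUnits(input, currPos) {
  const first = input.charCodeAt(currPos);
  // A high surrogate followed by a low surrogate is one astral character (2 code units).
  if (first >= 0xD800 && first <= 0xDBFF && currPos + 1 < input.length) {
    const second = input.charCodeAt(currPos + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return 2;
    }
  }
  return 1;
}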

There is the question of lone surrogates (a high surrogate without a low surrogate after it, or a low surrogate without a high surrogate before it). My opinion is that a regex class should be exclusively one or the other: either lone surrogates, or well-formed UTF-16 Unicode characters (BMP, or a high surrogate followed by a low surrogate). Otherwise there is the danger that grammar authors unaware of UTF-16 subtleties mix both and do not understand the result; grammar authors who want to manage UTF-16 surrogates themselves can do so with PEG rules describing the links between specific high and low surrogates. I propose to add a visitor enforcing this rule during parser generation.

To conclude, it is probably easier to manage the question of lone surrogates in PEG than in regexes, because a PEG parser always advances: either the next code unit is recognised or it is not, in contrast to regexes, where backtracking could associate or disassociate a high surrogate with a low surrogate and consequently change the number of Unicode characters matched, etc.

The PR for the ES6 syntax for astral Unicode characters is #651, based on #616, and the development for classes is https://github.com/Seb35/pegjs/commit/0d33a7a4e13b0ac7c55a9cfaadc16fc0a5dd5f0c, implementing points 2 and 3 in my comment above, with only a quick hack for the cursor increment (point 4) and nothing for now for the dot rule . (point 5).

My current development on this issue is mostly finished; the more advanced work is in https://github.com/Seb35/pegjs/tree/dev-astral-classes-final. All five points mentioned above are treated, and the global behaviour tries to mimic JS regexes regarding the edge cases (and there are a lot of them).

The global behaviour is governed by the option unicode, similar to the unicode flag in JS regexes: the cursor is advanced by 1 Unicode character (1 or 2 code units) depending on the actual text (e.g. [^a] matches the text "💯" and the cursor is advanced by 2 code units). When the option unicode is false, the cursor is always advanced by 1 code unit.

Regarding the input, I'm not sure whether we should design PEG.js the same way as JS regexes: should we allow [\u{1F4AD}-\u{1F4AF}] (equivalent to [\uD83D\uDCAD-\uD83D\uDCAF]) in the grammar in non-Unicode mode? We can make a distinction between "input Unicode" and "output Unicode":

  • input Unicode is about allowing all Unicode characters in character classes (each computed internally as either a fixed 2-code-unit or a fixed 1-code-unit match);
  • output Unicode is about the cursor increment of the resulting parser: 1 code unit or 1 Unicode character for the 'dot' and 'inverted character class' rules – the only rules where the characters are not explicitly listed and where we need a decision from the grammar author.

Personally, I guess I would prefer that we allow input Unicode, either permanently or behind an option defaulting to true, since there is no significant overhead and this would enable the feature for everyone by default; but output Unicode should remain false by default, because the performance of the generated parsers is better (always a cursor increment of 1).

Regarding this issue in general (and about output Unicode defaulting to false), we should keep in mind that it is already possible to encode Unicode characters in our grammars, at the price of understanding how UTF-16 works:

// rule matching [\u{1F4AD}-\u{1F4AF}]
my_class = "\uD83D" [\uDCAD-\uDCAF]

// rule matching any Unicode character
my_strict_unicode_dot_rule = $( [\u0000-\uD7FF\uE000-\uFFFF] / [\uD800-\uDBFF] [\uDC00-\uDFFF] )

// rule matching any Unicode character or a lone surrogate
my_loose_unicode_dot_rule = $( [\uD800-\uDBFF] [\uDC00-\uDFFF]? / [\u0000-\uFFFF] )

So a grammar author who wants both a quick parser and the ability to recognise Unicode characters in specific parts of her/his grammar can use such rules. Consequently, this issue is mostly about simplifying Unicode management without diving into UTF-16 internals.
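
For example, a hedged usage sketch with stock PEG.js 0.10.0, using the strict dot rule above (the $( ... ) operator returns the matched text, so a surrogate pair comes back as one string):

const peg = require('pegjs');
const parser = peg.generate(
  "start = my_strict_unicode_dot_rule+\n" +
  "my_strict_unicode_dot_rule = $( [\\u0000-\\uD7FF\\uE000-\\uFFFF] / [\\uD800-\\uDBFF] [\\uDC00-\\uDFFF] )"
);
console.log(parser.parse('-🐎-👱-')); // [ '-', '🐎', '-', '👱', '-' ]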


About the implementation: in my first attempt I considered that the grammar text was encoded as Unicode characters and the 'dot' rule of the PEG.js parser recognised Unicode characters. The second and final attempt reverted this (the dot rule is always 1 code unit, for quicker parsing), and there is a small algorithm in the visitor prepare-unicode-classes.js to reconstruct split Unicode characters in character classes (e.g. [\uD83D\uDCAD-\uD83D\uDCAF] is syntactically recognised as [ "\uD83D", [ "\uDCAD", "\uD83D" ], "\uDCAF" ] and this algorithm transforms it into [ [ "\uD83D\uDCAD", "\uD83D\uDCAF" ] ]). I considered writing this in the grammar itself, but it would have been long and, more importantly, there are multiple ways to encode the characters ("💯", "\uD83D\uDCAF", "\u{1F4AF}"), so it is easier to do it in a visitor.
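
As a toy illustration of that reconstruction (not the actual prepare-unicode-classes.js, and ignoring classes containing a literal '-'), assuming class parts are either single code units or [start, end] ranges as in the example above:

// Re-pair high and low surrogates that the grammar parser split across class parts.
function rejoinSurrogates(parts) {
  // Flatten to a stream of code units, keeping '-' as a range marker.
  const units = [];
  for (const p of parts) {
    if (Array.isArray(p)) { units.push(p[0], '-', p[1]); } else { units.push(p); }
  }
  // Glue every high surrogate to the low surrogate that follows it.
  const glued = [];
  for (let i = 0; i < units.length; i++) {
    const u = units[i], next = units[i + 1] || '';
    if (u !== '-' && /^[\uD800-\uDBFF]$/.test(u) && /^[\uDC00-\uDFFF]$/.test(next)) {
      glued.push(u + next);
      i++;
    } else {
      glued.push(u);
    }
  }
  // Fold "a - b" triples back into ranges.
  const result = [];
  for (let i = 0; i < glued.length; i++) {
    if (glued[i + 1] === '-') { result.push([glued[i], glued[i + 2]]); i += 2; } else { result.push(glued[i]); }
  }
  return result;
}

// rejoinSurrogates([ '\uD83D', [ '\uDCAD', '\uD83D' ], '\uDCAF' ])
// => [ [ '\uD83D\uDCAD', '\uD83D\uDCAF' ] ]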

I added two opcodes in the second attempt:

  • MATCH_ASTRAL, similar to MATCH_ANY but matching a whole Unicode character: (input.charCodeAt(currPos) & 0xFC00) === 0xD800 && input.length > currPos + 1 && (input.charCodeAt(currPos+1) & 0xFC00) === 0xDC00
  • MATCH_CLASS2, very similar to MATCH_CLASS but testing the next two code units instead of just one: classes[c].test(input.substring(currPos, currPos+2))

Then, depending on whether we matched a 2-code-unit or a 1-code-unit character, the cursor is advanced by 1 or 2 code units with the opcode ACCEPT_N, and the character classes are split into two regexes of fixed length (1 or 2 code units).

I did some optimisations with the "match" feature, eliminating during generation the "dead code" paths depending on the mode (Unicode or not) and on the character class.

Note also that the regexes are always positive in this implementation: inverted classes invert the result in the bytecode instead. This made it easier to avoid edge cases around surrogates.

I wrote some documentation, but I will probably add more (perhaps a guide quickly explaining the details of the unicode option(s) and the snippets with the home-made Unicode dot rules). I will add tests before submitting it as a PR.
