pegjs 🚀 - Support parsing of indentation-based languages

It's very dangerous to rely on side effects that you add in custom handlers to parse indentation-based grammars. Just don't do it. Pegjs would have to add some ability to push and pop conditional state in order to make parsing indentation (and other context-sensitive grammars) safe.

This is what I do for now, and I recommend you do this: Preprocess the input file and insert your own indent/outdent tokens. I use {{{{ and }}}} respectively. Then your grammar is context free and can be parsed normally. It may mess up your line/column values, but you can correct those in a postprocessor.
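A minimal sketch of such a preprocessor follows. The function name `insertIndentTokens` is hypothetical, and the sketch assumes that indentation uses spaces only and that blank lines never open or close a block. Each token is emitted on its own line, which is exactly why line numbers shift afterwards.

```javascript
// Hypothetical helper: rewrites leading indentation into explicit tokens.
// Assumes space-only indentation; blank lines are passed through untouched.
function insertIndentTokens(source, open = "{{{{", close = "}}}}") {
  const stack = [0];            // indentation widths of the enclosing blocks
  const out = [];
  for (const line of source.split("\n")) {
    if (line.trim() === "") {   // blank lines never open or close a block
      out.push(line);
      continue;
    }
    const width = line.length - line.trimStart().length;
    if (width > stack[stack.length - 1]) {
      stack.push(width);        // deeper than the current block: open one
      out.push(open);
    }
    while (width < stack[stack.length - 1]) {
      stack.pop();              // shallower: close blocks until we fit
      out.push(close);
    }
    if (width !== stack[stack.length - 1]) {
      throw new Error("inconsistent dedent: " + JSON.stringify(line));
    }
    out.push(line.trimStart());
  }
  while (stack.length > 1) {    // close blocks still open at end of input
    stack.pop();
    out.push(close);
  }
  return out.join("\n");
}
```

The output contains no significant whitespace, so a grammar can match the two tokens as ordinary literals.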

krisnye on 11 Feb 2014

If you aren't needing to target javascript, Pegasus, my pegjs clone for C#, has support for pushing/popping state. Here's a wiki article on how to do exactly what you want: https://github.com/otac0n/Pegasus/wiki/Significant-Whitespace-Parsing

I would like to propose that pegjs use my syntax as a starting point for state-based parsing.

otac0n on 11 Feb 2014

The ability to safely push and pop state is nice. I would use that if it was Javascript based. Just not worth it to integrate a CLR just for parsing.

krisnye on 11 Feb 2014

That's what I figured. I think, in that case, that I should probably try to back-port my improvements into pegjs.

However, I don't necessarily want to do that without having a conversation with @dmajda.

otac0n on 12 Feb 2014

@otac0n It's nice. I don't write C#. JavaScript is much better for me.

jiyinyiyong on 12 Feb 2014

Indentation-based languages are important. I want to look at simplifying their parsing after 1.0.0.

dmajda on 21 Apr 2014

I think this problem is best solved by allowing state in general, just like Pegasus does and as suggested in #285. Here is an idea (the following is Pegasus’ significant whitespace grammar translated to pegjs and with my syntax idea added):

{var indentation = 0}

program
  = s:statements eof { return s }

statements
  = line+

line
  = INDENTATION s:statement { return s }

statement
  = s:simpleStatement eol { return s }
  / "if" _ n:name _? ":" eol INDENT !"bar " s:statements UNDENT {
      return { condition: n, statements: s }
    }
  / "def" _ n:name _? ":" eol INDENT s:statements UNDENT {
      return { name: n, statements: s }
    }

simpleStatement
  = a:name _? "=" _? b:name { return { lValue: a, expression: b } }

name
  = [a-zA-Z] [a-zA-Z0-9]* { return text() }

_ = [ \t]+

eol = _? comment? ("\r\n" / "\n\r" / "\r" / "\n" / eof)

comment = "//" [^\r\n]*

eof = !.

INDENTATION
  = spaces:" "* &{ return spaces.length == indentation }

INDENT
  = #STATE{indentation}{ indentation += 4 }

UNDENT
  = #STATE{indentation}{ indentation -= 4 }

Note the #STATE{indentation} blocks near the bottom (obviously inspired by Pegasus). I call those state blocks. The idea is to allow a state block before actions. Here is a more complicated state block:

#STATE{a, b, arr: {arr.slice()}, obj: {shallowCopy(obj)}, c}

It is shorthand for:

#STATE{a: {a}, b: {b}, arr: {arr.slice()}, obj: {shallowCopy(obj)}, c: {c}}

In other words, after the shorthand expansion has been applied, the contents of a state block is a list of identifier ":" "{" code "}". Adding a state block before an action tells pegjs that this action will modify the identifiers listed, and if the rule is backtracked those identifiers should be reset to the code between the braces.

Here are the compiled functions for INDENT and UNDENT from the grammar above, with resetting of the indentation variable added:

    function peg$parseINDENT() {
      var s0, s1, t0;

      s0 = peg$currPos;
      t0 = indentation;
      s1 = [];
      if (s1 !== peg$FAILED) {
        peg$reportedPos = s0;
        s1 = peg$c41();
      } else {
        indentation = t0;
      }
      s0 = s1;

      return s0;
    }

    function peg$parseUNDENT() {
      var s0, s1, t0;

      s0 = peg$currPos;
      t0 = indentation;
      s1 = [];
      if (s1 !== peg$FAILED) {
        peg$reportedPos = s0;
        s1 = peg$c42();
      } else {
        indentation = t0;
      }
      s0 = s1;

      return s0;
    }

And here’s a bit of how the “complicated state block” from above could be compiled:

s0 = peg$currPos;
t0 = a;
t1 = b;
t2 = arr.slice();
t3 = shallowCopy(obj);
t4 = c;
// ...
if (s1 !== peg$FAILED) {
  // ...
} else {
  peg$currPos = s0;
  a = t0;
  b = t1;
  arr = t2;
  obj = t3;
  c = t4;
}
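To illustrate why the array needs `arr.slice()` rather than a plain assignment, here is a small standalone sketch of the same save-and-restore pattern. It is not PEG.js output: `attempt`, `FAILED`, and the two variables are hypothetical stand-ins for generated code and for state declared in a grammar initializer.

```javascript
// Hypothetical stand-ins for parser state declared in a grammar initializer.
let indentation = 0;
let arr = [];

// Sentinel that plays the role of peg$FAILED in this sketch.
const FAILED = {};

// Runs `action`; if it fails, restores the snapshots taken beforehand.
function attempt(action) {
  const t0 = indentation;   // primitive: copying the value is enough
  const t1 = arr.slice();   // array: copy it, or in-place edits would leak
  const result = action();
  if (result === FAILED) {
    indentation = t0;
    arr = t1;
  }
  return result;
}

// A failing action: both modifications are rolled back.
attempt(() => { indentation += 4; arr.push("x"); return FAILED; });

// A succeeding action: both modifications are kept.
attempt(() => { indentation += 4; arr.push("y"); return "ok"; });
```

Had `t1` been assigned `arr` directly, the `push` in the failing action would have mutated the snapshot as well, and the restore would have had no effect.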

What do you think about this idea of being able to:

1) Tell pegjs about which stateful variables will be modified by an action.
2) Supply the code needed to store those variables if they need to be reset. (Including shorthand syntax for the simple case where the variable is a primitive value.)

And what do you think about the syntax?

Edit: Here’s the proposed syntax grammar (just for fun):

diff --git a/src/parser.pegjs b/src/parser.pegjs
index 08f6c4f..09e079f 100644
--- a/src/parser.pegjs
+++ b/src/parser.pegjs
@@ -116,12 +116,31 @@ ChoiceExpression
     }

 ActionExpression
-  = expression:SequenceExpression code:(__ CodeBlock)? {
+  = expression:SequenceExpression code:((__ StateBlock)? __ CodeBlock)? {
       return code !== null
-        ? { type: "action", expression: expression, code: code[1] }
+        ? {
+            type:       "action",
+            expression: expression,
+            code:       code[2],
+            stateVars:  (code[0] !== null ? code[0][1] : [])
+          }
         : expression;
     }

+StateBlock "state block"
+  = "#STATE{" __ first:StateBlockItem rest:(__ "," __ StateBlockItem)* __ "}" {
+      return buildList(first, rest, 3);
+    }
+
+StateBlockItem
+  = varName:Identifier expression:(__ ":" __ CodeBlock)? {
+      return {
+        type:       "stateVar",
+        name:       varName,
+        expression: expression !== null ? expression[3] : varName
+      };
+    }
+
 SequenceExpression
   = first:LabeledExpression rest:(__ LabeledExpression)* {
       return rest.length > 0

lydell on 15 Feb 2015

Hi guys,
Just to clarify, am I correct that it is better not to use PEG.js (with workarounds from the top of this issue) with indentation-based languages till this issue is closed?
Thanks.

hoho on 8 Nov 2015

@hoho I don't get your meaning. But I later found another way to parse indentation, using a parser-combinator-like approach, and it worked. So my original intention to parse indentation with PEG.js is gone.

jiyinyiyong on 9 Nov 2015

I mean there are workarounds to parse indentation, but the comments say that these workarounds will fail in certain cases.

hoho on 10 Nov 2015

Let me clarify the situation: Parsing indentation-based languages in PEG.js is possible. There are various solutions mentioned above and I just created another one as I tried to get a “feel” for this (it’s a grammar of a simple language with two statements, one of which can contain indented sub-statements — similar to e.g. if in Python).

One thing common to all the solutions is that they need to track the indentation state manually (because PEG.js can’t do that). This means there are two limitations:

1) You can’t compile the grammar with caching safely (because the parser could use cached results instead of executing state-manipulating code).
2) You can’t backtrack across indentation levels (because there is currently no way to unroll the state when backtracking). In other words, you can’t parse a language where there are two valid constructs which can be disambiguated only after a newline and indentation level change.
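The first limitation (caching) can be seen in a miniature model of a memoizing rule function. This is only a sketch: `makeParser` and its result shape are hypothetical and much simpler than what PEG.js actually generates.

```javascript
// Hypothetical miniature of a memoizing rule function, not PEG.js output.
function makeParser({ cache }) {
  let indentation = 0;
  const memo = new Map();       // input position -> cached result

  function INDENT(pos) {
    if (cache && memo.has(pos)) {
      return memo.get(pos);     // cache hit: the action below is skipped
    }
    indentation += 4;           // side effect performed by the action
    const result = { nextPos: pos };
    if (cache) memo.set(pos, result);
    return result;
  }

  return { INDENT, getIndentation: () => indentation };
}

// The same rule is tried twice at the same position, as happens when the
// parser revisits a position after backtracking.
const cached = makeParser({ cache: true });
cached.INDENT(0);
cached.INDENT(0);
console.log(cached.getIndentation());   // 4: the second action never ran

const uncached = makeParser({ cache: false });
uncached.INDENT(0);
uncached.INDENT(0);
console.log(uncached.getIndentation()); // 8: the action ran both times
```

With caching enabled, the tracked state no longer reflects how many times the rule was applied, which is why the manual approaches require caching to be off.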

Limitation 1 can cause performance issues in some cases, but I don’t think there are many languages for which limitation 2 would be a problem.

I’m OK with this state until 1.0.0 and I plan to circle back to this topic sometime afterwards. The first level of improvement could be getting rid of limitation 2 using more explicit state tracking (as suggested above) or by providing a backtracking hook (so that one can unroll the state correctly). The second level could be getting rid of the need to track indentation state manually by providing some declarative way to do so. This could help with limitation 1.

dmajda on 27 Nov 2015

Hi, I wrote a (tiny, hacky) patch for PEG.js that supports proper backtracking, as I explained here: https://github.com/pegjs/pegjs/issues/45

tebbi on 27 Nov 2015

sorry for the bump 😜

I was just looking into creating CSON and YAML parsers for a language I am designing, and while looking for ways to create an indentation-based parser with PEG.js, I came up with a simple method that:

1) doesn't rely on pushing/popping state
2) doesn't assert indentation levels via code within actions

It had occurred to me that either of the above 2 solutions actually adds performance problems to the generated parsers. Additionally in my opinion:

1) relying on state not only adds an ugly syntax to PEG.js but can also limit what types of parsers can be generated, as they would need to support action-based state handling.
2) sometimes adding code in actions results in a language-dependent rule, and for some developers that means they can't use plugins to generate parsers for other languages like C or PHP without resorting to more plugins to handle actions on rules, which just means a bigger build system to support 1 or 2 changes.

After a while I started creating my own variant of the PEG.js parser and thought: why not just use increment (“++”) and decrement (“--”) prefix operators (__++ expression__ and __-- expression__) to handle the results of match expressions (__expression *__ or __expression +__)?

The following is an example grammar based on @dmajda's simple indentation-based language, rewritten to use the new __++ expression__ and __-- expression__ instead of __& { predicate }__:

Start
  = Statements

Statements
  = Statement*

Statement
  = Indent* statement:(S / I) { return statement; }

S
  = "S" EOS {
      return "S";
    }

I
  = "I" EOL ++Indent statements:Statements --Indent { return statements; }
  / "I" EOS { return []; }

Indent "indent"
  = "\t"
  / !__ "  "

__ "white space"
  = " \t"
  / " "

EOS
  = EOL
  / EOF

EOL
  = "\n"

EOF
  = !.

Much more pleasing to the eye, no? Easier to understand too, both for humans and software.

How does it work? Simple:

1) Indent* tells the parser that we want 0 or more of what Indent is returning
2) ++Indent tells the parser to increase the minimum amount of matches required for Indent
3) Now any time the parser is about to return the matches for Indent, it first expects __1 more__ match than before, otherwise _peg$SyntaxError_ gets thrown.
4) --Indent tells the parser to decrease the minimum amount of matches required for Indent
5) Now any time the parser looks for Indent and returns the matches it expects __1 less__ match than before, otherwise _peg$SyntaxError_ gets thrown.
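One possible reading of these steps can be sketched as a hand-written parser for the S / I language. Everything here is an assumption made for illustration: the `parse` function is hypothetical, an indent is a single tab, and a line with too few indents makes the statement fail (so the enclosing Statement* can end the block) rather than throwing immediately. Because the counter is only a minimum, lines indented deeper than required are also accepted.

```javascript
// Hypothetical hand-written parser for the S / I language above.
// `minIndent` plays the role of the counter adjusted by ++Indent / --Indent.
function parse(input) {
  let pos = 0;
  let minIndent = 0;

  // Indent* with the proposed check: needs at least `minIndent` matches.
  function indents() {
    const start = pos;
    while (input[pos] === "\t") pos++;
    if (pos - start < minIndent) { pos = start; return false; }
    return true;
  }

  function eol() {
    if (input[pos] === "\n") { pos++; return true; }
    return false;
  }

  // EOS = EOL / EOF
  function eos() { return eol() || pos === input.length; }

  // Statement = Indent* (S / I); returns null on failure.
  function statement() {
    const start = pos;
    if (indents()) {
      if (input[pos] === "S") {
        pos++;
        if (eos()) return "S";
      } else if (input[pos] === "I") {
        pos++;
        if (eol()) {
          minIndent++;                 // ++Indent
          const body = statements();
          minIndent--;                 // --Indent
          return body;
        }
        if (eos()) return [];          // "I" at end of input
      }
    }
    pos = start;                       // backtrack
    return null;
  }

  // Statements = Statement*
  function statements() {
    const result = [];
    for (;;) {
      const s = statement();
      if (s === null) return result;
      result.push(s);
    }
  }

  const result = statements();
  if (pos !== input.length) {
    throw new SyntaxError("unexpected input at offset " + pos);
  }
  return result;
}

console.log(JSON.stringify(parse("S\nI\n\tS\n\tI\n\t\tS\nS")));
// ["S",["S",["S"]],"S"]
```

Note that the counter is still mutable state that has to be restored on every exit path, which the sketch does by pairing each increment with a decrement.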

This solution is the best way to add support for 'Significant Whitespace Parsing' without adding an ugly syntax to PEG.js grammars or blocking 3rd party generators.

Here are the changed rules to add support for parsing this in _src/parser.pegjs_:

{
  const OPS_TO_PREFIXED_TYPES = {
    "$": "text",
    "&": "simple_and",
    "!": "simple_not",
    "++": "increment_match",
    "--": "decrement_match"
  };
}

PrefixedOperator
  = "$"
  / "&"
  / "!"
  / "++"
  / "--"

SuffixedOperator
  = "?"
  / "*"
  / "+" !"+"

Am I right to assume that to support it compiler/generator side we will have to:

1) add a compiler pass that ensures __++ expression__ or __-- expression__ are only being used on __expression *__ or __expression +__, where __expression__ must be of types: choice, sequence or rule_ref
2) add a cache based check in the generated parser for __expression *__ or __expression +__ that asserts the minimum required match is met before returning the matches
3) optionally add a helper method for generated parsers to implement that returns the number of matches required for a given rule, e.g. nMatches( name: String ): Number

futagoza on 15 Mar 2017

@futagoza, this is clean and clever. I like it. I am working on a parser that handles state, but the only state we really need is indentation levels. I may use this idea and give you credit for it. Tracking the indentation level still effectively requires pushing/popping state and so it may still prevent some optimizations but the semantics of this are very nice.

If you're adding operators to a grammar, I recommend adding the @ prefix operator as well. Its purpose is simply to extract a single rule result out of a sequence. Using that, the sample grammar becomes even cleaner. No more trivial { return x } actions.

Start
  = Statements

Statements
  = Statement*

Statement
  = Indent* @(S / I)

S
  = "S" EOS {
      return "S";
    }

I
  = "I" EOL ++Indent @Statements --Indent
  / "I" EOS { return []; }

Indent "indent"
  = "\t"
  / !__ "  "

__ "white space"
  = " \t"
  / " "

EOS
  = EOL
  / EOF

EOL
  = "\n"

EOF
  = !.
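The effect of the @ operator can be sketched with a few tiny parser combinators. These are hypothetical helpers written for this illustration (`literal`, `pluck`, and `sequence` are not part of PEG.js): a sequence normally yields an array of all element values, but yields only the marked element's value when one is marked.

```javascript
// Hypothetical mini combinators. Each parser maps (input, pos) to
// { value, pos } on success or null on failure.
const literal = (text) => (input, pos) =>
  input.startsWith(text, pos)
    ? { value: text, pos: pos + text.length }
    : null;

// Marks a sequence element as the one whose value should be returned,
// which is the role @ plays in the grammar above.
const pluck = (parser) =>
  Object.assign((input, pos) => parser(input, pos), { plucked: true });

function sequence(...parsers) {
  return (input, pos) => {
    const values = [];
    let picked;
    let hasPick = false;
    for (const p of parsers) {
      const r = p(input, pos);
      if (r === null) return null;      // any failing element fails the sequence
      values.push(r.value);
      if (p.plucked) { picked = r.value; hasPick = true; }
      pos = r.pos;
    }
    // Without a marked element the sequence yields all values as an array.
    return { value: hasPick ? picked : values, pos };
  };
}

const plain = sequence(literal("("), literal("x"), literal(")"));
const plucked = sequence(literal("("), pluck(literal("x")), literal(")"));

console.log(plain("(x)", 0).value);   // [ '(', 'x', ')' ]
console.log(plucked("(x)", 0).value); // x
```

This is the same result a trailing { return x } action would produce, without any target-language code in the grammar.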

@kodyjking what do you think of this?

krisnye on 16 Mar 2017

@futagoza Do you have a fork/branch with the indentation patch enabled and a small sample grammar?