Pegjs: Support parsing of indentation-based languages

Created on 16 Oct 2013  ·  34Comments  ·  Source: pegjs/pegjs

I was using a bunch of indentation-based languages, like CoffeeScript, Jade, and want to create DSLs by myself.
I find some tricks maintaining indentations in pegjs by searching, was was wondering if there's a consistant solution:
http://stackoverflow.com/questions/11659095/parse-indentation-level-with-peg-js
http://stackoverflow.com/questions/4205442/peg-for-python-style-indentation
https://gist.github.com/jakubkulhan/3192844
https://groups.google.com/forum/#!searchin/pegjs/indent/pegjs/RkbAB4rPlfU/xxafrY5wGCEJ
But will pegjs support this feature?

feature

Most helpful comment

It's very dangerous to rely on side-effects that you add in custom handlers to parse indentation based grammars. Just don't do it. Pegjs would have to add some ability to push and pop conditional state in order to make parsing indentations (and other context sensitive grammars) safe.

This is what I do for now, and I recommend you do this: Preprocess the input file and insert your own indent/outdent tokens. I use {{{{ and }}}} respectively. Then your grammar is context free and can be parsed normally. It may mess up your line/column values, but you can correct those in a postprocessor.

All 34 comments

It's very dangerous to rely on side-effects that you add in custom handlers to parse indentation based grammars. Just don't do it. Pegjs would have to add some ability to push and pop conditional state in order to make parsing indentations (and other context sensitive grammars) safe.

This is what I do for now, and I recommend you do this: Preprocess the input file and insert your own indent/outdent tokens. I use {{{{ and }}}} respectively. Then your grammar is context free and can be parsed normally. It may mess up your line/column values, but you can correct those in a postprocessor.

If you aren't needing to target javascript, Pegasus, my pegjs clone for C#, has support for pushing/popping state. Here's a wiki article on how to do exactly what you want: https://github.com/otac0n/Pegasus/wiki/Significant-Whitespace-Parsing

I would like to propose that pegjs use my syntax as a starting point for state-based parsing.

The ability to safely push and pop state is nice. I would use that if it was Javascript based. Just not worth it to integrate a CLR just for parsing.

That's what I figured. I think, in that case, that I should probably try to back-port my improvements into pegjs.

However, I don't necessarily want to do that without having a conversation with @dmajda.

@otac0n It's nice. I don't write C# . JavaScript is much better for me.

Indentation-based languages are important. I want to look at simplifying their parsing after 1.0.0.

I think this problem is best solved by allowing state in general, just like Pegasus does and as suggested in #285. Here is an idea (the following is Pegasus’ significant whitespace grammar translated to pegjs and with my syntax idea added):

{var indentation = 0}

program
  = s:statements eof { return s }

statements
  = line+

line
  = INDENTATION s:statement { return s }

statement
  = s:simpleStatement eol { return s }
  / "if" _ n:name _? ":" eol INDENT !"bar " s:statements UNDENT {
      return { condition: n, statements: s }
    }
  / "def" _ n:name _? ":" eol INDENT s:statements UNDENT {
      return { name: n, statements: s }
    }

simpleStatement
  = a:name _? "=" _? b:name { return { lValue: a, expression: b } }

name
  = [a-zA-Z] [a-zA-Z0-9]* { return text() }

_ = [ \t]+

eol = _? comment? ("\r\n" / "\n\r" / "\r" / "\n" / eof)

comment = "//" [^\r\n]*

eof = !.

INDENTATION
  = spaces:" "* &{ return spaces.length == indentation }

INDENT
  = #STATE{indentation}{ indentation += 4 }

UNDENT
  = #STATE{indentation}{ indentation -= 4 }

Note the #STATE{indentation} blocks near the bottom (obviously inspired by Pegasus). I call those state blocks. The idea is to allow a state block before actions. Here is a more complicated state block:

#STATE{a, b, arr: {arr.slice()}, obj: {shallowCopy(obj)}, c}

It is shorthand for:

#STATE{a: {a}, b: {b}, arr: {arr.slice()}, obj: {shallowCopy(obj)}, c: {c}}

In other words, after the shorthand expansion has been applied, the contents of a state block is a list of identifier ":" "{" code "}". Adding a state block before an action tells pegjs that this action will modify the identifiers listed, and if the rule is backtracked those identifiers should be reset to the code between the braces.

Here are the compiled functions for INDENT and UNDENT from the above grammar above, with resetting of the indentation variable added:

    function peg$parseINDENT() {
      var s0, s1, t0;

      s0 = peg$currPos;
      t0 = indentation;
      s1 = [];
      if (s1 !== peg$FAILED) {
        peg$reportedPos = s0;
        s1 = peg$c41();
      } else {
        indentation = t0;
      }
      s0 = s1;

      return s0;
    }

    function peg$parseUNDENT() {
      var s0, s1, t0;

      s0 = peg$currPos;
      t0 = indentation;
      s1 = [];
      if (s1 !== peg$FAILED) {
        peg$reportedPos = s0;
        s1 = peg$c42();
      } else {
        indentation = t0;
      }
      s0 = s1;

      return s0;
    }

And here’s a bit of how the “complicated state block” from above could be compiled:

s0 = peg$currPos;
t0 = a;
t1 = b;
t2 = arr.slice();
t3 = shallowCopy(obj);
t4 = c;
// ...
if (s1 !== peg$FAILED) {
  // ...
} else {
  peg$currPos = s0;
  a = t0;
  b = t1;
  arr = t2;
  obj = t3;
  c = t4;
}

What do you think about this idea of being able to:

  • Tell pegjs about which stateful variables will be modified by an action.
  • Supply the code needed to store those variables if they need to be reset. (Including shorthand syntax for the simple case where the variable is a primitive value.)

And what do you think about the syntax?

Edit: Here’s the proposed syntax grammar (just for fun):

diff --git a/src/parser.pegjs b/src/parser.pegjs
index 08f6c4f..09e079f 100644
--- a/src/parser.pegjs
+++ b/src/parser.pegjs
@@ -116,12 +116,31 @@ ChoiceExpression
     }

 ActionExpression
-  = expression:SequenceExpression code:(__ CodeBlock)? {
+  = expression:SequenceExpression code:((__ StateBlock)? __ CodeBlock)? {
       return code !== null
-        ? { type: "action", expression: expression, code: code[1] }
+        ? {
+            type:       "action",
+            expression: expression,
+            code:       code[2],
+            stateVars:  (code[0] !== null ? code[0][1] : [])
+          }
         : expression;
     }

+StateBlock "state block"
+  = "#STATE{" __ first:StateBlockItem rest:(__ "," __ StateBlockItem)* __ "}" {
+      return buildList(first, rest, 3);
+    }
+
+StateBlockItem
+  = varName:Identifier expression:(__ ":" __ CodeBlock)? {
+      return {
+        type:       "stateVar",
+        name:       varName,
+        expression: expression !== null ? expression[3] : varName
+      };
+    }
+
 SequenceExpression
   = first:LabeledExpression rest:(__ LabeledExpression)* {
       return rest.length > 0

Hi guys,
Just to clarify, am I correct that it is better not to use PEG.js (with workarounds from the top of this issue) with indentation-based languages till this issue is closed?
Thanks.

@hoho I don't get you meaning.. But I later found another solution to parse indentations with parser combinator like solutions and it worked. And I think my original indention to parse indentations with PEG.js gone.

I mean there are workarounds to parse indentation, but the comments say that these workarounds will fail in some certain cases.

Let me clarify the situation: Parsing indentation-based languages in PEG.js is possible. There are various solutions mentioned above and I just created another one as I tried to get a “feel” for this (it’s a grammar of a simple language with two statements, one of which can contain indented sub-statements — similar to e.g. if in Python).

One thing common to all the solutions is that they need to track the indentation state manually (because PEG.js can’t do that). This means there are two limitations:

  1. You can’t compile the grammar with caching safely (because the parser could use cached results instead of executing state-manipulating code).
  2. You can’t backtrack across indentation levels (because there is currently no way to unroll the state when backtracking). In other words, you can’t parse a language where there are two valid constructs which can be disambiguated only after a newline and indentation level change.

Limitation 1 can cause performance issues in some cases, but I don’t think there are many languages for which limitation 2 would be a problem.

I’m OK with this state until 1.0.0 and I plan to circle back to this topic sometime afterwards. The first level of improvement could be getting rid of limitation 2 using more explicit state tracking (as suggested above) or by providing a backtracking hook (so that one can unroll the state correctly). The second level could be getting rid of the need to track indentation state manually by providing some declarative way to do so. This could help with limitation 1.

H, I wrote a (tiny, hacky) patch for PEG.js that supports proper backtracking, as I explained here: https://github.com/pegjs/pegjs/issues/45

sorry for the bump 😜

I was just looking into creating CSON and YAML parsers for a language I am designing, and while looking on ways to create a indention based parser with PEG.js, I came up with a simple method that:

1) doesn't rely on push/pop state's
2) asserting indention levels via code within actions

It had occurred to me that either of the above 2 solutions actually adds performance problems to the generated parsers. Additionally in my opinion:

1) relying on state's not only adds a ugly PEG.js syntax but also can affect what type of parsers that can be generated as they would need to support action based state handing.
2) sometimes adding some code in actions results in a language dependent rule, and for some developers that means they can't use plugin's to generate parsers for other languages like C or PHP without resorting to more plugin's to handle actions on rules, which just means a bigger build system just to support 1 or 2 changes.

After a while I started about creating my own variant of the PEG.js parser and thought: why not just use increment (“++”) and decrement (“--”) prefix operators (__++ expression__ and __-- expression__) to handle the results of match expressions (__expression *__ or __expression +__).

The following is a example grammar based on @dmajda's Simple intentation-based language, rewritten to use the new __++ expression__ and __-- expression__ instead of __& { predicate }__:

Start
  = Statements

Statements
  = Statement*

Statement
  = Indent* statement:(S / I) { return statement; }

S
  = "S" EOS {
      return "S";
    }

I
  = "I" EOL ++Indent statements:Statements --Indent { return statements; }
  / "I" EOS { return []; }

Indent "indent"
  = "\t"
 / !__ "  "

__ "white space"
 = " \t"
 / " "

EOS
  = EOL
  / EOF

EOL
  = "\n"

EOF
  = !.

Much more pleasing to the eye, no? Easer to understand too, both for humans and software.

How does it work? simple:

1) Indent* tells the parser that we want 0 or more of what Indent is returning
2) ++Indent tells the parser to increase the minimum amount of matches required for Indent
3) Now any time the parser is about to return the matches for Indent, it first expects it to be __1 more__ match then before, otherwise _peg$SyntaxError_ gets thrown.
4) --Indent tells the parser to decrease the minimum amount of matches required for Indent
5) Now any time the parser looks for Indent and returns the matches it expects __1 less__ match then before, otherwise _peg$SyntaxError_ gets thrown.

This solution is the best way to add support for 'Significant Whitespace Parsing' without adding an ugly syntax to PEG.js grammars or blocking 3rd party generators.

Here's the changed rules to add support for parsing this in _src/parser.pegjs_:

{
  const OPS_TO_PREFIXED_TYPES = {
    "$": "text",
    "&": "simple_and",
    "!": "simple_not",
    "++": "increment_match",
    "--": "decrement_match"
  };
}

PrefixedOperator
  = "$"
  / "&"
  / "!"
  / "++"
  / "--"

SuffixedOperator
  = "?"
  / "*"
  / "+" !"+"

Am I right to assume that to support it compiler/generator side we will have to:

1) add a compiler pass that ensures __++ expression__ or __-- expression__ are only being used on __expression *__ or __expression +__, where __expression__ must be of types: choice, sequence or rule_ref
2) add a cache based check in the generated parser for __expression *__ or __expression +__ that asserts the minimum required match is met before returning the matches
3) optionally add a helper method for generated parsers to implement that returns number of matches required for a given rule, eg. nMatches( name: String ): Number

@futagoza, this is clean and clever. I like it. I am working on a parser that handles state, but the only state we really need is indentation levels. I may use this idea and give you credit for it. Tracking the indentation level still effectively requires pushing/popping state and so it may still prevent some optimizations but the semantics of this are very nice.

If you're adding operators to a grammar, I recommend adding the @ prefix operator as well. It's purpose is to simply extract a single rule result out of a sequence. Using that the sample grammar becomes even cleaner. No more trivial { return x } Actions.

Start
  = Statements

Statements
  = Statement*

Statement
  = Indent* @(S / I)

S
  = "S" EOS {
      return "S";
    }

I
  = "I" EOL ++Indent @Statements --Indent
  / "I" EOS { return []; }

Indent "indent"
  = "\t"
 / !__ "  "

__ "white space"
 = " \t"
 / " "

EOS
  = EOL
  / EOF

EOL
  = "\n"

EOF
  = !.

@kodyjking what do you think of this?

@futagoza Do you have a fork/branch with the indentation patch enabled and a small sample grammar?

I'm working on this fork/branch indentation

@krinye "I recommend adding the @ prefix operator as well. It's purpose is to simply extract a single rule result out of a sequence"

Could one of you guy have a look and make and make a comment or PR with the fix. Thanks :)

Readme: fork changes

Ah, didn't notice the caveat:

Am I right to assume that to support it compiler/generator side we will have to:

  • add a compiler pass that ensures ++ expression or -- expression are only being used on expression * or expression +, where expression must be of types: choice, sequence or rule_ref
  • add a cache based check in the generated parser for expression * or expression + that asserts the minimum required match is met before returning the matches
  • optionally add a helper method for generated parsers to implement that returns number of matches required for a given rule, eg. nMatches( name: String ): Number

Just for kicks, I tried adding this in visitor.js

      increment_match: visitExpression,
      decrement_match: visitExpression,

Now I get Invalid opcode: undefined.

@kristianmandrup concerning the @ operator to extract single values from sequences, I have a fork with just that feature added to PegJS available here:

https://github.com/krisnye/pegjs

It's a pretty simple addition.

@krisnye +1 for the grammar based implementation, nice and simple. If you don't mind, I will be adding this to my variant of PEG.js 😄

@kristianmandrup see you commit for my suggestion

@futagoza Please do.

I was discussing the indentation logic with a coworker and we suggest the following syntactic elements

// increments the named state variable
identifier++
// decrements the named state variable
identifier--

// repeat constant amount or state variable (with default zero if variable not yet incremented)
rule{integer | identifier}
// repeat min/max using constants or state variables
rule{integer | identifier, integer | identifier}

The parser we are working on can handle arbitrary state, but honestly, the above is all that is needed for indentation parsing.

Thanks a lot guys! If its so easy to do, why not make a "dedicated" fork where the things you mention "just works" ;)
Cheers!

@krisnye

  1. Using __identifier++__ can easily cause a messy error if the developer meant __identifier+__, that's why I chose to go with __++identifier__ and, for consistency, __--identifier__
  2. As mentioned in a different issue about ranges, rule{ STATE_REPEAT / RANGE } can be confused with rule{ ACTION }, especially if you are building a syntax highlighter for PEG.js, so this approach has been denied by @dmajda

@kristianmandrup _(OFF TOPIC)_ I am good at designing features, and sometimes making implementations, but horrible at testing and benchmarking them, so usually I make working variants (without any tests or benchmarks) in private non-repo directories on my pc, and then forget about them 🤣. For my current variant of PEG.js (named ePEG.js 😝, an extended rewrite of PEG.js) I am adding the stuff mentioned here, as well as other features (ranges, imports, templates, etc), so I am adding tests and benchmarks, but I am also currently working on a C++ project which is taking my time, so there's no ETA on that.

@futagoza Thanks mate :) Looking at the feature extensions, but doesn't mention indentation support. Is that included but undocumented or coming down the road?

Someone else from this list pointed me at other parser builder/generator solutions that I might look into as well. Keep me posted! Cheers!

@kristianmandrup It's not included as far as I can tell, but @dmajda said 3 years ago that he will look into it after releasing PEG.js v1, but from what I can tell that wont be for another 2 years, unless he plans to release more minor versions of PEG.js v0 (_0.12_, _0.13_, _etc_)

I meant if you had included indentation support in ePEG already or on the roadmap?

@kristianmandrup oh 😆, its on the roadmap. I haven't updated the ePEG.js repo for a while, and just recently decided to turn it into a complete rewrite of PEG.js instead of a plugin.

@futagoza agreed on the ++/-- as pre operations, and I forgot about action syntax, we changed it to [min,max]

So

++identifier
--identifier
rule[integer | identifier]
rule[integer | identifier, integer | identifier]

@krisnye [ ... ] is used for character classes, see https://github.com/pegjs/pegjs#characters

What I'm doing in ePEG.js is adding ranges (also on the roadmap) to achieve what I think your describing:

space = [ \t]*
rule = range|expression
  • __expression__ can be any of these: __++space__, __--space__ or __space__
  • __range__ can be __min..__, __min..max__, __..max__ or __exact__
  • __min__, __max__ or __exact__ can only be an __unsigned integer__
  • using a __range__ with an __expression__ (e.g. 2|expression) can allow us to set the total amount of __expression__ required for __rule__ to parse successfully.
  • using the __exact__ range with __++expression__ or __--expression__ (e.g. 3|++expression) can allow us to set the __integer__ amount for __++__ or __--__, which is by default __1__
  • using either __min__ or __max__ range with __++expression__ or __--expression__ throws a syntax error.

I'm not using state variables as that would just be confusing with rule identifiers.

Using a combination of __range__, __++__, __--__ or __@__ I'm hoping to create PEG grammar files that rely less on a rules __action__ for their result, which should increase the development time of whitespace (e.g. indentation-based, ASCII art, etc) languages as the language designers and/or implementers don't have to worry about trying to confirm if the correct amount of whitespace has been parsed.
This should also allow plugin developers to create parser generators that can optimise with _less fear_ of what our __action__'s language is (by default JavaScript, but with plugin's can change it to CoffeeScript, PHP, etc).

So it seems like it's still not possible today to parse Python with PEG.js out of the box, is that right?

If not, is this something that will be coming soon? Is there a set of tasks required to make this work that people could contribute?

I have a project where I'd love to be able to get a Python AST in JS, modify it and then convert it back to source code with the same formatting, so I'd be interested in making this happen if there is a clear roadmap.

@mindjuice Not yet, it's planned for post 1.0.

If you cannot wait, my nephew and I created a parser written in TypeScript that uses the same syntax and handles indentation. It's not documented yet, but it's pretty simple. There is still work ongoing as we're using it as the parser for a new language design.

https://github.com/krisnye/pegs

You can also try another parser generator with support such as chevrotain

Python indentation example

I believe it's very possible to parse python-like indentation with PEGjs.
The example below uses only four-space-based indentation, but it can be extended to cover arbitrarily-spaced tabs and other whitespace characters.
In fact, the language I'm working on has a slightly more complicated indentation story than Python has and this grammar works well for it.

{
    let prevIndentCount = 0;
    function print(...s) { console.log(...s); }
}

Indent 'indent'
    = i:("    "+) { 
        let currentIndentCount = i.toString().replace(/,/g, "").length;
        if (currentIndentCount === prevIndentCount + 4) { 
            // DEBUG //
            print("=== Indent ===");
            print("    current:"+currentIndentCount); 
            print("    previous:"+prevIndentCount);
            print("    lineNumber:"+location().start.line); 
            // DEBUG //
            prevIndentCount += 4;
            return "[indent]";
        }
        error("error: expected a 4-space indentation here!")
    } // 4 spaces 

Samedent 'samedent'
    = s:("    "+ / "") &{
        let currentIndentCount = s.toString().replace(/,/g, "").length;
        if (currentIndentCount === prevIndentCount) {
            print("=== Samedent ===");
            return true;
        }
        return false;
    }

Dedent 'dedent'
    = d:("    "+ / "") {
        let currentIndentCount = d.toString().replace(/,/g, "").length;
        if (currentIndentCount < prevIndentCount) {
            // DEBUG //
            print("=== Dedent ===");
            print("    current:"+currentIndentCount); 
            print("    previous:"+prevIndentCount);
            print("    lineNumber:"+location().start.line); 
            // DEBUG //
            prevIndentCount -= 4;
            return "[dedent]";
        }
        error("error: expected a 4-space dedentation here!");
    }

With the grammar above, you can create an indented block rule like this:

FunctionDeclaration 
    = 'function' _ Identifier _ FunctionParameterSection _ ":" _ FunctionBody

FunctionBody
    = Newline Indent FunctionSourceCode (Newline Samedent FunctionSourceCode)* Dedent 
Was this page helpful?
0 / 5 - 0 ratings

Related issues

ronjouch picture ronjouch  ·  3Comments

audinue picture audinue  ·  13Comments

StoneCypher picture StoneCypher  ·  6Comments

StoneCypher picture StoneCypher  ·  8Comments

brettz9 picture brettz9  ·  8Comments