Pegjs: Add ability to track node position

Created on 13 Aug 2011 · 15 Comments · Source: pegjs/pegjs

PEG.js-generated parsers currently don't track position (line and column). I'd like to add this feature since it would be quite useful.

One way is to add line and column properties to each object returned as a match result:

start = "a" b:"b" { return [b.line, b.column]; } // Returns [1, 2] on the input "ab".

Another way is just to make special line and column variables available inside actions/predicates, referring to the position of the beginning of the current rule:

start = "a" "b" { return [line, column]; } // Returns [1, 1] on the input "ab".

The first way is more flexible, but this flexibility might not be needed actually. I am not sure which way I'll implement yet.

Both ways would hurt performance. To avoid that cost when position tracking is not required, tracking should be enabled only if a trackPosition option with a truthy value is passed to PEG.buildParser when generating the parser.

feature

dmajda

Most helpful comment

@tomitrescak this feature has apparently changed a few times in the past. Have a look at the changelog for more information. tl;dr is that you need to use the location() function in your grammar

Turbo87 on 23 Feb 2017

All 15 comments

Either option would be great. For my needs, simply line and column will be fine.

jweir on 14 Aug 2011

I'm also looking for a fix for this. Currently, I've hacked something quickly based on computeErrorPosition(), but it has several kinks in it.

On the surface, the first approach seems more intuitive than the second.

s3u on 15 Aug 2011

i like the first approach better, also.

izuzak on 15 Aug 2011

One possible optimization is to make name.position a function, so that it can be calculated lazily. It could then return a 2-item array of line and column. Just an idea for those who don't want to calculate the position for every single item.
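A minimal sketch of this lazy approach, assuming match results are objects and the parser knows each node's 0-based start offset. The name attachLazyPosition is hypothetical (it is not a PEG.js API), and for brevity the sketch treats only "\n" as a line break:

```javascript
// Hypothetical helper, not part of PEG.js: adds a position() method that
// computes [line, column] (both 1-based) only when first called.
function attachLazyPosition(node, input, offset) {
  var cached = null;

  node.position = function () {
    if (cached === null) {
      var line = 1;
      var column = 1;
      // Walk the input up to (but not including) the node's start offset.
      for (var i = 0; i < offset; i++) {
        if (input.charAt(i) === "\n") {
          line++;
          column = 1;
        } else {
          column++;
        }
      }
      cached = [line, column];
    }
    return cached;
  };

  return node;
}

var node = attachLazyPosition({ type: "identifier" }, "ab\ncdef", 5);
console.log(node.position()); // [ 2, 3 ]
```

The input is scanned only on the first call to position(); later calls return the cached array. This works only for object results, since a method cannot be attached to a primitive value.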

ckknight on 23 Aug 2011

In hindsight, it seems better to provide the start and end character positions, and then provide a helper function in the library that converts a position to line/column. In the general case, you don't need line/column unless there is an error, and once you have reached an error, you typically only need it once.

I'm currently doing this by abusing the startPos0 and pos variables, which feels hackish, but works for now.
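A rough sketch of such a conversion helper, assuming nodes carry 0-based character offsets. The names buildLineIndex and offsetToLineColumn are hypothetical and not part of PEG.js. The line starts are computed once, and each lookup is then a binary search:

```javascript
// Hypothetical helper: returns the 0-based offsets at which each line starts.
// "\r\n" counts as a single line break; "\r", "\u2028" and "\u2029" also
// end a line.
function buildLineIndex(input) {
  var starts = [0];
  for (var i = 0; i < input.length; i++) {
    var ch = input.charAt(i);
    if (ch === "\r" && input.charAt(i + 1) === "\n") {
      continue; // the following "\n" will close this line
    }
    if (ch === "\n" || ch === "\r" || ch === "\u2028" || ch === "\u2029") {
      starts.push(i + 1);
    }
  }
  return starts;
}

// Hypothetical helper: converts a 0-based offset to 1-based line and column
// by finding the last line start that is <= offset.
function offsetToLineColumn(starts, offset) {
  var lo = 0;
  var hi = starts.length - 1;
  while (lo < hi) {
    var mid = (lo + hi + 1) >> 1;
    if (starts[mid] <= offset) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return { line: lo + 1, column: offset - starts[lo] + 1 };
}

var lineStarts = buildLineIndex("ab\r\ncd\nef");
console.log(offsetToLineColumn(lineStarts, 5)); // { line: 2, column: 2 }
```

With this split, parsing itself only has to record offsets, and the line/column cost is paid only for the few positions that are actually reported.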

ckknight on 28 Aug 2011

Until there is an official fix, I'm including this function (a rough copy of computeErrorPosition) at the top of my grammars. At least it will give me the current position (although not the position of the individual nodes):

  function computeCurrentPos() {
    /*
     * The first idea was to use |String.split| to break the input up to the
     * error position along newlines and derive the line and column from
     * there. However IE's |split| implementation is so broken that it was
     * enough to prevent it.
     */

    var line = 1;
    var column = 1;
    var seenCR = false;

    for (var i = 0; i < pos; i++) {
      var ch = input.charAt(i);
      if (ch === '\n') {
        if (!seenCR) { line++; }
        column = 1;
        seenCR = false;
      } else if (ch === '\r' || ch === '\u2028' || ch === '\u2029') {
        line++;
        column = 1;
        seenCR = true;
      } else {
        column++;
        seenCR = false;
      }
    }

    return { line: line, column: column, pos: pos };
  }

wolever on 24 Dec 2011

This issue was fixed by a series of commits.

You can now pass the trackLineAndColumn option to the PEG.buildParser function:

var parser = PEG.buildParser(myGrammar, { trackLineAndColumn: true });

Setting trackLineAndColumn to true makes two new variables visible in the actions and predicates: line and column. For actions, these variables denote the start position of the action's expression, while in predicates they denote the current position. The slightly different behavior is motivated by expected usage.

The line and column tracking is optional because it hurts performance (it makes parsers about 3-4× slower). I may be able to optimize this somewhat in the future (I tried to make it work first).

In the issue description, I mentioned a different approach to this problem: adding line and column properties to each match result. While this solution was cleaner and preferred by users, I realized that one can't set properties on primitive values such as strings, numbers, or booleans (which are often returned as match results). Returning instances of wrapper objects (e.g. String, Number or Boolean) would be a possible workaround, but it would probably lead to subtle bugs in the parsers created by users, because these wrappers behave slightly differently from the primitives. Thus I decided to implement the alternative proposal.
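The pitfalls described above can be reproduced in plain JavaScript. In this sketch, tryAnnotate is a hypothetical helper (not part of PEG.js) that attempts to attach position properties to a match result:

```javascript
// Hypothetical helper: try to attach position info to a match result.
function tryAnnotate(value, line, column) {
  try {
    value.line = line;
    value.column = column;
  } catch (e) {
    // In strict mode, assigning a property on a primitive throws a
    // TypeError; in sloppy mode the assignment is silently discarded.
  }
  return value;
}

var primitive = tryAnnotate("b", 1, 2);
var wrapped = tryAnnotate(new String("b"), 1, 2);

console.log(primitive.line);   // undefined: the property was not stored
console.log(wrapped.line);     // 1
console.log(typeof wrapped);   // "object", not "string"
console.log(wrapped === "b");  // false: strict equality fails
console.log(wrapped == "b");   // true: loose equality still coerces
console.log(new Boolean(false) ? "truthy" : "falsy"); // "truthy"
```

The last three lines are the kind of subtle difference that could break user code comparing or testing match results.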

dmajda on 26 Mar 2012

Will the online test page be updated to support this feature?

paulftw on 5 Apr 2012

@paulftw The online editor always uses the latest stable version of PEG.js (currently 0.6.2). I'll update it once I release PEG.js 0.7.0 (which will include this feature). Unless something unexpected comes up, this will happen in the second half of April.

Note that the website source code is available on GitHub, so if you are impatient, you can run and modify it yourself.

dmajda on 8 Apr 2012

Just realized that all rows and columns are 1-based, while JS arrays as well as pretty much all modern languages and libraries use 0-based indexing.

Any chance to revert that design decision?

paulftw on 13 Apr 2012

@paulftw Can you give an example of a parser generator that reports lines and columns as 0-based?

The idea behind my decision is that these numbers will most likely be displayed to users (in error messages, as node positions, etc.), so 1-based indexing makes more sense than 0-based. For machine processing, the offset variable will probably be used most often, and that one is 0-based.

"Any chance to revert that design decision?"

Yes, if I become convinced :-)

dmajda on 19 Apr 2012

I use peg.js to highlight text in the Ace editor. Quite naturally Ace line numbers are 0-based.
It seems, however, that in Bison locations are 1-based.
http://git.savannah.gnu.org/cgit/bison.git/tree/src/location.c#n73

If this is indeed the case, then I should stop arguing.

Not sure what kind of example you want to see.
Right now I use the following code to find a symbol name:

captor
= "" { return new Position(line - 1, column - 1); }

symbol
= start:captor name:IDENT end:captor S* { return new Symbol(name, new Range(start, end)); }

paulftw on 20 Apr 2012

@paulftw I've split this problem into a separate issue. I'd wait for users to adopt 0.7.0 and see what the prevalent usage is before deciding.

dmajda on 20 Apr 2012

I know this is long closed, but how can I make this work? I'm quite new to this. I compiled my grammar with the trackLineAndColumn: true option, but nodes still do not contain lines and columns. What else is needed?

You mention the following, but I don't know how to use it. Is it documented somewhere?

"Setting trackLineAndColumn to true makes two new variables visible in the actions and predicates: line and column. For actions, these variables denote the start position of the action's expression, while in predicates they denote the current position. The slightly different behavior is motivated by expected usage."