Pegjs: Ability to ignore certain productions

Created on 8 Oct 2010 · 29 comments · Source: pegjs/pegjs

It would be nice to be able to tell the lexer/parser to ignore certain productions (e.g. whitespace and comment productions) so that it becomes unnecessary to litter all other productions with comment/whitespace allowances. This may not be possible though, given that lexing is built in with parsing?

Thank you

feature


All 29 comments

Agreed. Is there a clean way to do this at the moment?

@benekastah: There is no clean way as of now.

This would be hard to do without changing how PEG.js works. Possible solutions include:

  1. Allow prepending a lexer before the generated parser.
  2. Embed information about ignored rules somewhere in the grammar. That would probably also mean distinguishing between the lexical and syntactical levels of the grammar, something I'd like to avoid.

I won't work on this now but it's something to think about in the future.

I would need this feature too.

Maybe you could introduce a "skip" token. If a rule returns that token, it will be ignored and get no node in the AST (i.e. no entry in the array).

I am looking for a way to do this as well.

I have a big grammar file (it parses the ASN.1 format for SNMP MIB files). I didn't write it, but I trivially transformed it from the original form to create a parser in PEG.js. (This is good. In fact, it's extremely slick that it took me less than 15 minutes to tweak it so that PEG.js would accept it.)

Unfortunately, the grammar was written assuming the parser would simply ignore whitespace and comments wherever it encounters them. Consequently, no real MIB files can be handled, because the parser stops at the first occurrence of whitespace.

I am not anxious to have to figure out the grammar so that I can insert all the proper whitespace in all the rules (there are about 126 productions...). Is there some other way to do this?

NB: In the event that I have to modify the grammar by hand, I asked for help with some questions in a ticket on the Google Groups list: http://groups.google.com/group/pegjs/browse_thread/thread/568b629f093983b7

Many thanks!

Thanks to the folks over on Google Groups. I think I got enough information to allow me to do what I want.

But I'm really looking forward to the ability in PEG.js to mark whitespace/comments as something to be ignored completely so that I wouldn't have to take a few hours to modify an otherwise clean grammar... Thanks!

Rich

I agree with the assertion that PEG.js needs the ability to skip tokens. I may look into it, since if you want to write a serious grammar you will go crazy putting whitespace between every token.

Since the generated parsers are modular, one workaround is to create a simplistic lexer and use its output as input to the for-real one, e.g.:

elideWS.pegjs:

s = input:(whitespaceCharacter / textCharacter)*
  {
    var result = "";
    for (var i = 0; i < input.length; i++) result += input[i];
    return result;
  }

whitespaceCharacter = [ \n\t] { return ""; }
textCharacter = c:[^ \n\t] { return c; }

but that causes problems when whitespace is a delimiter -- like for identifiers
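Despite that caveat, here is a rough sketch of the two-pass wiring (assuming the pegjs npm package, where generate() compiles a grammar string into a parser object; older releases exposed the same thing as PEG.buildParser()):

    var peg = require("pegjs");
    var fs = require("fs");

    // Pass 1 strips whitespace, pass 2 parses the stripped text.
    var stripper = peg.generate(fs.readFileSync("elideWS.pegjs", "utf8"));
    var parser = peg.generate(fs.readFileSync("grammar.pegjs", "utf8"));

    var input = "...";  // the text to parse
    var ast = parser.parse(stripper.parse(input));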

Bumping into this issue quite often.
But it's not easy to write a good lexer (you can end up duplicating a good chunk of the initial grammar to have a coherent lexer).

What I was thinking is to be able to define skip rules that can be used as alternatives whenever there's no match. This introduces the need for a non-breaking class though. Example with arithmetics.pegjs, using floats:

Expression
  = Term (("+" / "-") Term)*

Term
  = Factor (("*" / "/") Factor)*

Factor
  = "(" Expression ")"
  / Float

Float "float"
  = "-"? # [0-9]+ # ("." # [0-9]+) // # means that skip rules cannot match

// skip rule marked by "!="
// skip rules cannot match the empty string
_ "whitespace"
  != [ \t\n\r]+
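For contrast, expressing the same tolerance in today's PEG.js means threading the whitespace rule through every production by hand, roughly:

Expression
  = Term (_ ("+" / "-") _ Term)*

Term
  = Factor (_ ("*" / "/") _ Factor)*

Factor
  = "(" _ Expression _ ")"
  / Float

_ "whitespace"
  = [ \t\n\r]*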

Still digesting this. Any feedback? Might be a very stupid idea.

So the difference is that you want to distinguish when the overall engine is
operating in lexer mode (whitespace is significant) and when not (whitespace is
ignored).

Is there a case when you want to not ignore whitespace when in lexer mode
as an option? Or conversely, when not inside a regex? I think no.

Would the following be equivalent?

Float
"-?[0-9]+("." [0-9]+)"

or otherwise extend PEG.js to process typical regexes directly, where outside a quoted string (which includes regexes) whitespace is ignored.


@waTeim Actually no.

Traditionally the parsing process is split into lexing and parsing. During lexing every character is significant, including whitespace, but whitespace is lexed into a "discard" token. The parser, when advancing to the next token, will then drop any discard tokens. The important part is that you can discard anything, not just whitespace. This behavior is exactly what @andreineculau is describing.

The basic idea of how to implement this is to additionally check against all discard rules when transitioning from one state to the next.
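A minimal sketch of that token pump, with a hypothetical lexer API (PEG.js has no separate lexer today; this only illustrates the mechanism):

    // Return the next significant token, silently dropping any token
    // produced by a rule that was marked as a discard rule.
    function advance(lexer) {
      var token = lexer.next(); // may be whitespace, a comment, ...
      while (token !== null && token.discard) {
        token = lexer.next();
      }
      return token;
    }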

On Apr 23, 2014, at 2:54 PM, Sean Farrell wrote:

@waTeim Actually no.

So we agree. The traditional approach is sufficient. There's no need to have the strictly-parser part recognize the existence of discarded tokens, and there's no reason to make the lexer part behave conditionally (in a context-sensitive way) w.r.t. recognizing tokens.

Therefore there's no need to have glue elements (e.g. '#') in the language, because it suffices that

1) tokens can be created solely from regexes and are not context sensitive.
2) tokens can be marked to be discarded without exception.


Ok then I misunderstood you. There may be cases for lexer states, but that is a totally different requirement and IMHO outside of the scope of peg.js.

@waTeim @rioki Forget a bit about my suggestion.

Hands on, take this rule. If you would like to simplify the rule's grammar by taking away the *WS, then how would you instruct PEG.js to not allow *WS between field_name and ":"?
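For reference, the rule in question is the header-field rule from the HTTP-bis message grammar; paraphrased into PEG.js (names assumed), it looks roughly like:

header_field
  = field_name ":" OWS field_value OWS

OWS "optional whitespace"
  = [ \t]*

A global skip rule would also silently accept whitespace before the ":", which HTTP explicitly forbids.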

@andreineculau Because your grammar is whitespace sensitive, this is not applicable. The discard tokens would be part of the grammar, the lexing part to be exact. I don't know what the big issue is here; this was already sufficiently solved in the 70s. Each and every language has its own skippable tokens and places where they are applicable. The whitespace and comments are as much part of the language definition and thus part of the grammar. It just turns out that with most languages the skippable tokens may appear between each and every other token, and using a discard rule makes it WAY simpler than writing expr = WS lit WS op WS expr WS ";" for every rule. Just imagine a grammar like the one for C with explicit whitespace handling.

I understand that retconning discard rules into pegjs is not easy, but that does not mean that it is not a laudable goal.

Oh man, free response section! I have a lot to say, so sorry for the length.

1) For the TL;DR people: if I could add any PEG elements I wanted, I would have written it like this:

header_field
= field_name ":" field_value

whitespace(IGNORE)
= [\t ]+

The addition I'd make is an options section that may be included in any production.

The HTTP-bis language would not be limited by this rewrite (see appendix A).

2) My problem with the proposed #

It feels like you are exchanging requiring the user to fill the parser definition with a bunch of discard non-terminals (usually whitespace/delimiters) for requiring the user to fill the parser definition with a bunch of "here characters are not discarded" meta-characters, unnecessarily. Admittedly there would be fewer occurrences of this. It's the rare case when people actually consume delimiters and do something with them, and as I comment in appendix A, HTTP-bis is not one of those occurrences, just badly documented.

3) User defined parser states

But I can see how it would be easier on the parser definer to simply cut and paste the language spec from the definition, so if you must have something like this, then it could be done with lexical states, as alluded to earlier by Sean. I think I'd do it in the following way:

production1(state==1)
= stuff

production2(state==2)
= stuff

production3
= stuff {state = 1}

production4
= stuff {state = 2}

In other words, just like lex/yacc, make it possible for productions to only be available if the system is in a particular state, and allow the user to set that state value.

4) More options

Or you could make it easier on the user and more apparent to the reader with another option:

production(DONTIGNORE)
= stuff

This would allow the parser to override the default action of discarding tokens marked as discard, but only for that one production. This is really the same as 3, just an easier read. It is less flexible than the # proposal, because a production is either all-ignore or no-ignore, but I don't think that extra flexibility is needed.

5) Adding a parameter to getNextToken() allows context sensitivity

I think what all this comes down to is (I'm making some assumptions here): currently, the parser part calls getNextToken(input), and what needs to happen instead is to add a parameter to it: getNextToken(input, options).
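A sketch of that change, with hypothetical names (today's PEG.js integrates lexing into the generated parser, so no such function actually exists):

    // scanToken() stands in for whatever produces the next raw token.
    function getNextToken(input, options) {
      var token = scanToken(input);
      // Skip discard-marked tokens unless the current production opted out.
      while (!options.dontIgnore && token.discard) {
        token = scanToken(input);
      }
      return token;
    }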

Appendix A) That HTTP-bis spec

OK, I've read some but have not read all of this:

Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing
draft-ietf-httpbis-p1-messaging-26

I don't like the way they have defined their grammar. I don't suggest changing the input it accepts, but I would not have defined it as they did. In particular, I don't like that they have defined OWS and RWS and BWS, which all equate to exactly the same character string but in different contexts. They have defined

OWS ::== (SP | HTAB)*
RWS ::== (SP | HTAB)+
BWS ::== OWS

which is just repetition of tabs and spaces, for no good reason. They have made the language harder to parse (it requires the lexical analyzer to track its context) and they didn't need to do that.

They have defined OWS as "optional white space", BWS as "bad whitespace" (optional whitespace in the "bad" context, where it isn't necessary), and RWS as required whitespace, where it is necessary to delimit tokens. Nowhere is this whitespace used, except that perhaps there might be a parser warning if it matches BWS ("detected unnecessary trailing whitespace" or some such), which is all delimiters do anyway.

In their spec, the only place RWS is used is here

Via = 1#( received-protocol RWS received-by [ RWS comment ] )

 received-protocol = [ protocol-name "/" ] protocol-version
                     ; see Section 6.7
 received-by       = ( uri-host [ ":" port ] ) / pseudonym
 pseudonym         = token

but 'protocol-version' is numbers and maybe letters, while 'received-by' is numbers and letters. In other words, the lexical analyzer is not going to correctly recognize these two parts unless they are separated by whitespace, and it's going to be a syntax error, with or without RWS being explicitly identified, if there is not at least one whitespace character. So just remove RWS from the productions altogether and treat whitespace everywhere as a delimiter: it doesn't change the language, just how it's documented.


@waTeim I think you are going overboard with this. I have written quite a few parsers and I think lexer states were never really useful as such. Most cases where I saw them used were where the lexer consumed block comments, and it was "simpler" to put the lexer into "block comment mode" and write simple patterns than to write the über-pattern that consumes the whole comment (and counts lines).

I have never seen any proper use of lexer states stemming from the parser. The fundamental problem here is that with one token of look-ahead, by the time the parser sees the token that should switch states, the lexer has already erroneously lexed the next token. What you propose is almost impossible to implement without back-tracking, and that is never a good feature in a parser.

When writing a grammar you basically define which productions are considered parsed and what can be skipped. In @andreineculau's example there are two options: either you handle whitespace in the parser, or you make the trailing ":" part of the token ([a-zA-Z0-9!#$%&'+-.^_|~]+ ":").
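Spelled out as PEG.js rules, the two options look roughly like this (a sketch; the character class is abbreviated exactly as in the comment above):

// (a) handle whitespace explicitly in the parser, placing it only where allowed:
header_field
  = field_name ":" _ field_value

_
  = [ \t]*

// (b) make the trailing ":" part of the token, so nothing can intervene:
field_name
  = $([a-zA-Z0-9!#$%&'+-.^_|~]+ ":")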

I might suggest turning the problem into specifying a whitelist (which portions do I want to capture and transform) instead of a blacklist. Although whitespace is one problem with the current capture system, the nesting of rules is another. As I wrote in issue #66, the LPeg system of specifying what you want to capture directly, via transforms or string captures, seems more useful to me than specifying a handful of productions to skip and still dealing with the nesting of every other production.

See my comment in Issue #66 for a simple example of LPeg versus PEG.js with respect to captures. Although the names are a bit cryptic, see the Captures section of the LPeg documentation for the various ways that you can capture or transform a given production (or portion thereof).

Hello, I've created a snippet to ignore some general cases: null, undefined, and strings containing only space characters.
It can be required in the head of the grammar file, like:

{
  var strip = require('./strip-ast');
}

Two ways to improve it:

  • Customizable filter for terms, to ignore the specific terms that a particular grammar requires.
  • Skip nested empty arrays; this can be done in a second stage after the strip, and will remove «pyramids» of nested empty arrays.

If anyone is interested, we can upgrade it to a package.
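For the record, a minimal sketch of what such a strip helper could look like (an assumption about the snippet's shape, not the actual strip-ast code):

    // Recursively drop null, undefined and whitespace-only strings
    // from the nested arrays a parser produces.
    function strip(node) {
      if (Array.isArray(node)) {
        return node.map(strip).filter(function (n) {
          return n !== null && n !== undefined &&
            !(typeof n === "string" && /^\s*$/.test(n));
        });
      }
      return node;
    }

    module.exports = strip;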

@richb-hanover Where did your ASN.1 definition parser efforts land?

@atesgoral - I bailed out. I didn't need a "real parser" - I only needed to isolate certain named elements in the target file.

So I did what any wimpy guy would do - used regular expressions. (And then I had two problems :-)

But it did the trick, so I was able to move on to the next challenge. Good luck in your project!

Having had a look at chevrotain and its skip option, something like this is hugely desirable.

Too often we find ourselves writing something like this:

Pattern = head:PatternPart tail:( WS "," WS PatternPart )*
{
  return {
    type: 'pattern',
    elements: buildList( head, tail, 3 )
  };
}

Would be cool if we could write this instead:

WS "whitespace" = [ \t\n\r] { return '@@skipped' }

IgnoredComma = "," { return '@@skipped' }

Pattern = head:PatternPart tail:( WS IgnoredComma WS PatternPart )*
{
  return {
    type: 'pattern',
    elements: [head].concat(tail)
  };
}

@richb-hanover, and anybody else who got here in search of a similar need, I ended up writing my own parsers, too: https://www.npmjs.com/package/asn1exp and https://www.npmjs.com/package/asn1-tree

A skip would be relatively easy to implement using an ES6 Symbol, or maybe more durably by passing the parser a predicate at parse time (I prefer the latter option).
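A sketch of the Symbol variant (names invented for illustration):

    // Rule actions tag throwaway matches with a unique sentinel...
    const SKIP = Symbol("skip");

    // ...and list-building code filters the sentinel out afterwards.
    function dropSkipped(parts) {
      return parts.filter(p => p !== SKIP);
    }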

Just stumbled upon this too.
Not knowing anything about the innards of PEG.js, lemme throw a bone out there...

When we write a rule, at the end of it we can add a return block.
In that block, we can call things like text() and location(). These are internal functions.

Somewhere in the code the returned value of that block goes into the output stream.

So what would need to change in PEG.js if I want to skip a value returned by a rule, when that value is the result of calling a skip() local function?

e.g. comment = "//" space ([^\n])* newline { return skip() }

As mentioned above, skip() could return a Symbol, which is then checked by the code somewhere and removed.
Something like what lzhaki said, but internal to the library

I don't understand your question. Are you looking for a way to fail a rule under some circumstances? Use &{...} or !{...}. Otherwise, just don't use the returned value of the comment rule:

seq = comment r:another_rule { return r; };
choice = (comment / another_rule) { <you need to decide what to return instead of "comment" result> };

If it helps anyone, I ignore whitespace by having my top-level rule filter the array of results.

Example:

program
    = prog:expression+ { return prog.filter(a => a !== undefined) } // drop whitespace, keep 0

expression
    = float
    / number
    / whitespace

float
    = digits:(number "." number) { return parseFloat(digits.join("")) }

number 
    = digits:digit+ {return parseInt(digits.join(""),10)}

digit 
    = [0-9]

whitespace
    = [ \t\r\n] {return undefined}

This will happily parse input while keeping whitespace out of the result array. It also works for things like comments: just have the rule return undefined and the top-level rule will filter it out.

That only works for top-level productions. You have to manually filter every parent that could contain a filterable child.

@StoneCypher True, it does require some top-level work, but it works for me, and I think that as long as the grammar isn't too complex, one should be able to get away with a top-level filter.

Other than that, all I can think of is to have a top-level function that filters whitespace from input and to pass every match through it. Slower for sure, and it requires a lot more calls, but easy if you (like me) pass everything into a token generator. You can call the filter function from where you generate tokens, and then you only have to worry about generating your tokens; the whitespace is more or less automatically filtered.
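Roughly, the setup described (every name here is hypothetical):

    // All tokens funnel through one constructor, so whitespace can be
    // rejected in a single place instead of in every rule.
    function makeToken(type, text) {
      if (/^\s*$/.test(text)) return undefined; // whitespace never becomes a token
      return { type: type, text: text };
    }

    // Parent rules then drop the undefined entries in one pass.
    function collect(parts) {
      return parts.filter(function (p) { return p !== undefined; });
    }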

One of the things I liked about the current HEAD of PEG.js is its (undocumented) support for picking fields without having to create labels and write return statements. It looks like this:

foo = @bar _ @baz
bar = $"bar"i
baz = $"baz"i
_ = " "*
parse('barbaz') // returns [ 'bar', 'baz' ]

I feel like this gives nice, clean, explicit syntax for this use case plus a bunch of others.
