Pegjs: Ability to ignore certain productions

Created on 8 Oct 2010 · 29 comments · Source: pegjs/pegjs

It would be nice to be able to tell the lexer/parser to ignore certain productions (e.g. whitespace and comment productions) so that it becomes unnecessary to litter all other productions with comment/whitespace allowances. This may not be possible though, given that lexing is built in with parsing?

Thank you

feature


All 29 comments

Agreed. Is there a clean way to do this at the moment?

@benekastah: There is no clean way as of now.

This would be hard to do without changing how PEG.js works. Possible solutions include:

  1. Allow prepending a lexer before the generated parser.
  2. Embed information about ignored rules somewhere in the grammar. That would probably also mean distinguishing between the lexical and syntactical levels of the grammar, something I'd like to avoid.

I won't work on this now but it's something to think about in the future.

I would need this feature too.

Maybe you could introduce a "skip" token. If a rule returns that token, it will be ignored and get no node in the AST (i.e. no entry in the array).

I am looking for a way to do this as well.

I have a big grammar file (it parses the ASN.1 format for SNMP MIB files). I didn't write it, but I trivially transformed it from the original form to create a parser in PEG.js. (This is good. In fact, it's extremely slick that it took me less than 15 minutes to tweak it so that PEG.js would accept it.)

Unfortunately, the grammar was written assuming the parser would simply ignore whitespace and comments wherever it encounters them. Consequently, no real MIB files can be handled, because the parser stops at the first occurrence of whitespace.

I am not anxious to have to figure out the grammar so that I can insert all the proper whitespace in all the rules (there are about 126 productions...). Is there some other way to do this?

NB: In the event that I have to modify the grammar by hand, I asked for help with some questions in a ticket on the Google Groups list: http://groups.google.com/group/pegjs/browse_thread/thread/568b629f093983b7

Many thanks!

Thanks to the folks over on Google Groups. I think I got enough information to allow me to do what I want.

But I'm really looking forward to the ability in PEG.js to mark whitespace/comments as something to be ignored completely so that I wouldn't have to take a few hours to modify an otherwise clean grammar... Thanks!

Rich

I agree with the assertion that PEG.js needs the ability to skip tokens. I may look into it, since if you want to write a serious grammar you will go crazy putting whitespace between every token.

Since the generated parsers are modular, one workaround is to create a simplistic lexer and use its output as input to the for-real one, e.g.:

elideWS.pegjs:

s = input:(whitespaceCharacter / textCharacter)*
  {
    var result = "";
    for (var i = 0; i < input.length; i++) result += input[i];
    return result;
  }

whitespaceCharacter = [ \n\t] { return ""; }
textCharacter = c:[^ \n\t] { return c; }

but that causes problems when whitespace is a delimiter -- like for identifiers
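Despite that caveat, here is a rough sketch of the two-pass wiring (assuming the pegjs npm package, where generate() compiles a grammar string into a parser object; older releases exposed the same thing as PEG.buildParser()):

    var peg = require("pegjs");
    var fs = require("fs");

    // Pass 1 strips whitespace, pass 2 parses the stripped text.
    var stripper = peg.generate(fs.readFileSync("elideWS.pegjs", "utf8"));
    var parser = peg.generate(fs.readFileSync("grammar.pegjs", "utf8"));

    var input = "...";  // the text to parse
    var ast = parser.parse(stripper.parse(input));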

Bumping into this issue quite often.
But it's not easy to write a good lexer (you can end up duplicating a good chunk of the initial grammar to have a coherent lexer).

What I was thinking is to be able to define skip rules that can be used as alternatives whenever there's no match. This introduces the need for a non-breaking class though. Example with arithmetics.pegjs, using floats:

Expression
  = Term (("+" / "-") Term)*

Term
  = Factor (("*" / "/") Factor)*

Factor
  = "(" Expression ")"
  / Float

Float "float"
  = "-"? # [0-9]+ # ("." # [0-9]+) // # means that skip rules cannot match

// skip rule marked by "!="
// skip rules cannot match the empty string
_ "whitespace"
  != [ \t\n\r]+
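For contrast, expressing the same tolerance in today's PEG.js means threading the whitespace rule through every production by hand, roughly:

Expression
  = Term (_ ("+" / "-") _ Term)*

Term
  = Factor (_ ("*" / "/") _ Factor)*

Factor
  = "(" _ Expression _ ")"
  / Float

_ "whitespace"
  = [ \t\n\r]*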

Still digesting this. Any feedback? Might be a very stupid idea.

So the difference is that you want to distinguish when the overall engine is
operating in lexer mode (whitespace is significant) and when not (whitespace is
ignored).

Is there a case when you want to not ignore whitespace when in lexer mode
as an option? Or conversely, when not inside a regex? I think no.

Would the following be equivalent?

Float
"-?[0-9]+("." [0-9]+)"

or otherwise extend PEG.js to process typical regexes directly, where outside a quoted string (which includes regexes) whitespace is ignored.


@waTeim Actually no.

Traditionally the parsing process is split into lexing and parsing. During lexing every character is significant, including whitespace, but whitespace is lexed into a "discard" token. The parser, when advancing to the next token, will then drop any discard tokens. The important part is that you can discard anything, not just whitespace. This behavior is exactly what @andreineculau is describing.

The basic idea of how to implement this is to additionally check against all discard rules when transitioning from one state to the next.
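A minimal sketch of that token pump, with a hypothetical lexer API (PEG.js has no separate lexer today; this only illustrates the mechanism):

    // Return the next significant token, silently dropping any token
    // produced by a rule that was marked as a discard rule.
    function advance(lexer) {
      var token = lexer.next(); // may be whitespace, a comment, ...
      while (token !== null && token.discard) {
        token = lexer.next();
      }
      return token;
    }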

On Apr 23, 2014, at 2:54 PM, Sean Farrell wrote:

@waTeim Actually no.

So we agree. The traditional approach is sufficient. There's no need to have the strictly-parser part recognize the existence of discarded tokens, and there's no reason to make the lexer part behave conditionally (in a context-sensitive way) w.r.t. recognizing tokens.

Therefore there's no need to have glue elements (e.g. '#') in the language, because it suffices that

1) tokens can be created solely from regexes and are not context sensitive.
2) tokens can be marked to be discarded without exception.


Ok then I misunderstood you. There may be cases for lexer states, but that is a totally different requirement and IMHO outside of the scope of peg.js.

@waTeim @rioki Forget a bit about my suggestion.

Hands on, take this rule. If you would like to simplify the rule's grammar by taking away the *WS, then how would you instruct PEG.js to not allow *WS between field_name and ":"?
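For reference, the rule in question is the header-field rule from the HTTP-bis message grammar; paraphrased into PEG.js (names assumed), it looks roughly like:

header_field
  = field_name ":" OWS field_value OWS

OWS "optional whitespace"
  = [ \t]*

A global skip rule would also silently accept whitespace before the ":", which HTTP explicitly forbids.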

@andreineculau Because your grammar is whitespace sensitive, this is not applicable. The discard tokens would be part of the grammar, the lexing part to be exact. I don't know what the big issue is here; this was already sufficiently solved in the 70s. Each and every language has its own skippable tokens and places where they are applicable. The whitespace and comments are as much part of the language definition and thus part of the grammar. It just turns out that with most languages the skippable tokens may appear between each and every other token, and using a discard rule makes it WAY simpler than writing expr = WS lit WS op WS expr WS ";" for every rule. Just imagine a grammar like the one for C with explicit whitespace handling.

I understand that retconning discard rules into pegjs is not easy, but that does not mean that it is not a laudable goal.

Oh man, free response section! I have a lot to say, so sorry for the length.

1) For the TL;DR people: if I could add any PEG elements I wanted, I would have written it like this:

header_field
= field_name ":" field_value

whitespace(IGNORE)
= [\t ]+

The addition I'd make is an options section that may be included in any production.

The HTTP-bis language would not be limited by this rewrite (see appendix A).

2) My problem with the proposed #

It feels like you are exchanging requiring the user to fill the parser definition with a bunch of discard non-terminals (usually whitespace/delimiters) for requiring the user to fill the parser definition with a bunch of "here characters are not discarded" meta-characters, unnecessarily. Admittedly there would be fewer occurrences of this. It's the rare case when people actually consume delimiters and do something with them, and as I comment in appendix A, HTTP-bis is not one of those occurrences, just badly documented.

3) User defined parser states

But I can see how it would be easier on the parser definer to simply cut and paste the language spec from the definition, so if you must have something like this, then it could be done with lexical states, as alluded to earlier by Sean. I think I'd do it in the following way:

production1(state==1)
= stuff

production2(state==2)
= stuff

production3
= stuff {state = 1}

production4
= stuff {state = 2}

In other words, just like lex/yacc, make it possible for productions to only be available if the system is in a particular state, and allow the user to set that state value.

4) More options

Or you could make it easier on the user and more apparent to the reader with another option:

production(DONTIGNORE)
= stuff

This would allow the parser to override the default action of discarding tokens marked as discard, but only for that one production. This is really the same as 3, just an easier read. It is less flexible than the # proposal, because a production is either all-ignore or no-ignore, but I don't think that extra flexibility is needed.

5) Adding a parameter to getNextToken() allows context sensitivity

I think what all this comes down to is (I'm making some assumptions here): currently, the parser part calls getNextToken(input), and what needs to happen instead is to add a parameter to it: getNextToken(input, options).
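A sketch of that change, with hypothetical names (today's PEG.js integrates lexing into the generated parser, so no such function actually exists):

    // scanToken() stands in for whatever produces the next raw token.
    function getNextToken(input, options) {
      var token = scanToken(input);
      // Skip discard-marked tokens unless the current production opted out.
      while (!options.dontIgnore && token.discard) {
        token = scanToken(input);
      }
      return token;
    }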

Appendix A) That HTTP-bis spec

OK, I've read some but have not read all of this:

Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing
draft-ietf-httpbis-p1-messaging-26

I don't like the way they have defined their grammar. I don't suggest changing the input it accepts, but I would not have defined it as they did. In particular, I don't like that they have defined OWS and RWS and BWS, which all equate to exactly the same character string but in different contexts. They have defined

OWS ::== (SP | HTAB)*
RWS ::== (SP | HTAB)+
BWS ::== OWS

which is just repetition of tabs and spaces, for no good reason. They have made the language harder to parse (it requires the lexical analyzer to track its context) and they didn't need to do that.

They have defined OWS as "optional white space", BWS as "bad whitespace" (optional whitespace in the "bad" context, where it isn't necessary), and RWS as required whitespace, where it is necessary to delimit tokens. Nowhere is this whitespace used, except that perhaps there might be a parser warning if it matches BWS ("detected unnecessary trailing whitespace" or some such), which is all delimiters do anyway.

In their spec, the only place RWS is used is here

Via = 1#( received-protocol RWS received-by [ RWS comment ] )

 received-protocol = [ protocol-name "/" ] protocol-version
                     ; see Section 6.7
 received-by       = ( uri-host [ ":" port ] ) / pseudonym
 pseudonym         = token

but 'protocol-version' is numbers and maybe letters, while 'received-by' is numbers and letters. In other words, the lexical analyzer is not going to correctly recognize these two parts unless they are separated by whitespace, and it's going to be a syntax error, with or without RWS being explicitly identified, if there is not at least one whitespace character. So just remove RWS from the productions altogether and treat whitespace everywhere as a delimiter: it doesn't change the language, just how it's documented.


@waTeim I think you are going overboard with this. I have written quite a few parsers and I think lexer states were never really useful as such. Most cases where I saw them used were where the lexer consumed block comments, and it was "simpler" to put the lexer into "block comment mode" and write simple patterns than to write the über-pattern that consumes the whole comment (and counts lines).

I have never seen any proper use of lexer states stemming from the parser. The fundamental problem here is that with one token of look-ahead, by the time the parser sees the token that should switch states, the lexer has already erroneously lexed the next token. What you propose is almost impossible to implement without back-tracking, and that is never a good feature in a parser.

When writing a grammar you basically define which productions are considered parsed and what can be skipped. In @andreineculau's example there are two options: either you handle whitespace in the parser, or you make the trailing ":" part of the token ([a-zA-Z0-9!#$%&'+-.^_|~]+ ":").
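Spelled out as PEG.js rules, the two options look roughly like this (a sketch; the character class is abbreviated exactly as in the comment above):

// (a) handle whitespace explicitly in the parser, placing it only where allowed:
header_field
  = field_name ":" _ field_value

_
  = [ \t]*

// (b) make the trailing ":" part of the token, so nothing can intervene:
field_name
  = $([a-zA-Z0-9!#$%&'+-.^_|~]+ ":")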

I might suggest turning the problem into specifying a whitelist (which portions do I want to capture and transform) instead of a blacklist. Although whitespace is one problem with the current capture system, the nesting of rules is another. As I wrote in issue #66, the LPeg system of specifying what you want to capture directly, via transforms or string captures, seems more useful to me than specifying a handful of productions to skip and still dealing with the nesting of every other production.

See my comment in Issue #66 for a simple example of LPeg versus PEG.js with respect to captures. Although the names are a bit cryptic, see the Captures section of the LPeg documentation for the various ways that you can capture or transform a given production (or portion thereof).

Hello, I've created a snippet to ignore some general cases: null, undefined, and strings containing only space characters.
It can be required in the head of the grammar file, like:

{
  var strip = require('./strip-ast');
}

Two ways to improve it:

  • Customizable filter for terms, to ignore the specific terms that a particular grammar requires.
  • Skip nested empty arrays; this can be done in a second stage after the strip, and will remove «pyramids» of nested empty arrays.

If anyone is interested, we can upgrade it to a package.
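For the record, a minimal sketch of what such a strip helper could look like (an assumption about the snippet's shape, not the actual strip-ast code):

    // Recursively drop null, undefined and whitespace-only strings
    // from the nested arrays a parser produces.
    function strip(node) {
      if (Array.isArray(node)) {
        return node.map(strip).filter(function (n) {
          return n !== null && n !== undefined &&
            !(typeof n === "string" && /^\s*$/.test(n));
        });
      }
      return node;
    }

    module.exports = strip;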

@richb-hanover Where did your ASN.1 definition parser efforts land?

@atesgoral - I bailed out. I didn't need a "real parser" - I only needed to isolate certain named elements in the target file.

So I did what any wimpy guy would do - used regular expressions. (And then I had two problems :-)

But it did the trick, so I was able to move on to the next challenge. Good luck in your project!

Having had a look at chevrotain and its skip option, something like this is hugely desirable.

Too often we find ourselves writing something like this:

Pattern = head:PatternPart tail:( WS "," WS PatternPart )*
{
  return {
    type: 'pattern',
    elements: buildList( head, tail, 3 )
  };
}

Would be cool if we could write this instead:

WS "whitespace" = [ \t\n\r] { return '@@skipped' }

IgnoredComma = "," { return '@@skipped' }

Pattern = head:PatternPart tail:( WS IgnoredComma WS PatternPart )*
{
  return {
    type: 'pattern',
    elements: [head].concat(tail)
  };
}

@richb-hanover, and anybody else who got here in search of a similar need, I ended up writing my own parsers, too: https://www.npmjs.com/package/asn1exp and https://www.npmjs.com/package/asn1-tree

A skip would be relatively easy to implement using an ES6 Symbol, or maybe more durably by passing the parser a predicate at parse time (I prefer the latter option).
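A sketch of the Symbol variant (names invented for illustration):

    // Rule actions tag throwaway matches with a unique sentinel...
    const SKIP = Symbol("skip");

    // ...and list-building code filters the sentinel out afterwards.
    function dropSkipped(parts) {
      return parts.filter(p => p !== SKIP);
    }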

Just stumbled upon this too.
Not knowing anything about the innards of PEG.js, lemme throw a bone out there...

When we write a rule, at the end of it we can add a return block.
In that block, we can call things like text() and location(). These are internal functions.

Somewhere in the code the returned value of that block goes into the output stream.

So what would need to change in PEG.js if I want to skip a value returned by a rule, when that value is the result of calling a skip() local function?

e.g. comment = "//" space ([^\n])* newline { return skip() }

As mentioned above, skip() could return a Symbol, which is then checked by the code somewhere and removed.
Something like what lzhaki said, but internal to the library

I don't understand your question. Are you looking for a way to fail a rule under some circumstances? Use &{...} or !{...}. Otherwise, just don't use the returned value of the comment rule:

seq = comment r:another_rule { return r; };
choice = (comment / another_rule) { <you need to decide what to return instead of "comment" result> };

If it helps anyone, I ignore whitespace by having my top-level rule filter the array of results.

Example:

program
    = prog:expression+ { return prog.filter(a => a !== undefined) } // drop whitespace, keep 0

expression
    = float
    / number
    / whitespace

float
    = digits:(number "." number) { return parseFloat(digits.join("")) }

number 
    = digits:digit+ {return parseInt(digits.join(""),10)}

digit 
    = [0-9]

whitespace
    = [ \t\r\n] {return undefined}

This will happily parse input while keeping whitespace out of the result array. It also works for things like comments: just have the rule return undefined and the top-level rule will filter it out.

That only works for top-level productions. You have to manually filter every parent that could contain a filterable child.

@StoneCypher True, it does require some top-level work, but it works for me, and I think that as long as the grammar isn't too complex, one should be able to get away with a top-level filter.

Other than that, all I can think of is to have a top-level function that filters whitespace from input and to pass every match through it. Slower for sure, and it requires a lot more calls, but easy if you (like me) pass everything into a token generator. You can call the filter function from where you generate tokens, and then you only have to worry about generating your tokens; the whitespace is more or less automatically filtered.
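Roughly, the setup described (every name here is hypothetical):

    // All tokens funnel through one constructor, so whitespace can be
    // rejected in a single place instead of in every rule.
    function makeToken(type, text) {
      if (/^\s*$/.test(text)) return undefined; // whitespace never becomes a token
      return { type: type, text: text };
    }

    // Parent rules then drop the undefined entries in one pass.
    function collect(parts) {
      return parts.filter(function (p) { return p !== undefined; });
    }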

One of the things I liked about the current HEAD of PEG.js is its (undocumented) support for picking fields without having to create labels and write return statements. It looks like this:

foo = @bar _ @baz
bar = $"bar"i
baz = $"baz"i
_ = " "*
parse('barbaz') // returns [ 'bar', 'baz' ]

I feel like this gives nice, clean, explicit syntax for this use case plus a bunch of others.
