Pegjs: How to preserve whitespace separators?

Created on 3 Oct 2019  ·  6 Comments  ·  Source: pegjs/pegjs

Issue type

  • Bug Report: _no_
  • Feature Request: _no_
  • Question: _yes_
  • Not an issue: _no_

Prerequisites

  • Can you reproduce the issue?: _yes_
  • Did you search the repository issues?: _yes_
  • Did you check the forums?: _yes_
  • Did you perform a web search (google, yahoo, etc)?: _yes_

I am struggling to make the PEG.js parser keep the original whitespace of the equation.

Current behavior: 2 * 5 + SUM(1, 2, 3)

[
   "2",
   "*",
   "5",
   "+",
   "SUM",
   "(",
   [
      "1",
      ",",
      "2",
      ",",
      "3"
   ],
   ")"
]

Desired behaviour: 2 * 5 + SUM(1, 2, 3)

[
   "2",
   " ",
   "*",
   " ",
   "5",
   " ",
   "+",
   " ",
   "SUM",
   "(",
   [
      "1",
      ",",
      " ",
      "2",
      ",",
      " ",
      "3"
   ],
   ")"
]

Grammar to copy: https://pastebin.com/zpwqT6Uw
PEG.js playground https://pegjs.org/online

What am I missing?

All 6 comments

@futagoza incredibly sorry to bother you but it is the first time I'm dealing with PEG.js and this issue is critical for me. May I ask you for a little hint?

Best Regards,
Marek

I tried looking through your grammar (yesterday and just now), but it's really hard to understand (naming conventions aside, the format, to be honest, is all over the place), so it took me a while to track down a solution:

  1. Rules that consume space & data return it (e.g. const returns [left_space, cnst, right_space])
  2. Any rule/action that takes the result must do something like this: [].concat.apply([], con)
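A minimal sketch of the two points above, in plain JavaScript so it runs outside a grammar. The `con` value is hand-written and hypothetical: it stands in for what an enclosing rule might receive when each inner rule returns `[left_space, cnst, right_space]`, with `null` where an optional space matched nothing. The helper names `flattenOnce` and `dropEmpty` are illustrative and not part of the original grammar.

```javascript
// Hypothetical nested result: each inner array is [left_space, cnst, right_space].
// null stands for an optional space that did not match.
const con = [
  [null, "2", " "],
  [null, "*", " "],
  [null, "5", null],
];

// Flatten one level of nesting, as suggested in point 2.
function flattenOnce(parts) {
  return [].concat.apply([], parts);
}

// Drop the placeholders left behind by optional matches that consumed nothing.
function dropEmpty(parts) {
  return parts.filter((p) => p !== null && p !== "");
}

const flat = dropEmpty(flattenOnce(con));
console.log(flat);
```

Inside a grammar action the same expression would be applied to the labelled result before returning it.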

Even still, to be honest with you, this feels like a hacky solution to me. Have you got a link to a spec or something? It would help to know which rules I can and can't change to get the desired result without the above hacky solution.

If not, as long as you're willing to put the time in to tidy up the grammar and rename some rules (so it's easier to figure out what you want), then I'll gladly try to take another stab at it 😉

@marek-baranowski - Sorry I didn't see this until now. Hopefully this is still useful to you

If you want to keep the spaces, just treat them like matchable content.

It's not really clear exactly what you'd want for two spaces. You could either have a string of two spaces, or an array of two one-space strings. Normally I'd expect the former, but ... everything in your output is per-character
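To make the two options concrete, here is a small sketch. It assumes the action receives one array element per matched character, which is what a repeated character class such as `[ \r\n]+` hands to its action in PEG.js; the variable names are illustrative.

```javascript
// What an action labelled tx:[ \r\n]+ would receive for two spaces.
const tx = [" ", " "];

// Option 1: one string holding the whole run of whitespace.
const asOneString = tx.join("");

// Option 2: keep one entry per character.
const asCharArray = tx.slice();

console.log(JSON.stringify(asOneString), JSON.stringify(asCharArray));
```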

Also ... why would you want stray characters like that, except for the function call? The parser should be summing those up for you.

Anyway

This is what you asked for:

Document = Expression*

Whitespace
  = tx:[ \r\n]+ { return tx.join(''); }

Number
  = str:[0-9]+ { return str.join(''); }

Oper
  = '+'
  / '-'
  / '/'
  / '*'
  / ','

Label
  = l:[a-zA-Z]+ { return l.join(''); }

Parens 
  = '(' Whitespace? ex:Expression* Whitespace? ')' { return ex; }

Expression 
  = Number 
  / Oper
  / Whitespace
  / Label
  / Parens
  / [^()]+

Thing is, I am not super convinced that it's actually what you want. For example, you could instead parse the numbers and operators, and return a standardized node shape for each one:

Document = Expression*

Whitespace
  = tx:[ \r\n]+ { return { 
    ast: 'whitespace', value: tx.join('') 
  }; }

Number
  = str:[0-9]+ { return {
    // str is an array of digit characters, so join it before parsing
    ast: 'number', value: parseInt(str.join(''), 10)
  }; }

Oper
  = '+' { return { ast: 'oper', value: 'add' }}
  / '-' { return { ast: 'oper', value: 'subtract' }}
  / '/' { return { ast: 'oper', value: 'divide' }}
  / '*' { return { ast: 'oper', value: 'multiply' }}
  / ',' { return { ast: 'oper', value: 'sequence' }}

Label
  = l:[a-zA-Z]+ { return { 
    ast: 'label', value: l.join('') 
  }; }

Parens 
  = '(' Whitespace? ex:Expression* Whitespace? ')' { 
    return { ast: 'parens', value: ex 
  }; }

Expression 
  = Number 
  / Oper
  / Whitespace
  / Label
  / Parens
  / [^()]+

Now you still have your whitespace, but you also have a properly parsed tree and don't need to write a parser to parse your parser's output. It's also easy as pie to start adding regularized features like line numbers and so forth.
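As a rough illustration of why the node shape helps, the sketch below walks such a tree in plain JavaScript. The `tree` value is hand-written, not generated, and is only an assumption of what the grammar above would return for `2 * SUM(1, 2)`. The names `OPER_TEXT`, `render`, and `stripWhitespace` are hypothetical helpers. It assumes every entry is an `{ ast, value }` node, so input that falls through to the `[^()]+` alternative is not handled.

```javascript
// Hand-written stand-in for the parser output of "2 * SUM(1, 2)".
const tree = [
  { ast: "number", value: 2 },
  { ast: "whitespace", value: " " },
  { ast: "oper", value: "multiply" },
  { ast: "whitespace", value: " " },
  { ast: "label", value: "SUM" },
  { ast: "parens", value: [
    { ast: "number", value: 1 },
    { ast: "oper", value: "sequence" },
    { ast: "whitespace", value: " " },
    { ast: "number", value: 2 },
  ] },
];

// Maps the operator names used by the grammar back to their source characters.
const OPER_TEXT = { add: "+", subtract: "-", divide: "/", multiply: "*", sequence: "," };

// Turn a list of nodes back into text, recursing into parens nodes.
function render(nodes) {
  return nodes.map((node) => {
    switch (node.ast) {
      case "number": return String(node.value);
      case "oper": return OPER_TEXT[node.value];
      case "parens": return "(" + render(node.value) + ")";
      default: return node.value; // whitespace and label carry their text directly
    }
  }).join("");
}

// Return a copy of the tree without whitespace nodes, at every nesting level.
function stripWhitespace(nodes) {
  return nodes
    .filter((node) => node.ast !== "whitespace")
    .map((node) =>
      node.ast === "parens" ? { ...node, value: stripWhitespace(node.value) } : node
    );
}

console.log(render(tree));
console.log(render(stripWhitespace(tree)));
```

Keeping whitespace as its own node type means a consumer can choose between the two views with a filter, rather than re-parsing a flat list of strings.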

@marek-baranowski - I'd like to reduce the size of this issue tracker somewhat

If the above is what you need, would you please consider closing this issue? Thanks 😄

If it isn't, please let me know why, and I'll try again

@marek-baranowski Another gentle ping :smiley_cat:

Also, I wrote a PEG.js plugin pegjs-syntactic-actions to facilitate debugging of grammars, and specifically see what characters are captured by what rule independently of the actions, which is probably your issue here as explained by @StoneCypher.

The reasoning behind this plugin is: I find it is often difficult to understand the global result when it is not what we expect, because it results from the combination of many small actions, and finding the action which behaves badly or strangely can be time-consuming. With this plugin, we see which rule captures which character, and it gives the name of the action to act on.

oh wow this is really neat actually
