Pegjs: How to preserve whitespace separators?

Created on 3 Oct 2019 · 6 Comments · Source: pegjs/pegjs

Issue type

Bug Report: _no_
Feature Request: _no_
Question: _yes_
Not an issue: _no_

Prerequisites

Can you reproduce the issue?: _yes_
Did you search the repository issues?: _yes_
Did you check the forums?: _yes_
Did you perform a web search (google, yahoo, etc)?: _yes_

I am struggling to make the PEG.js parser keep the original whitespace of the equation.

Current behavior: 2 * 5 + SUM(1, 2, 3)

[
   "2",
   "*",
   "5",
   "+",
   "SUM",
   "(",
   [
      "1",
      ",",
      "2",
      ",",
      "3"
   ],
   ")"
]

Desired behaviour: 2 * 5 + SUM(1, 2, 3)

[
   "2",
   " ",
   "*",
   " ",
   "5",
   " ",
   "+",
   " ",
   "SUM",
   "(",
   [
      "1",
      ",",
      " ",
      "2",
      ",",
      " ",
      "3"
   ],
   ")"
]

Grammar to copy: https://pastebin.com/zpwqT6Uw
PEG.js playground: https://pegjs.org/online

What am I missing?

marek-baranowski

Most helpful comment

@marek-baranowski Another gentle ping :smiley_cat:

Also, I wrote a PEG.js plugin, pegjs-syntactic-actions, to facilitate debugging of grammars, and specifically to see which characters are captured by which rule independently of the actions, which is probably your issue here as explained by @StoneCypher.

The reasoning behind this plugin is: I find it is often difficult to understand the global result when it is not what we expect, because it results from the combination of many small actions, and finding the action that behaves badly or strangely can be time-consuming. With this plugin, we see which rule captures which character, and it gives the name of the action to act on.

Seb35 on 2 Apr 2020

All 6 comments

@futagoza incredibly sorry to bother you but it is the first time I'm dealing with PEG.js and this issue is critical for me. May I ask you for a little hint?

Best Regards,
Marek

marek-baranowski on 4 Oct 2019

I tried looking through your grammar (yesterday and just now), but it is really hard to understand (naming conventions aside, the format, to be honest, is all over the place), so it took me a while to track down a solution:

Rules that consume space & data return it (e.g. const returns [left_space, cnst, right_space])
Any rule/action that takes the result must do something like this: [].concat.apply([], con)
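
The two points above can be sketched in plain JavaScript. This is only an illustration, not the actual grammar: the sample triples are made up, and it assumes an unmatched optional whitespace comes back as null, which is what PEG.js returns for a `?` expression that matches nothing.

```javascript
// Hypothetical results from three rules that each return
// [left_space, token, right_space]. null stands for optional
// whitespace that did not match (assumed, see the note above).
const parts = [
  [null, "2", " "],
  [null, "*", " "],
  [null, "5", null],
];

// Flatten exactly one level, as in [].concat.apply([], con).
const flat = [].concat.apply([], parts);

// Drop the null placeholders so only real tokens and spaces remain.
const tokens = flat.filter((item) => item !== null);

console.log(tokens);
```

Note that this only flattens one level, so a nested array (such as function arguments) would stay nested.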

Even still, to be honest with you, this feels like a hacky solution to me. Have you got a link to a spec or something? It would help to know which rules I can and can't change to get the desired result without the above hacky solution.

If not, as long as you're willing to put the time in to tidy up the grammar and rename some rules (so it's easier to figure out what you want), then I'll gladly try to take another stab at it 😉

futagoza on 4 Oct 2019

@marek-baranowski - Sorry I didn't see this until now. Hopefully this is still useful to you.

If you want to keep the spaces, just treat them like matchable content.

It's not really clear exactly what you'd want for two spaces. You could either have a string of two spaces, or an array of two one-space strings. Normally I'd expect the former, but ... everything in your output is per-character.

Also ... why would you want stray characters like that, except for the function call? The parser should be summing those up for you.

Anyway

This is what you asked for:

Document = Expression*

Whitespace
  = tx:[ \r\n]+ { return tx.join(''); }

Number
  = str:[0-9]+ { return str.join(''); }

Oper
  = '+'
  / '-'
  / '/'
  / '*'
  / ','

Label
  = l:[a-zA-Z]+ { return l.join(''); }

Parens 
  = '(' Whitespace? ex:Expression* Whitespace? ')' { return ex; }

Expression 
  = Number 
  / Oper
  / Whitespace
  / Label
  / Parens
  / [^()]+
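
One way to check that a result really keeps every character is to flatten it back into a string and compare it with the input. A minimal sketch in plain JavaScript, using the desired output from the original question as sample data (the `render` helper is a hypothetical name, not part of PEG.js):

```javascript
// Desired output from the question, for the input "2 * 5 + SUM(1, 2, 3)".
const tokens = [
  "2", " ", "*", " ", "5", " ", "+", " ",
  "SUM", "(", ["1", ",", " ", "2", ",", " ", "3"], ")",
];

// Recursively join nested token arrays back into one string.
function render(node) {
  return Array.isArray(node) ? node.map(render).join("") : node;
}

const rendered = render(tokens);
console.log(rendered === "2 * 5 + SUM(1, 2, 3)");
```

If the comparison fails for a given parser result, some rule is consuming characters without returning them.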

Thing is, I am not super convinced that it's actually what you want. For example, you could instead parse the numbers and operators, and return a standardized node shape for each one:

Document = Expression*

Whitespace
  = tx:[ \r\n]+ { return { 
    ast: 'whitespace', value: tx.join('') 
  }; }

Number
  = str:[0-9]+ { return {
    ast: 'number', value: parseInt(str.join(''), 10)
  }; }

Oper
  = '+' { return { ast: 'oper', value: 'add' }}
  / '-' { return { ast: 'oper', value: 'subtract' }}
  / '/' { return { ast: 'oper', value: 'divide' }}
  / '*' { return { ast: 'oper', value: 'multiply' }}
  / ',' { return { ast: 'oper', value: 'sequence' }}

Label
  = l:[a-zA-Z]+ { return { 
    ast: 'label', value: l.join('') 
  }; }

Parens 
  = '(' Whitespace? ex:Expression* Whitespace? ')' { 
    return { ast: 'parens', value: ex 
  }; }

Expression 
  = Number 
  / Oper
  / Whitespace
  / Label
  / Parens
  / [^()]+

Now you still have your whitespace, but you also have a proper parsed tree, and you don't need to write a parser to parse your parser's output. It's also easy as pie now to start adding regularized features like line numbers and so forth.
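
As a rough illustration of working with that node shape, here is a sketch in plain JavaScript that walks a tree and turns it back into text. The tree is written by hand to match the `{ ast, value }` shapes returned above, it is not actual parser output, and `SYMBOLS` and `toSource` are hypothetical names. It does not handle the `[^()]+` fallback, which returns raw characters rather than a node.

```javascript
// Hand-written sample tree for "2 * (5 + 1)" (illustrative data only).
const tree = [
  { ast: "number", value: 2 },
  { ast: "whitespace", value: " " },
  { ast: "oper", value: "multiply" },
  { ast: "whitespace", value: " " },
  { ast: "parens", value: [
    { ast: "number", value: 5 },
    { ast: "whitespace", value: " " },
    { ast: "oper", value: "add" },
    { ast: "whitespace", value: " " },
    { ast: "number", value: 1 },
  ] },
];

// Map the operator names used in the grammar back to their symbols.
const SYMBOLS = {
  add: "+", subtract: "-", divide: "/", multiply: "*", sequence: ",",
};

// Convert a list of nodes back into source text.
function toSource(nodes) {
  return nodes.map((node) => {
    switch (node.ast) {
      case "number": return String(node.value);
      case "oper": return SYMBOLS[node.value];
      case "parens": return "(" + toSource(node.value) + ")";
      default: return node.value; // whitespace and label carry their text
    }
  }).join("");
}

const source = toSource(tree);
console.log(source);
```

Because every node has the same shape, the same kind of `switch` could drive an evaluator or a pretty-printer instead of a serializer.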