Pegjs: Provide a concise way to indicate a single return value from a sequence

Created on 29 Jan 2014  ·  21 Comments  ·  Source: pegjs/pegjs

This is fairly common for me. I want to return just the relevant part of a sequence in the init label. In this case, just the AssignmentExpression.

pattern:Pattern init:(_ "=" _ a:AssignmentExpression {return a})?

I recommend you add @ expression which will return just the following expression from the sequence. So the above example would look like this:

pattern:Pattern init:(_ "=" _ @AssignmentExpression)?
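
To make the difference concrete, here is a minimal JavaScript sketch of what each form of the parenthesized sequence would hand to the init label. It does not run PEG.js; the array is a hand-written stand-in for the per-element results of the sequence, and pluck is a hypothetical helper name used only for illustration.

```javascript
// Hand-written stand-in for the results of the sequence
//   _ "=" _ AssignmentExpression
// (one entry per element, in order). The values are made up for illustration.
const sequenceResults = [" ", "=", " ", { type: "Literal", value: 42 }];

// Without an action, a sequence yields the whole array, so `init` would
// receive all four entries, whitespace included.
const initWithoutAction = sequenceResults;

// The action `{return a}` (and the proposed `@`) keep only one element.
// `pluck` is a hypothetical helper, not part of PEG.js.
function pluck(results, index) {
  return results[index];
}

const initWithPluck = pluck(sequenceResults, 3);

console.log(initWithoutAction.length); // 4
console.log(initWithPluck.value);      // 42
```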

Related Issues:

  • #427 Allow returning match result of a specific expression in a rule without an action
  • #545 Simple syntax extensions to shorten grammars
feature

All 21 comments

I agree this is a common problem. But I'm not sure whether it is worth solving. In other words, I'm not sure whether it occurs often enough and whether it causes enough pain to warrant adding a complexity to PEG.js by implementing a solution, whatever that would be.

I'll keep this issue open and mark it for consideration after 1.0.0.

Agreed. I would really like to see this in 1.0. I'm not so enthusiastic about the @ syntax though. IMHO a better idea would be to do this: if there is only a single "top level" label then implicitly return that label. So instead of:

rule = space* a:(text space+ otherText)+ newLine* { return a; }

You get:

rule = space* a:(text space+ otherText)+ newLine*

And when the label is not anything particularly meaningful also allow this:

rule = space* :(text space+ otherText)+ newLine*

So skip the label name altogether.
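
As a rough sketch of the implicit rule proposed here (return the single top-level label, otherwise fall back to the usual array), assuming each sequence element is modelled as a { label, value } pair. That shape and the name implicitResult are invented for this illustration and are not PEG.js internals.

```javascript
// Each element of a sequence is modelled as { label, value };
// `label` is null for unlabeled elements. This shape is an assumption
// made for the sketch, not PEG.js's internal representation.
function implicitResult(elements) {
  const labeled = elements.filter((el) => el.label !== null);
  // Exactly one top-level label: return its value implicitly.
  if (labeled.length === 1) {
    return labeled[0].value;
  }
  // Otherwise keep the usual behavior: an array of every element's result.
  return elements.map((el) => el.value);
}

// rule = space* a:(text space+ otherText)+ newLine*
const match = [
  { label: null, value: [" "] },
  { label: "a", value: [["foo", [" "], "bar"]] },
  { label: null, value: ["\n"] },
];

console.log(JSON.stringify(implicitResult(match))); // [["foo",[" "],"bar"]]
```

The fallback branch is where the readability concern raised in the next comment shows up: adding a second label silently changes what the rule returns.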

@mulderr I think the @ operator is better, since doing something implicitly means that if I read someone else's code that uses that feature, I would have to go through quite a lot of googling before I realize what he did there. In contrast, using an explicit operator would allow me to search the documentation quickly.

+1 for @ - This recurs a lot throughout my code.

+1 for some concise way to do this.

It seems like this pain is similar to that of the verbose function syntax in JS, which ES6 addresses using arrow functions. Maybe something similar could be used here? Something like:

rule = (space* a:(text space+ otherText) newLine*) => a

Seems to me that this is pretty flexible, still explicit (a la @wildeyes's concern), and feels less like adding complexity since in both syntax and implementation it defers to the underlying JS...

I've been picturing something a little like:

additive = left:multiplicative "+" right:additive {= left + right; }

Where an = (feel free to debate the choice of character) as the first non-whitespace character of a block is turned into a return.

This would also work for full expressions and should be possible with a transformation pass.
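
A transformation pass of this kind could be sketched as a plain string rewrite over the code block's body. The function name expandShorthand is hypothetical, and a real pass would presumably operate on the grammar's AST rather than on raw text; the sketch also accepts the => spelling suggested in the following comment.

```javascript
// Rewrites the body of a code block that starts with `=` (or `=>`)
// into an ordinary `return` statement. `expandShorthand` is a
// hypothetical name; a real pass would run over the grammar's AST.
function expandShorthand(code) {
  const match = /^\s*=>?\s*([\s\S]*?)\s*;?\s*$/.exec(code);
  if (match === null) {
    return code; // not shorthand: leave the block untouched
  }
  return "return " + match[1] + ";";
}

console.log(expandShorthand("= left + right; "));      // return left + right;
console.log(expandShorthand(" => left + right "));     // return left + right;
console.log(expandShorthand(" return left + right; ")); // unchanged
```

Because a JavaScript statement cannot start with =, the leading character is unambiguous, which is what makes the rewrite safe to apply mechanically.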

Any news? Why not additive = left:multiplicative "+" right:additive { => left + right } ?

It would certainly feel intuitive given how arrow functions now work ((left, right) => left + right).

There are actually quite a few examples of places in your parser.pegjs file that would be improved by this feature.

For instance:

  = head:ActionExpression tail:(__ "/" __ ActionExpression)* {
      return tail.length > 0
        ? {
            type: "choice",
            alternatives: buildList(head, tail, 3),
            location: location()
          }
        : head;
    }

This is fragile because the magic number 3 in the buildList call is non-intuitively linked to the position of your ActionExpression in the sequence. The buildList function itself is complicated by combining two different operations. By using the @ expression and ES6 spread syntax, this becomes cleaner:

  = head:ActionExpression tail:(__ "/" __ @ActionExpression)* {
      return tail.length > 0
        ? {
            type: "choice",
            alternatives: [head, ...tail],
            location: location()
          }
        : head;
    }
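
The two variants can be compared side by side in plain JavaScript. The helper bodies below are an assumption, reconstructed from how buildList and extractList are called in this thread; the tail arrays are hand-written stand-ins for what the parser would produce, with placeholder strings in place of real whitespace and expression results.

```javascript
// Assumed shape of the helpers, inferred from how they are called above.
function extractList(list, index) {
  return list.map((element) => element[index]);
}

function buildList(head, tail, index) {
  return [head].concat(extractList(tail, index));
}

// Without `@`: each tail entry is the full sequence [__, "/", __, ActionExpression].
const rawTail = [
  [" ", "/", " ", "b"],
  [" ", "/", " ", "c"],
];
// With `@`: each tail entry is already the plucked ActionExpression.
const pluckedTail = ["b", "c"];

const viaBuildList = buildList("a", rawTail, 3); // index 3 must track the grammar
const viaSpread = ["a", ...pluckedTail];

console.log(viaBuildList); // [ 'a', 'b', 'c' ]
console.log(viaSpread);    // [ 'a', 'b', 'c' ]
```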

Since this is pretty much just syntactic sugar, I was able to add this feature to your parser just by modifying the ActionExpression in parser.pegjs

  = ExtractSequenceExpression
  / expression:SequenceExpression code:(__ CodeBlock)? {
      return code !== null
        ? {
            type: "action",
            expression: expression,
            code: code[1],
            location: location()
          }
        : expression;
    }

ExtractExpression
  = "@" __ expression:PrefixedExpression {
      return {
        type: "labeled",
        label: "value",
        expression: expression,
        location: location()
      };
    }

ExtractSequenceExpression
  = head:(__ PrefixedExpression)* _ extract:ExtractExpression tail:(__ PrefixedExpression)* {
      return {
        type: "action",
        expression: {
          type: "sequence",
          elements: extractList(head, 1).concat(extract, extractList(tail, 1)),
          location: location()
        },
        code: "return value;",
        location: location()
      }
    }
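
The rewrite these rules perform can also be pictured on plain objects. The sketch below is not the parser code itself: desugarPluck and the pluck flag are hypothetical names, location is left out for brevity, and the node shapes simply follow the snippets above.

```javascript
// Models the rewrite performed by ExtractSequenceExpression on plain objects.
// `elements` is the sequence; exactly one entry is flagged `pluck: true`.
// Node shapes follow the thread's snippets; `location` is omitted for brevity.
function desugarPluck(elements) {
  const plucked = elements.filter((el) => el.pluck);
  if (plucked.length !== 1) {
    throw new Error("expected exactly one @ element, got " + plucked.length);
  }
  return {
    type: "action",
    expression: {
      type: "sequence",
      elements: elements.map((el) =>
        el.pluck
          ? { type: "labeled", label: "value", expression: el.expression }
          : el.expression
      ),
    },
    code: "return value;",
  };
}

// (_ "=" _ @AssignmentExpression)
const node = desugarPluck([
  { expression: { type: "rule_ref", name: "_" } },
  { expression: { type: "literal", value: "=" } },
  { expression: { type: "rule_ref", name: "_" } },
  { pluck: true, expression: { type: "rule_ref", name: "AssignmentExpression" } },
]);

console.log(node.code);                         // return value;
console.log(node.expression.elements[3].label); // value
```

One consequence of desugaring to a fixed label is that a user label also named value in the same sequence would collide with it, so a production version would likely need a reserved or generated name.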

I checked in a gist that shows how the parser.pegjs would be simplified by using the @ notation.

The extractOptional, extractList and buildList functions have been completely removed, since the @ notation makes it trivial to extract desired values from a sequence.

https://gist.github.com/krisnye/a6c2aac94ffc0e222754c52d69e44b83

@krisnye Here's how it would look if there was even more syntax sugar:

https://github.com/polkovnikov-ph/newpeg/blob/master/parse.np

I'm thinking of using a combination of :: and # for this syntax feature (see my explanation/reason in #545):

  • :: the binding operator
  • # the expansion operator

// this is imported into grammar
class List extends Array {
  constructor() { super(); this.isList = true; }
}

// grammar
number = _ ::value+ _
value = ::int #(_ "," _ ::int)* { return new List(); }
int = $[0-9]+
_ = [ \t]*

  • On a rule's root sequence, :: will return the result of the expression as the rule's result
  • In a nested sequence, :: will return the result of the nested expression as the sequence's result
  • If more than one :: is used, the results of the marked expressions will be returned as an array
  • If # is used on a nested sequence that contains ::, results will be pushed into the parent's array
  • If the rule has a code block along with ::/#, then execute it first, then use the result's push method

Following these rules, passing 09 , 55, 7 to the above example's generated parser would yield:

result = [
    isList: true
    0: "09"
    1: "55"
    2: "7"
]

result instanceof Array // true in ES2015+ environments
result instanceof List // true
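
The claimed result can be reproduced in plain JavaScript by applying the last rule by hand: run the code block first, then push each :: result into whatever it returned. This only models the proposal (the :: and # operators are not implemented anywhere here), and the three strings stand in for the int matches.

```javascript
// Subclass from the example; a derived constructor must call super()
// before touching `this`.
class List extends Array {
  constructor() {
    super();
    this.isList = true;
  }
}

// Last rule as described: run the code block first, then push each `::` result.
const result = new List();               // what `{ return new List(); }` returns
for (const value of ["09", "55", "7"]) { // the three `int` matches
  result.push(value);
}

console.log(result.isList);           // true
console.log(result.length);           // 3
console.log(result instanceof Array); // true
console.log(result instanceof List);  // true
```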

On a rule's root sequence, :: will return the result of the expression as the rule's result
In a nested sequence, :: will return the result of the nested expression as the sequence's result

Why are these separate? Why not just ":: on a sequence makes result of argument result of the sequence"?

If more than one :: is used, the result of the marked expressions will be returned as an array.

That's a bad idea. I'd rather bail out in this case. There is no sense in combining values of several types into an array. (That'd be a tuple, but they are pretty much useless in JS.)

If # is used on a sequence, the results will be Array#concat'ed into the parent's array

Now that's a horrible idea. This is obviously a kludge to get rid of extraneous { return xs.push(x), xs }. Creating such a special case doesn't make any sense, because it could be solved in a generic way with parameterized rules. There's not a lot of single-char sequences to use for operators, and we shouldn't be wasting them.

If the rule has a code block along with ::/#, then execute it first, then use the results push method

So f = ::"a" { return "b" } should have ["a", "b"] as result?

passing 09 . 55: 7 to the above example's generated parser would yield:

It wouldn't, there is no . or : mentioned. I don't see how the described behavior would produce the result either.

number = _ ::value+ _

Also that's not the way _ should be used. It goes either to the right or to the left of the token, with the side chosen once for the whole grammar. In case it's to the right, main rule should also start with _ and vice versa.

start = _ value
value = ::int #("," _ ::int)* { return new List(); }
int = ::$[0-9]+ _
_ = [ \t]*

An example of classes implemented in the JavaScript example (not meant to be a real implementation):

ClassMethod
  = head:MethodHead __ params:FunctionParameters __ body:FunctionBody {
      return {
        type: "method",
        name: head[ 1 ],
        modifiers: head[ 0 ],
        params: params,
        body: body
      };
    }

// `::` inside the zero_or_more "( ... )*" builds an array as we want,
// so this rule returns `[MethodModifier[], Identifier]` as expected
MethodHead = (::MethodModifier __)* ("function" __)? ::Identifier

// https://github.com/tc39/proposal-class-fields#private-fields
MethodModifier
  = "#"
  / "static"
  / "async"

FunctionParameters
  = "(" __ head:FunctionParam tail:(__ "," __ ::FunctionParam)* __ ")" {
      // due to `::`, tail is `FunctionParam[]` instead of `[__, ",", __, FunctionParam][]`
      return [ head ].concat( tail );
    }
  / "(" __ ")" { return []; }

FunctionParam
  = name:Identifier value:(__ "=" __ ::Expression)? {
      return { name, value };
    }

FunctionBody = "{" __ ::SourceElements? __ "}"

Why are these separate? Why not just ":: on a sequence makes result of argument result of the sequence"?

Isn't that the same thing? I've only made it more clear so that the result of different use cases like MethodHead, FunctionParam and FunctionBody can be easily understood.

There is no sense to combine values of several types into an array. (That'd be a tuple, but they are pretty much useless in JS.)

When building an AST (the most common result of a generated parser), in my opinion, it would simplify use cases like MethodHead, instead of writing:

MethodHead
  = modifiers:(::MethodModifier __)* ("function" __)? name:Identifier {
      return [ modifiers, name ];
    }

After thinking it through more, though, I see that while it simplifies the use case, it also opens the possibility of the developer making a mistake (either in the way they implement their grammar, or how the results are handled by actions). I therefore think putting this use case behind an option like multipleSingleReturns (default: false) would be the best course of action here (if this feature gets implemented, that is).

If # is used on a sequence, the results will be Array#concat'ed into the parent's array

Now that's a horrible idea. This is obviously a kludge to get rid of extraneous { return xs.push(x), xs }

It also helps in the more common use cases like FunctionParameters where it would be nicer to write:

// should always return `FunctionParam[]`
FunctionParameters
  = "(" __ ::FunctionParam #(__ "," __ ::FunctionParam)* __ ")"
  / "(" __ ")" { return []; }

it could be solved in generic way with parameterized rules

I'm thinking that I should be implementing single return values before parameterized rules as I'm still not sure how to proceed with the latter (I like using templates, but using rule < .., .. > = .. seems to add a lot of noise to the PEG.js grammar, so I'm trying to think of a slight syntax alteration to make it fit in better), but that's a separate issue.

There's not a lot of single-char sequences to use for operators, and we shouldn't be wasting them.

That's true, but if we don't use them when the occasion presents itself like this, then that's just as bad.

I thought of using # for this use case after remembering that it is used as an expansion operator in some languages that implement preprocessor directives, and that's essentially what this use case is covering here.

So f = ::"a" { return "b" } should have ["a", "b"] as result?

No, as the code block is expected by the parser to return an array-like object that contains a push method (so it would be f = ::"a" { return [ "b" ] } returning [ "b", "a" ] as result). That allows pushing not only into arrays, but also into custom nodes that implement the same method in a similar fashion. Seeing as how that was your first train of thought after reading that, would it be easier to understand if this was behind an option like pushSingleReturns? If this option is false (default), having a code block after a sequence that contains single return values would throw an error.
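
A small sketch of how that gating might behave, written as an ordinary function rather than generated parser code. The option name pushSingleReturns comes from the comment above; the function combine, its signature, and the error messages are invented for illustration.

```javascript
// `pushSingleReturns` mirrors the option floated above; nothing here is
// an existing PEG.js API.
function combine(codeBlockResult, plucked, options = {}) {
  if (!options.pushSingleReturns) {
    throw new Error("code block after a sequence with :: is not allowed");
  }
  if (codeBlockResult == null || typeof codeBlockResult.push !== "function") {
    throw new Error("code block must return an object with a push method");
  }
  // The code block ran first; the `::` results are appended afterwards.
  for (const value of plucked) {
    codeBlockResult.push(value);
  }
  return codeBlockResult;
}

// f = ::"a" { return [ "b" ] }
console.log(combine(["b"], ["a"], { pushSingleReturns: true })); // [ 'b', 'a' ]
```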

passing 09 . 55: 7 to the above example's generated parser would yield:

It wouldn't, there is no . or : mentioned.

Sorry, that was a mistake that got left in when I rewrote the given example.

Also that's not the way _ should be used. It goes either to the right or to the left of the token, with the side chosen once for the whole grammar. In case it's to the right, main rule should also start with _ and vice versa.

I think this is just a matter of preference really 😄, although I think it would have been easier to understand this example:

number = _ ::value+ EOS
...
EOS = !.

I don't see how described behavior would produce the result either.

Does this updated comment help to understand what I'm trying to say, and if not, what don't you understand?

I agree with @polkovnikov-ph that the proposed changes are in places extremely unobvious and will only be a source of additional errors.

If # is used on a sequence, the results will be Array#concat'ed into the parent's array

What should be returned in the grammar start = #('a')? As far as I understand, # is meant as syntactic sugar for flattening arrays? I do not think that this operator is necessary. For its most obvious use, expressions for lists of members with separators, it is better to add a dedicated syntax (see #30 and my fork, https://github.com/Mingun/pegjs/commit/db4b2b102982a53dbed1f579477c85c06f8b92e6).

If the rule has a code block along with ::/#, then execute it first, then use the results push method

Extremely unobvious behavior. Actions in the grammar source are located after the expression and are usually called after the expression has been parsed. And suddenly they somehow begin to be called before the parse. How shall labels behave?

The remaining three points are described in a way that obscures their meaning. As @polkovnikov-ph correctly asked, why does the description separate the first and second cases? Only two simple rules are needed:

  1. :: (frankly speaking, I do not like the choice of this character, it is too noisy) before an expression means that the sequence node returns only the elements marked with it
  2. If the sequence contains only one such element, its result is returned; otherwise an array of the results is returned

Examples:

start =   'a' 'b'   'c'; // => ['a', 'b', 'c']
start = ::'a' 'b'   'c'; // => 'a'
start = ::'a' 'b' ::'c'; // => ['a', 'c']
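
These two rules, plus the unmarked default shown in the first example line, fit in a few lines of JavaScript. The sketch assumes each element is modelled as { value, pluck }; that shape and the names sequenceResult and el are made up for this illustration.

```javascript
// Each element is { value, pluck }; this shape is assumed for the sketch.
function sequenceResult(elements) {
  const marked = elements.filter((el) => el.pluck);
  if (marked.length === 0) return elements.map((el) => el.value); // no `::` at all
  if (marked.length === 1) return marked[0].value;                // single `::`
  return marked.map((el) => el.value);                            // several `::`
}

// Small constructor for readability.
function el(value, pluck = false) {
  return { value, pluck };
}

console.log(sequenceResult([el("a"), el("b"), el("c")]));             // [ 'a', 'b', 'c' ]
console.log(sequenceResult([el("a", true), el("b"), el("c")]));       // a
console.log(sequenceResult([el("a", true), el("b"), el("c", true)])); // [ 'a', 'c' ]
```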

The big example describes exactly what I'd expect to see from :: and doesn't describe dubious use cases (#, several ::).

Isn't that the same thing?

That's what I've been asking. A description in generic form is usually more useful, because it assures the reader that it's really the same thing. Thank you for the clarification. :)

in my opinion, it would simplify use cases like MethodHead

But why not create an object instead? There are modifiers: and name: in notation, leave them as is in resulting JS object, and that will be cool.

it also opens the possibility of the developer making a mistake (either in the way they implement their grammar, or how the results are handled by actions)

Initially I was going to write about this, but then decided I don't have enough hard arguments. I'd rather not allow multiple :: on the same level of sequence at all (not even with a flag).

It also helps in the more common use cases like FunctionParameters where it would be nicer to write:

But that's the same thing. Really nice syntax would be inter(FunctionParam, "," __), with

inter a b = x:a xs:(b ::a)* { return xs.unshift(x), xs; }
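
For readers unfamiliar with parameterized rules, the same idea can be approximated with ordinary higher-order functions. The combinators below are written from scratch for this sketch (they are not PEG.js APIs); a parser is assumed to be a function (input, pos) that returns { value, pos } on success or null on failure, and the separator's result is dropped just as ::a does in the rule above.

```javascript
// A parser is a function (input, pos) => { value, pos } on success, or null.
// These tiny combinators are written for this sketch; they are not PEG.js APIs.
function regex(pattern) {
  const sticky = new RegExp(pattern.source, "y"); // match exactly at `pos`
  return (input, pos) => {
    sticky.lastIndex = pos;
    const m = sticky.exec(input);
    return m === null ? null : { value: m[0], pos: pos + m[0].length };
  };
}

// inter a b = x:a xs:(b ::a)* { return xs.unshift(x), xs; }
function inter(a, b) {
  return (input, pos) => {
    const first = a(input, pos);
    if (first === null) return null;
    const values = [first.value];
    let end = first.pos;
    for (;;) {
      const sep = b(input, end);
      if (sep === null) break;
      const next = a(input, sep.pos);
      if (next === null) break; // backtrack: a dangling separator is not consumed
      values.push(next.value);
      end = next.pos;
    }
    return { value: values, pos: end };
  };
}

const int = regex(/[0-9]+/);
const comma = regex(/,[ \t]*/); // "," followed by blanks, standing in for `"," __`
const list = inter(int, comma);

console.log(JSON.stringify(list("09, 55,7", 0))); // {"value":["09","55","7"],"pos":8}
```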

rule < .., .. > = .. seems to add a lot of noise to the PEG.js grammar

People expect <...> to be used for types, while in this case arguments are not types. The best way is Haskell way without any extra chars at all (see inter above). I'm not sure how this interacts in PEG.js grammar with omitted ;. There might be a case when f a b = ... gets (partially) included into the previous row.

used as an expansion operator in some languages that implement preprocessor directives

Yes, but I'd rather use it smartly. Instead of the proposed push action on arrays, I'd use it as an Object.assign action on objects, or even as a thing that matches the empty string (eps) but returns its argument. So that, for example,

f = type:#"ident" name:$([a-z]i [a-z0-9_]i+)

would return {type: "ident", name: "abc"} for input "abc".

No, as the code block is expected by the parser to return an array-like object that contains a push method (so it would be f = ::"a" { return [ "b" ] } returning [ "b", "a" ] as result),

O_O

I think this is just a matter of preference really

Not only that, but also performance. If every token has _ on both sides, runs of whitespace are consumed only by the trailing _, while each preceding _ matches nothing. The extra calls to parse$_ take a small bit of extra time. The grammar is also longer because it has twice as many _s.

@Mingun I think @futagoza explores the design space. That's a good thing to do, especially if it's public and we have a chance to disagree :)

The main thing to do is not to ask "why" but to say "no! not like that!"

f = type:#"ident" name:$([a-z]i [a-z0-9_]i+) would return {type: "ident", name: "abc"} for input "abc".

Please no, haha. That syntax is wayyy too magical.

I think the originally proposed @ operator is perfect as-is. So many, many times I run into the sequence problem:

sequence
    = first:element rest:(whitespace next:element {return next;})*
    {
        return [first].concat(rest);
    }
    ;

Such a pain to type out over and over again, especially when they're any more complex than that.

However, with the @ operator, the above becomes simply:

sequence = first:element rest:(whitespace @element)* { return [first].concat(rest); };

and with either https://github.com/pegjs/pegjs/issues/235#issuecomment-66915879 or https://github.com/pegjs/pegjs/issues/235#issuecomment-67544080 that gets further reduced down to:

sequence = first:element rest:(whitespace @element)* => [first].concat(rest);
/* or */
sequence = first:element rest:(whitespace @element)* {=[first].concat(rest)};

... the first of which I'm very partial to.

This seems like it would be a backwards compatible change that would be simple to achieve (it appears someone has already done it).

In fact, if I'm not mistaken, it could be a mere minor bump. Might be something to think about for 0.11.0 @futagoza.

Just added this to master. Was planning to use :: for multi plucking and @ for single plucking, but using :: with labels looked really ugly and confusing, so I shelved that idea 🙄

I started implementing this on my own a while back but dropped it until now (when I should be doing #579 instead 😆) and based the bytecode generator on Mingun's implementation (https://github.com/Mingun/pegjs/commit/1c1c852bae91868eaa90d9bd9f7e4f722aa6435e)

You can give it a try here: https://pegjs.org/development/try (the online editor, but using PEG 0.11.0-dev)

Holy turnaround time, Batman. Awesome job @futagoza, this worked perfectly - very much appreciated. Was this released to the dev tag on npm already? I'd love to start testing with it.

For anyone that wants to try it with a basic grammar, put this puppy in there and give it an input like "abcd" (with the double quotes included, since the rule matches them).

foo
    = '"' @$bar '"'
    ;

bar
    = [abcd]*
    ;
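
For comparison, here is a hand-written JavaScript equivalent of what this grammar should return. It is not the generated parser and does not use PEG.js; parseFoo is a hypothetical name, and a single regular expression stands in for the two rules.

```javascript
// Hand-written equivalent of:
//   foo = '"' @$bar '"'
//   bar = [abcd]*
// Returns the text between the quotes, or throws on a mismatch.
function parseFoo(input) {
  const match = /^"([abcd]*)"$/.exec(input);
  if (match === null) {
    throw new SyntaxError("input does not match foo");
  }
  return match[1]; // the plucked `$bar` text; the quotes are discarded
}

console.log(parseFoo('"abcd"')); // abcd
```

Without the @, the same sequence would be expected to yield a three-element array with both quote characters around the matched text.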

Was this released to the dev tag on npm already?

https://www.npmjs.com/package/pegjs/v/0.11.0-dev.325

Hi, this is another one of those issues that just disappears if we have es6, and we need those characters for other things that have been added to es6 since. Adding operators for things you can already do is very counterproductive.

This ticket merged

pattern:Pattern init:(_ "=" _ @a:AssignmentExpression)?

The same thing in es6, which everyone will inherently understand, and which comes for free when the other parts of the parser are finished, is

pattern:Pattern init:(_ "=" _ a:AssignmentExpression)? => a

Problematically, when tested, this pluck implementation seems to be buggy, and of course, this is marked closed because this is fixed in a branch that will never be released

Please re-open this issue, @futagoza, until this is fixed in a released version.
