Pegjs: Ability to specify repetition count (like in regexps)

Created on 11 Aug 2011  ·  22 Comments  ·  Source: pegjs/pegjs

It would be helpful if the PEG.js grammar allowed something like range expressions of POSIX basic regular expressions to be used. E.g.:

  • "a"\{1,7\}

matches a, aa, ..., aaaaaaa

  • "a"\{0,1\}

matches the empty string and a

  • "a"\{,6\}

matches a string with up to (and including) six a's

  • "a"\{6,\}

matches a string of six or more a's

  • "a"\{3\}

matches only aaa, being equivalent to "a"\{3,3\}

feature

All 22 comments

I will not implement this feature.

The main reason is that there is no room in the PEG.js grammar for the {m,n} syntax — braces are already taken for actions and I don't want to use backslashes as you suggest (they are ugly and not compatible with Perl regexps which are the most used ones now and also source of other PEG.js syntax) or other delimiters (that would be confusing).

In my experience this kind of limited repetition occurs mainly in the "lexical" parts of the grammar (rules like color = "#" hexdigit hexdigit hexdigit hexdigit hexdigit hexdigit), and not that often. I think it's OK to just use sequences of expressions and the existing repetition operators (*, +, ?) there.

I've reconsidered and I am reopening this issue. It seems that the ability to specify an arbitrary number of repetitions is something many users want.

I'd like to avoid regexp-like {m,n} syntax because { and } are already taken for actions and re-using them would create ambiguity. I am currently thinking about something like this:

"foo" @ 1..10   // repeat 1 to 10 times
"foo" @ 1..     // repeat at least once
"foo" @ ..10    // repeat at most 10 times

The biggest question is what the separating character(s) should be and how to mark up ranges.
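Whatever separator is chosen, each of the three range forms above reduces to a (min, max) pair. As a minimal sketch of that reading, the following JavaScript helper converts the range text into explicit bounds. The name `parseRange` is hypothetical and not part of PEG.js, and rejecting a bare `..` is an assumption made here, not something the proposal specifies.

```javascript
// Hypothetical helper (not part of PEG.js): turn a range written as
// "1..10", "1.." or "..10" into explicit bounds.
// max === Infinity stands for "no upper limit".
function parseRange(spec) {
  const match = /^\s*(\d*)\s*\.\.\s*(\d*)\s*$/.exec(spec);
  // Assumption: a bare ".." with no bound at all is treated as an error.
  if (!match || (match[1] === "" && match[2] === "")) {
    throw new SyntaxError(`Invalid range: ${JSON.stringify(spec)}`);
  }
  const min = match[1] === "" ? 0 : parseInt(match[1], 10);
  const max = match[2] === "" ? Infinity : parseInt(match[2], 10);
  if (min > max) {
    throw new RangeError(`Minimum ${min} exceeds maximum ${max}`);
  }
  return { min, max };
}

console.log(parseRange("1..10")); // { min: 1, max: 10 }
console.log(parseRange("1.."));   // { min: 1, max: Infinity }
console.log(parseRange("..10"));  // { min: 0, max: 10 }
```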

As for the separating character, @ seems nice to me. I was considering % and #, but in my mind the first one is already associated with string interpolation (e.g. in Python) and the second one with comments (in various languages). I am also thinking about skipping the separator entirely:

"foo" 1..10   // repeat 1 to 10 times
"foo" 1..     // repeat at least once
"foo" ..10    // repeat at most 10 times

As for the range markup, I took inspiration from Ruby. I was also thinking about -, but it looks too much like a minus sign. On the other hand, the Python-like : also looks nice to me.

I am not sure about half-open ranges. Maybe it would be better to mark them up using + and - like this:

"foo" @ 1+    // repeat at least once
"foo" @ 10-   // repeat at most 10 times

Any ideas or comments?

Really cool that you plan to support this feature!

I like your (default) suggestion:
"foo" @ 1..10 // repeat 1 to 10 times
"foo" @ 1.. // repeat at least once
"foo" @ ..10 // repeat at most 10 times

I don't like the +/- syntax for half-open ranges, the double-dot syntax is much more intuitive and readable IMO.

The only thing I had second thoughts about was using "#" vs "@", because IMO "#" naturally implies numbers/counting, whereas "@" naturally implies a reference, so "#" may be a bit more intuitive and readable (and perhaps you could use the "@" in the future for something?). But that's really a minor issue, and I would be happy with the "@" syntax.

Cheers!

Just a quick comment: I think that @ and % are better choices than # because syntax highlighters that do not support the PEG.js grammar, especially those that attempt to guess the syntax (e.g. Stack Overflow's code highlighter), will likely interpret # as the start of a comment, causing it to be shown—annoyingly—from that point until EOL in the "comment color". This is not a preference based on logic and reasoning, of course, but on pragmatism.

How about special-casing forms like {num, num}? These would always mean repetition, since { , num} and { num, } aren't valid JS code, and {num, num} and { num } are pointless as actions.

They aren't likely to be meaningful even if the action is of other languages.

I like these variants among suggested (but this is up to you of course to choose, since you're the author :) ):

// why we need separator, anyway? for me it looks very cool and simple to understand
"foo" 1..10   // repeat 1 to 10 times
"foo" 1..     // repeat at least once
"foo" ..10    // repeat at most 10 times

or

"foo"@1..10   // repeat 1 to 10 times
"foo"@1..     // repeat at least once
"foo"@..10    // repeat at most 10 times

but the second is less preferable

The x..y / ..y / x.. idea looks very cool, because it makes .. read as one consistent operator.

+/- don't work for me: they are confusing, they add extra operators on top of .., and + is already used.

Thinking about it again. Will these work?

'foo'<1,5>
'foo'< ,3>
'foo'<2, >

since < and > are currently unused by the grammar

:+1: from me, that looks good.

of course, <,3> is equivalent to <0,3>, so we may as well just require the min number. This would be congruent with what ECMA has done for JavaScript regular expressions.
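For reference, this is how the mandatory minimum shows up in JavaScript regular expressions. The short check below is only an illustration: without the `u` flag, a brace group that lacks a minimum is not treated as a quantifier at all and is matched as literal text.

```javascript
// In JavaScript regular expressions the minimum is mandatory: {0,3} is a
// quantifier, while {,3} (without the u flag) is matched as literal text.
const withMin = /^a{0,3}$/;    // zero to three "a"s
const withoutMin = /^a{,3}$/;  // an "a" followed by the literal text "{,3}"

console.log(withMin.test("aaa"));      // true
console.log(withMin.test("aaaa"));     // false
console.log(withoutMin.test("aaa"));   // false
console.log(withoutMin.test("a{,3}")); // true
```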

I like the <,>. But I would also suggest the use of <3> being the same as <3,3>.

I agree, the <> syntax should map directly to the behavior of {} in RegExp as much as possible.

If I'm not mistaken, there's no need to add any delimiter, unless you want to allow variable names in the ranges.

foo 1,2 fighter
bar ,3 tender
baz 4, lurhmann
qux 5 quux

are all unambiguous.

@pygy, the problem with not using a delimiter is that it potentially stifles evolution of the syntax of the language.

For example, if we wanted to use comma for something else later on down the road, we would now have issues with syntax collisions all over the place. Constraining it to within <> brackets reduces the surface area of commas and numbers substantially.

Plus, people are used to using the {1,6} style in RegExps anyways.

I don't feel strongly about the syntax, but I do want this feature, and it'd be great if an expression could be used as a range value.

My use case: parsing literals in IMAP server responses, which look like {42}\r\n..., where 42 is the number of characters after the newline that represent a string (shown here as an ellipsis). Since there's no ending delimiter for an IMAP literal, character counting is the only way to parse this response.
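To make the character-counting requirement concrete, here is a rough hand-written sketch in plain JavaScript of what such a rule would have to do. The function name `readLiteral` is hypothetical. The sketch counts string characters as described above, whereas the IMAP specification counts octets, so a real parser would likely operate on bytes.

```javascript
// Sketch: read one IMAP-style literal of the form "{N}\r\n" followed by
// N characters. Assumption: input is a plain string and N counts
// characters; the IMAP specification counts octets instead.
function readLiteral(input) {
  const header = /^\{(\d+)\}\r\n/.exec(input);
  if (!header) {
    throw new SyntaxError("Expected a literal header such as {42}\\r\\n");
  }
  const length = parseInt(header[1], 10);
  const start = header[0].length;
  const available = input.length - start;
  if (available < length) {
    throw new RangeError(`Expected ${length} characters, got ${available}`);
  }
  return {
    value: input.slice(start, start + length), // the literal itself
    rest: input.slice(start + length),         // whatever follows it
  };
}

console.log(readLiteral("{5}\r\nhello world"));
// { value: 'hello', rest: ' world' }
```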

How about variables as bounds? This is very useful for messages whose header contains their length. For example, the grammar

start
  = len:number message:.<len,len> .* {return message;}
number
  = n:[0-9] {return parseInt(n);}

should parse as follows:

4[__] -> ['[', '_', '_', ']']
4[___] -> ['[', '_', '_', '_']
4[_] -> Error: expected 4 chars, got 3

This is useful for many protocols.
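The intended behaviour of the grammar above can be spelled out by hand in plain JavaScript. This is only a sketch of the semantics as described in this comment, not how PEG.js would generate the parser, and `parseMessage` is a hypothetical name.

```javascript
// Hand-written equivalent of the proposed rule: a single digit gives the
// length, exactly that many characters form the message, and any
// remaining input is consumed and ignored (the trailing ".*").
function parseMessage(input) {
  if (!/^[0-9]/.test(input)) {
    throw new SyntaxError("expected a length digit");
  }
  const len = parseInt(input[0], 10);
  const body = input.slice(1);
  if (body.length < len) {
    throw new SyntaxError(`expected ${len} chars, got ${body.length}`);
  }
  // Return the message as an array of characters, as ".<len,len>" would.
  return body.slice(0, len).split("");
}

console.log(parseMessage("4[__]"));  // [ '[', '_', '_', ']' ]
console.log(parseMessage("4[___]")); // [ '[', '_', '_', '_' ]
try {
  parseMessage("4[_]");
} catch (e) {
  console.log(e.message);            // expected 4 chars, got 3
}
```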

Maybe use this syntax instead:
expression |min,max|. Angle brackets could then be used for template rules.

Are you still considering implementing this?
What about something similar to ABNF ranges?

exp *     // 0 or more times
exp 1*    // at least once
exp *10   // up to 10 times
exp 1*10  // 1 to 10 times

Hello. I have a complex file format to parse. It is half binary, half ASCII.

Here is a simplified version of the problem:

KK4TestRandom or KK10TestATestBRandom

The logic:

<StringIndicator><StringLength><String><otherStuff>

KK is the indicator that marks a string. The digits that follow (here 4 and 10) give the length of the string. Then comes the string itself (here Test and TestATestB). The string is not terminated by any predictable pattern, so I basically have to use the length information. I'd say this is a common pattern in binary file formats, but is it possible to parse it with the current grammar?

Thank you.

I have implemented this in my branch ranges-dynamic-boundary. The grammar would look like this:

start = len:nx data:.|len| { return data; };
nx = n:$[0-9]+ { return parseInt(n, 10); };

@Mingun wow! That works like a charm! Thanks a lot for your implementation and the short example. I ran a few tests and it works great. I hope your pull request gets accepted into master.

I would love repetition counts as well. But I would suggest a slightly different syntax. Pegasus is almost identical to pegjs, only for C#. See here: https://github.com/otac0n/Pegasus/wiki/Syntax-Guide#expressions

And they implemented this feature using this: d<3> e<2,> f<1,5>

What are people's workarounds for this? I'm just getting into PEG.js right now, so maybe I'm trying to turn a screw with a hammer, but I'm just trying to match between 1 and 6 digits :)
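One workaround that needs no new syntax, in the spirit of the sequence-based suggestion near the top of the thread, is to spell the bounded repetition out with the existing operators: the required copies first, then one optional copy for each remaining slot. The helper below is a hypothetical sketch that generates such grammar text. It assumes `expr` is a single PEG.js expression (a rule name, a literal, a character class, or a parenthesized group).

```javascript
// Hypothetical helper (not part of PEG.js): write expr{min,max} using only
// sequences and the existing ? and * operators.
// Assumption: expr is a single expression, e.g. a rule name or "(a / b)".
function expandRepetition(expr, min, max = Infinity) {
  if (!Number.isInteger(min) || min < 0 || max < min || max < 1) {
    throw new RangeError(`Invalid bounds: ${min}, ${max}`);
  }
  const parts = Array(min).fill(expr); // the mandatory copies
  if (max === Infinity) {
    parts.push(`${expr}*`);            // open-ended tail
  } else {
    for (let i = min; i < max; i++) {
      parts.push(`${expr}?`);          // one optional copy per extra slot
    }
  }
  return parts.join(" ");
}

console.log(`digits = $(${expandRepetition("[0-9]", 1, 6)})`);
// digits = $([0-9] [0-9]? [0-9]? [0-9]? [0-9]? [0-9]?)
```

Because `?` is greedy in PEG, the generated sequence consumes as many copies as are available, up to the maximum. The generated text grows linearly with the maximum, so this is only practical for small bounds.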

I'm using my own implementation (see #267 for the syntax; the final solution supports numbers, variables, and code blocks as boundaries), and I'll soon prepare a PR for Peggy (the rebranded, maintained fork of PEG.js).
