Pegjs: Import/include other grammars

Created on 16 Aug 2011  ·  32 Comments  ·  Source: pegjs/pegjs

It could be extremely useful to have the ability to define grammars by importing rules from other grammars.

Several ideas:

@include "expression.pegjs"
(or @from "expression.pegjs" import expression)

tag_if
    = "if" space? expression space? { ... }

@import "expression.pegjs" as expr

tag_if
    = "if" space? expr.expression space?

Ideally, this would not regenerate the whole parser code in every .pegjs file that includes another one. We might have to modify the behaviour of parse() slightly, along these lines.

(Edited as per what you were saying in the options issue.)

parse(input, startRule)
->
parse(input, { startRule: "...", startPos : 9000 })

And at the end, if startPos != 0 && result !== null, we would not check that parsing reached input.length, but instead return the result together with the endPos (I don't really know how to do that elegantly; maybe simply by modifying the options parameter?).
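A minimal sketch of the behaviour proposed above, using a hand-written stand-in rather than a real PEG.js-generated parser. The `rules` table, the `integer` rule, and the `{ result, endPos }` return shape are all assumptions made for illustration; the proposal itself leaves open how `endPos` would be returned.

```javascript
// Hypothetical stand-in for a generated parser: each rule takes (input, pos)
// and returns { value, endPos } on success or null on failure.
const rules = {
  integer(input, pos) {
    const match = /^[0-9]+/.exec(input.slice(pos));
    return match
      ? { value: parseInt(match[0], 10), endPos: pos + match[0].length }
      : null;
  },
};

function parse(input, options = {}) {
  const { startRule = "integer", startPos = 0 } = options;
  if (!Object.prototype.hasOwnProperty.call(rules, startRule)) {
    throw new Error(`Unknown start rule: ${startRule}`);
  }

  const match = rules[startRule](input, startPos);
  if (match === null) {
    throw new SyntaxError(`Rule ${startRule} failed at position ${startPos}`);
  }

  if (startPos === 0) {
    // Top-level call: keep the usual requirement that all input is consumed.
    if (match.endPos !== input.length) {
      throw new SyntaxError(`Unexpected input at position ${match.endPos}`);
    }
    return match.value;
  }

  // Nested call: skip the end-of-input check and report where parsing stopped.
  return { result: match.value, endPos: match.endPos };
}
```

In this sketch, a caller that embeds one grammar in another would pass a non-zero `startPos` and resume its own parsing at the returned `endPos`.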

It would allow reusability of grammars and modularisation of the code, which I think are two extremely important aspects of coding in general.

feature

All 32 comments

I agree that this is an important feature, I want to do this after version 1.0.

(BTW I don't like the Python-like syntax you propose — something similar to Node.js's require would be better because it would be more familiar to JavaScript programmers. But this is a minor thing that can be ironed out later.)

Would you consider it for inclusion before 1.0 if provided with a patch ?

I agree on your remark about the python syntax.

+1 for this feature

@ceymard Yes, I would consider it.

+1 for the feature and +1 for require style inclusion

@dmajda @ceymard Do you have any thoughts already on how to implement this? I need this for a project at work and will try to implement. The question is should this be just an addition to split grammars into multiple files or something like inheritance, so one could inherit all rules for example and then overwrite specific rules in the new grammar.

@Dignifiedquire I am currently thinking about syntax & semantics that can probably be best explained by an example:

static-languages.pegjs

languages = "C" / "C++" / "Java" / "C#"

dynamic-languages.pegjs

languages = "Ruby" / "Python" / "JavaScript"

all-languages.pegjs

static  = require("./static-languages")
dynamic = require("./dynamic-languages")

all = static.languages / dynamic.languages

Each .pegjs file would implicitly define a module that would export all the rules it contains. The <name> = require(<module>) construct would import such a module. Its rules would then be available inside a namespace.

This design is deliberately similar to Node.js. Using namespaces will avoid conflicts. There are two downsides I see:

  1. The <name> = require(<module>) construct is too similar to rule definitions and thus can be confusing (one might think that just one rule is imported).
  2. The . syntax conflicts with the current meaning of ., which is “any character”. This can be solved by ugly hacks (e.g. . surrounded by whitespace means “any character”, while . surrounded by identifiers separates a namespace name from a rule name) or by changing the syntax (e.g. using any keyword to represent “any character”).
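To make the second downside concrete, here is a rough sketch of the kind of context-sensitive tokenizing it would require. This is not PEG.js code: `tokenize` and the token type names are hypothetical, and the rule used here (a dot directly between two identifiers is a namespace separator, any other dot means "any character") is only one possible reading of the hack described above.

```javascript
// Alternatives, in order: namespace.rule, plain rule name, lone dot,
// double-quoted literal (no escape handling), choice operator.
const TOKEN =
  /([A-Za-z_]\w*)\.([A-Za-z_]\w*)|([A-Za-z_]\w*)|(\.)|("[^"]*")|(\/)/g;

function tokenize(expression) {
  const tokens = [];
  for (const m of expression.matchAll(TOKEN)) {
    if (m[1] !== undefined) {
      tokens.push({ type: "qualified", namespace: m[1], rule: m[2] });
    } else if (m[3] !== undefined) {
      tokens.push({ type: "rule", name: m[3] });
    } else if (m[4] !== undefined) {
      tokens.push({ type: "any" });
    } else if (m[5] !== undefined) {
      tokens.push({ type: "literal", text: m[5] });
    } else {
      tokens.push({ type: "choice" });
    }
  }
  return tokens;
}
```

With this sketch, `static.languages` becomes a single qualified reference, while `static . languages` becomes three tokens: a rule, "any character", and another rule. That whitespace sensitivity is what makes the approach feel like a hack.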

@dmajda As the <identifier> = <expression> pattern is already taken by the rule definitions, why not do something like this:

static := require("./static-languages")
dynamic := require("./dynamic-languages")

all = static::languages / dynamic::languages

The :: is not used anywhere that I know of in PEG.js and makes it easy to distinguish between namespaces and other things. I'm not sure about the :=; it gets the point across but feels very foreign for JavaScript.

Also if you want to use namespaces, do you think there should be only one namespace per file or should there be a way of creating multiple namespaces in one file like this:

static := {
  languages  = "C" / "C++" / "Java" / "C#"
}

dynamic := {
  languages = "Ruby" / "Python" / "JavaScript"
}

I'm not much of a fan of :: and :=; they look alien in the JavaScript/CoffeeScript world.

I'd also like to keep things simple and define namespaces implicitly only by requiring files. I don't see a big need for anything more complicated.

How about simply:

@require foo = "./foo"

bar = foo:languages

Colons are a compromise, but they are used to separate namespaces in many places: C++, C#, XML, etc.

: will always be associated with cons for many, many functional programmers. I suggest staying away from that operator. :: looks fine to me. Isn't that used for C++ namespaces? I'm not convinced yet that . is a bad choice, either.

. can't be used without a breaking change. It would be ambiguous in the language.

:: is used in C++ for namespaces, and in C# for namespace prefixes (global::System, for example).

I was thinking of a quick workaround on this topic - to solve simple inheritance only - glue pegjs files together, while having everything namespaced.

This might make grammars too verbose, and it involves a build step, but looking at the bright side, it would force you to have granular DRY&OTW grammars.

And regarding the markup, I'm not saying that this is a proper fit for this thread, just an option to consider: I was going for a simple __

languages = static__languages / dynamic__languages
<static-languages.pegjs>
<dynamic-languages.pegjs>
/* alternative */
languages = STATIC__languages / DYNAMIC__languages
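The gluing step described above could look roughly like the following. This is a sketch only: `prefixRules` and `glue` are hypothetical names, and the code assumes grammars that contain only rules, with no initializer, no `{ ... }` action blocks, and no comments, since identifiers inside those would be rewritten by mistake.

```javascript
// Matches, in order: double-quoted string, single-quoted string,
// character class, identifier. Only identifiers get the prefix.
const GRAMMAR_TOKEN =
  /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|\[(?:[^\]\\]|\\.)*\]|[A-Za-z_]\w*/g;

// Prefix every rule name (definitions and references) with `namespace__`.
function prefixRules(source, namespace) {
  return source.replace(GRAMMAR_TOKEN, (token) =>
    /^[A-Za-z_]/.test(token) ? `${namespace}__${token}` : token
  );
}

// Concatenate the main grammar with its namespaced modules. The main grammar
// comes first so that its first rule stays the start rule.
function glue(mainSource, modules) {
  const parts = [mainSource.trim()];
  for (const [namespace, source] of Object.entries(modules)) {
    parts.push(prefixRules(source, namespace).trim());
  }
  return parts.join("\n\n") + "\n";
}

const combined = glue("languages = static__languages / dynamic__languages", {
  static: 'languages = "C" / "C++" / "Java" / "C#"',
  dynamic: 'languages = "Ruby" / "Python" / "JavaScript"',
});
console.log(combined);
```

The resulting single string could then be handed to the parser generator as one grammar, which is why this is a workaround for inheritance rather than real modularity.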

@andreineculau I'm basically already doing this with a build step, so if you and others are just looking for something to generate useful parsers from a grammar with a dependency tree (where a single parser implementing the combined grammar is generated), I might clean what I have up and release it so the discussion can refocus on how to deal with this in a more permanent way.

Another thing: approaching this primarily by designing extensions to the grammar syntax misses something important, which is that one of the main reasons we all have the itch to pull in rules from other grammars (another being clarity) is the need to write parsers that share a lot of logic. So, while generated parsers might never be meaningfully re-composable at parse-time, it seems important that a tree of grammars generate a tree of parsers, rather than one monolithic parser. It's most important when a set of parsers will be part of a web UI, but it generally doesn't hurt to avoid unnecessary bloat in generated code.

@odonnell +1 for releasing anything - no matter if you have the time to clean it up

and +1 for the clarification. This should be treated as a quick workaround, not a long-term proper solution.

@odonnell my take on it is online at https://github.com/andreineculau/core-pegjs - please poke me if you have something better.

+1 for this feature

:+1:

:+1:

:+1:

I went and wrote a plugin/extension for PEG.js that does imports: https://github.com/casetext/pegjs-import.

+1 for this as well.

I implemented this in #308 in a generic way: inclusion of grammars is only one way to implement rule decomposition.

Great feature :+1:

Looking forward to seeing it released.

:+1:

Awesome! :+1:

@dmajda I'm coming late to this party, but I wonder how often we need to import many rules from another library. I would love to be able to import things like Url and Email into my composed grammars but I don't care that Url may also have things like HierarchicalPart and AsciiLetter. Do you think something like Node's named exports would be a viable way forward, keeping the benefits of namespacing but allowing direct named imports?

import { SchemalessUrl, Url } from "./Urls.pegjs"

Token
  = PhoneNumber
  / Url
  / SchemalessUrl

Namespacing has been an issue for me as I try and explore writing otherwise-composable grammars. I'm stuck right now including files in files and naming things the way PHP functions were named before they introduced proper namespaces: UrlIpHost, HtmlQuotedString, etc…
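One way to picture the named-import idea: the importing grammar sees only the names it asked for, while the rules those names depend on come along privately. The sketch below is illustrative only. `resolveNamedImports` is a hypothetical helper, and the dependency table for `Urls.pegjs` is invented, since the thread does not show that grammar.

```javascript
// Invented dependency table for an imagined Urls.pegjs: each rule name maps
// to the rule names that its expression references.
const urlRuleReferences = {
  Url: ["Scheme", "SchemalessUrl"],
  SchemalessUrl: ["Host", "HierarchicalPart"],
  Scheme: ["AsciiLetter"],
  Host: ["AsciiLetter"],
  HierarchicalPart: ["AsciiLetter"],
  AsciiLetter: [],
  Email: ["AsciiLetter", "Host"],
};

// Split the rules needed for a named import into the names the importer can
// see and the supporting rules that must be included but stay hidden.
function resolveNamedImports(references, importedNames) {
  const visible = new Set(importedNames);
  const needed = new Set();
  const pending = [...importedNames];

  while (pending.length > 0) {
    const name = pending.pop();
    if (needed.has(name)) continue;
    if (!Object.prototype.hasOwnProperty.call(references, name)) {
      throw new Error(`Unknown rule: ${name}`);
    }
    needed.add(name);
    pending.push(...references[name]);
  }

  return {
    visible: [...visible].sort(),
    hidden: [...needed].filter((name) => !visible.has(name)).sort(),
  };
}

console.log(resolveNamedImports(urlRuleReferences, ["SchemalessUrl", "Url"]));
```

Rules that nothing imported depends on (here `Email`) are left out entirely, and the hidden rules could be renamed internally so they never clash with names such as `AsciiLetter` in the importing grammar.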

@dmajda @futagoza

Any progress on this issue? Or is the primary discussion now living in #473?
My grammar file is growing very fast :(
It would be nice to split it into several files.

I wouldn't mind being able to split grammars between files, simply for organization and composition. It would make them easier to test and re-use, as well as providing a way to swap grammars dynamically, maybe? Just some thoughts.

The JavaScript example that I used as a base is over 1,300 lines. It took a while to learn where everything was, and jump around and edit different sections.

@mikeaustin I see this feature as something like Node.js's require:

cat bash.pegjs
{
const _ = require("whitespace");
const LB = require("line_break");
const CodeBlock = require("code_block");
const BoolExpr = require("boolean_expression");
}
...
IfStatement = "if" _ "[" BoolExpr "]" _ ";" _ "then" LB? CodeBlock "fi"

I agree, splitting grammars and making them modular is a great feature. However, handling these cases would be a problem:
1- a sub-grammar that relies on a global variable defined in the main grammar's code?
2- duplicate variable and rule names?

IMO, a temporarily convenient approach would be to create a new add-on for PEG.js (independent from PEG.js) that defines a keyword for importing, for example @load(anotherGrammarFileLocation). The keyword should not be part of the JavaScript/PEG.js grammar.
Build a regexp or a PEG grammar to detect that keyword, substitute it with the content of the file at anotherGrammarFileLocation, and send the substituted code to PEG.js.

Example:

integers.pegjs

integers = [0-9]+ { return parseInt(text(), 10); }

main.pegjs

arrayOfInteger = "[" (integers ",")* integers "]"
@load("integers.pegjs")

Note that with this method, if someone did not define the start rule and placed @load before "arrayOfInteger", PEG.js will assume that the first rule is the start rule (the integers rule).

One approach to handle this is to use the same name for the file and its start rule and let the new add-on configure the start attribute manually as the file name, or to substitute all loaded content at the end of the file.

The user should be responsible for any duplication.
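A small sketch of such a preprocessor, following the "substitute all content at the end of file" variant so that the main grammar's first rule stays the start rule. Everything here is hypothetical: `expandLoads`, the `readFile` callback, and the in-memory `files` table standing in for the file system. The grammars are a variant of the example above.

```javascript
// Matches a whole line of the form @load("path"), including its line break.
const LOAD_DIRECTIVE = /^[ \t]*@load\("([^"]+)"\)[ \t]*\r?\n?/gm;

// Strip @load directives from a grammar and append the loaded grammars after
// its own rules. `seen` skips files that were already included, which also
// stops include cycles.
function expandLoads(fileName, readFile, seen = new Set()) {
  if (seen.has(fileName)) return "";
  seen.add(fileName);

  const loaded = [];
  const ownRules = readFile(fileName).replace(LOAD_DIRECTIVE, (_, path) => {
    loaded.push(path);
    return "";
  });

  const parts = [ownRules.trim()];
  for (const path of loaded) {
    const expanded = expandLoads(path, readFile, seen);
    if (expanded !== "") parts.push(expanded);
  }
  return parts.join("\n\n");
}

// In-memory stand-in for the file system.
const files = {
  "main.pegjs":
    '@load("integers.pegjs")\narrayOfInteger = "[" (integers ",")* integers "]"\n',
  "integers.pegjs": "integers = [0-9]+ { return parseInt(text(), 10); }\n",
};
const readFile = (name) => {
  if (!(name in files)) throw new Error(`No such grammar file: ${name}`);
  return files[name];
};

console.log(expandLoads("main.pegjs", readFile));
```

Even though `@load` appears before `arrayOfInteger` in this `main.pegjs`, the expanded text starts with `arrayOfInteger`, so the start rule does not change. Duplicate rule names across files are not detected, which matches the suggestion that the user stays responsible for duplication.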

I just want to highlight that this issue is primarily an optimization request, because composability/modularity is something that you can achieve on your own, especially when you control the full spectrum of the grammar.

If you're not comfortable with a grammar 1k-lines long, then split it up, and concatenate it back as you see fit before pumping it into pegjs.
