Pegjs: [question] Advice on {cache: true} and handling reasonable out-of-memory case

Created on 22 Nov 2018  ·  3 Comments  ·  Source: pegjs/pegjs

Issue type

  • Question: yes

Prerequisites

  • Can you reproduce the issue?: yes
  • Did you search the repository issues?: yes, looked at cache in the GitHub issues, found some issues on cache but not what I'm asking for.
  • Did you check the forums?: yes, tried googlegroups / cache, no related results
  • Did you perform a web search (google, yahoo, etc)?: yes

Description

I'm parsing a fairly heavy (500KB) piece of user-provided text using a ~1000-line grammar.

  • When passing {cache: true}...

    • ... and telling Node to use 3GB of heap (with --max-old-space-size=3000), the heap grows to 2.5GB, and parsing succeeds in 12s.

    • ... and leaving Node 10 default to 800MB of heap, parsing crashes with an OOM.

  • When passing {cache: false}, as expected, parsing clocks in slightly faster at 10s (non-pathological case) and doesn't balloon memory usage.

This is user data and my server resources are limited, so bumping Node to use X GB of heap isn't an option, as tomorrow I might get 1MB of user data that would require X+1 GB of heap. And of course I'd like to keep using {cache: true} when possible, to "avoid exponential parsing time in pathological cases", which I've encountered.

What approach do you recommend?

  • Is there anything built into PEG.js to bail out when memory usage becomes critical?
  • My attempts at handling this with timeouts are not great, as memory usage might grow faster than the timeout.
  • As far as I know it's impossible to intercept a Node OOM.
  • Finally, I'm considering switching usage of {cache: true} based on the size of the input. That will cost me more CPU usage, but at least I won't OOM.
  • Other ideas?

Thanks for PEG.js! 🙂

Software

  • PEG.js: 0.10.0
  • Node.js: 10.13.0
  • NPM or Yarn: npm 6.4.1
  • Browser: N/A
  • OS: AWS Linux
Labels: performance, question

All 3 comments

Exponential parsing times only happen in very pathological cases, and I'd recommend rewriting the grammar there.
Consider https://github.com/sirthias/pegdown/issues/43#issuecomment-18469752
(I'm not a contributor)

As @polkovnikov-ph pointed out, it is best to rewrite the parts of your grammar that deal with pathological cases. But if you keep hitting OOM cases, it might be better to do what you (@ronjouch) suggested and toggle the _cache_ option based on the size of the input, e.g. `cache: input.length < 250000`.
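A minimal sketch of that toggle, i.e. keep the packrat cache on for small inputs and drop it for large ones (the 250000-character threshold is arbitrary; tune it against your own heap budget):

```javascript
// Enable the packrat cache only for inputs small enough that the cache's
// memory footprint fits in the heap; larger inputs trade CPU time
// (risking exponential worst cases) for bounded memory.
function parseOptions(input, threshold = 250000) {
  return { cache: input.length < threshold };
}

// Usage (assuming `parser` is the PEG.js-generated module):
// const ast = parser.parse(input, parseOptions(input));
```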

After this (and only if you have access to the user-provided text), I would suggest examining any input that hits an OOM to locate common pathological patterns, then updating your grammar to handle them explicitly, so you can reduce the number of OOM cases hitting your app.

If you are still hitting OOM cases often, and are willing to not only rewrite your grammar but also add an extra pass (or a few) to your toolchain, I'd suggest trying one of these methods:

  • split large inputs, update your grammar to handle partial pieces of the syntax you are parsing, then join everything once all the pieces are parsed (this is most likely only viable if your generated parser returns an AST; not sure otherwise)
  • use a regex to detect syntax in the user-provided text that could lead to pathological cases, then route the input to one of two generated parsers: a normal one and one that handles pathological cases
  • you can always use the PEG.js grammar to generate a parser that behaves like a tokenizer, and build on top of it a parser that can optimally handle both normal inputs and inputs containing syntax that leads to pathological cases (this requires investing more time in learning about parsers, and somewhat defeats the purpose of a parser generator, but _is nearly always a viable option if you know what you are doing_)
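The first option above could be sketched like this, assuming (purely for illustration) that the grammar's top-level units are separated by blank lines and that the generated parser returns an array of AST nodes per chunk:

```javascript
// Split the input at boundaries the grammar guarantees (blank lines here,
// as an assumption), parse each piece independently with its own bounded
// cache, and concatenate the resulting AST fragments.
function parseInChunks(input, parse) {
  const chunks = input.split(/\n{2,}/).filter((c) => c.trim() !== "");
  return [].concat(...chunks.map((chunk) => parse(chunk, { cache: true })));
}

// Usage (assuming `parser` is the PEG.js-generated module):
// const ast = parseInChunks(input, (text, opts) => parser.parse(text, opts));
```

Per-chunk caching keeps the cache's memory bounded by the size of the largest chunk instead of the whole input.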

@polkovnikov-ph @futagoza thanks to both of you for taking the time to come back with advice 👍! That makes sense. I deployed the size workaround, and will consider rewriting the grammar next time trouble knocks at the door. Good day; closing the question.
