Nltk: chomsky_normal_form() for grammars

Created on 13 Nov 2017  ·  4Comments  ·  Source: nltk/nltk

nltk.tree.Tree has a chomsky_normal_form() function, but grammars don't. Since CNF is a form of the grammar, it should, also.

enhancement nice idea parsing

All 4 comments

The chomsky_normal_form() in NLTK is a tree-binarization function. I think it can't be directly applied to grammars, see https://github.com/nltk/nltk/blob/develop/nltk/treetransforms.py

Grammar transformation to CNF is rather complex and hasn't yet been implemented. It would be good if there's an attempt at it, but it might not be trival.

If anyone is interested in contributing, a good algorithm to start out is

  • Lange and Leiß (2009). "To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm"

I was looking around for exactly this and stumbled here,
I've created this functionality by exploiting nltk internals for one of my projects (and a CKY parser on top of it as well, although without null handling, but I believe its useful for a lot of grammars anyways, for eg the ATIS test grammar), would be happy to send a PR, just wanted to confirm if this issue still holds and is available to work on (especially after the last comments in https://github.com/nltk/nltk/issues/1722)

Also if there's no CYK parser as mentioned in the other issue, I'd be happy to send a PR for that as well

@virresh @alvations Sorry for letting #1722 get stale. Life has gotten in the way.

@virresh If you'd like to take a wack at this be my guest. I wrote a conversion to CNF years ago.

https://github.com/aetilley/pcfg/blob/master/src/pcfg.py#L524

I remember it reordering the steps in the usual algorithm but I also remember convincing myself that they were equivalent. Proceed with caution.

I'm going to close the other issue.

There is a problem with CNFs for CFGs; they are returning duplicated productions. You can test this by:

import nltk
grammar = nltk.data.load("grammars/large_grammars/atis.cfg")
grammar = grammar.chomsky_normal_form()
print(len(grammar.productions()))
print(len(list(set(grammar.productions()))))

grammar has 20344 productions, which when converted to a set give 12396 productions.

Was this page helpful?
0 / 5 - 0 ratings