NLTK: Improve tokenization of multi-word expressions by including "python partitioner"

Created on 11 Dec 2018  ·  9 comments  ·  Source: nltk/nltk

I suspect that @jakerylandwilliams & @andyreagan's https://github.com/jakerylandwilliams/partitioner could significantly improve the tokenization quality of NLTK, specifically when it comes to MWEs (Multi Word Expressions).

@NeelShah18 recently ported it to Python 3:

https://github.com/jakerylandwilliams/partitioner/pull/7

So, including it in NLTK seems easy enough.

For more information on the approach used there, see here:

https://noisy-text.github.io/2017/pdf/WNUT01.pdf

And here:
https://arxiv.org/abs/1710.07729

It's Apache 2.0 licensed, so the licenses seem compatible as well.
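
For context, NLTK already ships a dictionary-driven MWETokenizer, which merges only the token sequences the user supplies; here is a minimal sketch of that existing behaviour (the example phrases are illustrative), against which partitioner's gazetteer-driven approach would be an improvement:

```python
from nltk.tokenize import MWETokenizer

# NLTK's existing MWE support: it merges pre-specified token sequences.
# The MWE list must be supplied by the user; partitioner would instead
# bring its own large, automatically built gazetteers.
tokenizer = MWETokenizer([("New", "York"), ("kick", "the", "bucket")],
                         separator=" ")
print(tokenizer.tokenize("He moved to New York last year".split()))
# ['He', 'moved', 'to', 'New York', 'last', 'year']
```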

Labels: enhancement, nice idea, tokenizer

All 9 comments

A small addendum to this, since:
a) I neglected to mention it in the initial post; and
b) it seems worth mentioning:
Python partitioner already makes use of NLTK.

Thanks for suggesting partitioner; I hadn't seen it before. Based on the paper, it looks like it performs MWE segmentation relying on n-gram probabilities from MWE-labeled training data and on large lexical resources (mainly extracted from Wiktionary/Wikipedia). Unlike most statistical approaches, it avoids expensive computation, essentially deferring most of the work to frequency counts and dictionary lookups. The tool supports 18 PARSEME languages, including English and a variety of European languages.
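
To make the "counts and lookups" idea concrete, here is a rough illustrative sketch (not partitioner's actual code) of a greedy longest-match segmenter over a toy phrase-frequency dictionary; the real model scores whole partitions probabilistically, as described in the WNUT 2017 paper:

```python
# Illustrative sketch only: a greedy longest-match segmenter over a
# phrase-frequency dictionary, mimicking the "counts and lookups"
# flavour of the approach. Both the dictionary and the algorithm are
# simplifications for exposition.
PHRASE_COUNTS = {"new york": 120000, "new york city": 45000}  # toy gazetteer

def greedy_partition(tokens, max_len=4):
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate n-gram first, fall back to shorter ones;
        # single tokens always match, so the loop always advances.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + n]).lower()
            if n == 1 or cand in PHRASE_COUNTS:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
    return out

print(greedy_partition("I love New York City in spring".split()))
# ['I', 'love', 'New York City', 'in', 'spring']
```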

If this is to be added to NLTK, how large would the data have to be? The partitioner repo is >100 MB. If there are large data files, I assume the user would have to use nltk.download() to fetch them.
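
Presumably the data would follow the usual NLTK resource pattern; a sketch of that pattern, where the package id `partitioner_data` is purely hypothetical:

```python
import nltk

# Standard NLTK pattern for large optional resources. The package id
# "partitioner_data" is hypothetical; no such package exists yet.
nltk.download("partitioner_data")

# Installed resources are then located via NLTK's data path machinery,
# e.g. (hypothetical resource path):
# gazetteer_path = nltk.data.find("tokenizers/partitioner_data/en.tsv")
```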

It must take time to load the large data resources into memory in order to run the system. Is it just a couple of seconds, or longer?

Note that this goes well beyond standard "tokenization" in terms of orthographic lexical units, so it is not a substitute for basic tokenization or lemmatization (#1214).

Unfortunately, I'll have to pass on these questions due to time constraints and a lack of operational experience with partitioner, at least for the foreseeable future. Sorry! But perhaps @jakerylandwilliams or @andyreagan can answer them.

Thanks @no-identd and @nschneid for reaching out; I'm glad the module's of interest. We're presently working on some back-end, data, and model improvements for Python 3. If bringing the current version into NLTK makes sense, I think it would be fairly straightforward to implement.

@nschneid, your assessment of the model is correct. Loading the large data files can take a few seconds; the only lag on load that I've seen is for the EN Wikipedia data set, but this resource can be omitted at relatively little cost to performance, giving an essentially instant load. It would probably make sense to have the default EN load omit Wikipedia.
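
In code, that default might look something like the following; the constructor and keyword argument here are hypothetical, shown only to make the proposal concrete:

```python
# Hypothetical interface, not partitioner's actual API: the English
# model would load the Wiktionary-derived gazetteers by default and
# pull in the large Wikipedia resource only on explicit request.
from partitioner import Partitioner  # hypothetical import

part = Partitioner(language="en", use_wikipedia=False)   # near-instant load
part_full = Partitioner(language="en", use_wikipedia=True)  # slower load
print(part.partition("She lives in New York City"))
```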

I'm happy to carry forward the discussion and field any other questions.

@jakerylandwilliams @nschneid If we omit Wikipedia and use NLTK's default downloader, then it is compatible with both Python 2 and Python 3. I can help make the partitioner code platform-independent across Python 2 and Python 3.

Actually, if https://github.com/jakerylandwilliams/partitioner is already a working Python package, there might not be a need to port/reimplement the code. Users can easily choose to use the tokenizer directly from partitioner.

If we want the "good stuff" like MWEs, we can take the gazetteers from partitioner and package the MWE resource somehow, instead of porting the whole partitioner repo into NLTK. If the maintainers of partitioner want to maintain the code in NLTK instead of in their PyPI package, then I think it's worth the effort of porting code from third-party Python libraries.
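
Concretely, the extracted gazetteers could feed NLTK's existing MWETokenizer without porting any partitioner code; here is a sketch assuming a plain-text gazetteer with one MWE per line (the filename and format are hypothetical):

```python
from nltk.tokenize import MWETokenizer

# Sketch: load gazetteer entries (assumed here to be a plain-text file
# with one multiword expression per line; both the format and the
# filename are hypothetical) into NLTK's existing MWETokenizer.
tokenizer = MWETokenizer(separator=" ")
with open("en_gazetteer.txt", encoding="utf-8") as f:
    for line in f:
        mwe = tuple(line.strip().split())
        if len(mwe) > 1:
            tokenizer.add_mwe(mwe)

print(tokenizer.tokenize("He kicked the bucket".split()))
```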

@alvations I agree with this suggestion. But for an NLTK implementation we would have to rewrite it to follow NLTK's structure and add tests. We would also need the code to be portable across Python 2 and Python 3 to meet NLTK's library coding norms.

@alvations and @NeelShah18, I agree that pulling out and re-packaging the gazetteers and MWE segmentation resource per NLTK structure and coding norms would make the most sense for integration. There are a few models and utilities inside of https://github.com/jakerylandwilliams/partitioner and the one probably best suited for NLTK was mentioned at the top of the thread by @no-identd:

https://noisy-text.github.io/2017/pdf/WNUT01.pdf

If this is of interest, I can certainly help with the execution of some of the necessary tasks.

🤔
