NLTK: Discussion: Resurrecting the Ngram Model

Created on 25 Mar 2016 · 21 Comments · Source: nltk/nltk

Hi folks!

I'm working on getting the Ngram Model module added back into NLTK and would like to bring up a couple of issues for discussion.

Issue 1
Here @afourney said it would be nice to add interpolation as an alternative to the default Katz backoff for handling unseen ngrams. I've been thinking about this and have an idea for how it could work; I'd like to run it by all interested parties.

The current class structure of the model module is as follows:

  • model.api.ModelI -> this is supposed to be an Abstract class or an Interface, I guess.
  • model.ngram.NgramModel -> extends above class, contains current implementation of the ngram model.

Here's what I propose (sketched in code after the goals below):

  • model.api.Model -> I'm honestly not sure I see the point of this; I'm ambivalent about whether to keep it or ditch it.
  • model.ngram.BasicNgramModel -> This is the same as NgramModel, minus everything that has to do with backoff. Basically, it can't handle ngrams unseen in training. "Why have this?", you might ask. I think this would be a great demo of the need for backoff/interpolation: students can simply try it out and see how badly it performs to convince themselves to use the other classes.
  • model.ngram.BackoffNgramModel -> Inherits from BasicNgramModel to yield the current implementation of NgramModel, except that it's more explicit about the backoff part.
  • model.ngram.InterpolatedNgramModel -> Also inherits from BasicNgramModel, but uses interpolation instead of backoff.

The long-term goals here are:

a) to allow any ProbDist class to be used as a probability estimator since interpolation/backoff are (mostly) agnostic of the smoothing algorithm being used.
b) to allow anyone who wants to _optimize_ an NgramModel for their own purposes to be able to easily inherit some useful defaults from the classes in NLTK.
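
To make that concrete, here's a rough sketch of the hierarchy (only the class names come from the proposal above; the prob() signature and the elided bodies are placeholders, not a final API):

```python
class BasicNgramModel:
    """Plain ngram model: scores only ngrams seen in training.

    Deliberately naive, so students can see for themselves why
    backoff/interpolation are needed.
    """

    def prob(self, word, context):
        raise NotImplementedError  # straight ProbDist lookup, no fallback


class BackoffNgramModel(BasicNgramModel):
    """The current NgramModel behaviour, with the Katz backoff made explicit."""

    def prob(self, word, context):
        raise NotImplementedError  # seen? score it; unseen? back off to order n-1


class InterpolatedNgramModel(BasicNgramModel):
    """Always mixes this order's estimate with the lower-order ones."""

    def prob(self, word, context):
        raise NotImplementedError  # weighted mix across all orders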

Issue 2
Unfortunately the probability module has its own problems (e.g. #602, and (my) Kneser-Ney implementation is wonky). So for now I'm only testing correctness with LidstoneProbDist, since it is easy to compute by hand. Should I be worried about the lack of support for the more advanced smoothing methods? Or do we want to proceed this way to ensure at least that the Ngram Model works, and _then_ tackle the problematic probability classes separately?
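
For reference, Lidstone is easy to verify by hand because P(w) = (c(w) + gamma) / (N + gamma * B); a quick sanity check against the existing class (toy data, of course):

```python
from nltk import FreqDist
from nltk.probability import LidstoneProbDist

fd = FreqDist("abracadabra")      # a:5, b:2, r:2, c:1, d:1 -> N = 11, B = 5
lid = LidstoneProbDist(fd, 0.1)   # gamma = 0.1

# P("a") = (5 + 0.1) / (11 + 0.1 * 5)
assert abs(lid.prob("a") - 5.1 / 11.5) < 1e-12
```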

Python 3 super()
When calling super(), do I need to worry about supporting python 2? See this for context.
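
For reference, the two spellings in question (the class here is just for illustration):

```python
class NgramModel(object):
    def __init__(self, order):
        # Python 3 only:
        #   super().__init__()
        # Works on both Python 2 and 3:
        super(NgramModel, self).__init__()
        self.order = order
```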

Labels: corpus, enhancement, language-model, nice idea, tests

All 21 comments

It would be nice to have a working n-gram library in NLTK. SRILM has some Python wrappers for inference, but it has a restrictive license. KenLM has a Python wrapper for inference, but it requires extra dependencies to compile. Neither supports estimation. So currently there are no well-tested n-gram tools available for Python NLP.

@anttttti Thanks for the feedback, I feel very motivated to submit a patch seeing all this demand for the feature :)

Do you happen to have any thoughts about the specific issues I posted?

The advanced smoothing methods are simple to implement once you understand that they differ only in how the discounting and interpolation are defined. Earlier papers and much of the textbook coverage make the models seem more complicated than they are, because the connections between them weren't well understood at the time. There shouldn't be a need for separate modules, just configuration of the smoothing. The older backoff models that were not correctly normalized are not used these days; see Joshua Goodman's "A Bit of Progress in Language Modeling" for a great summary. Page 63 of http://arxiv.org/pdf/1602.02332.pdf summarizes some choices of discounting and interpolation for the unigram case; higher-order models use the same recursively. Kneser-Ney is a bit more tricky with the modified backoffs.

Smoothing is not that critical for most uses: with enough data, even optimized Kneser-Ney isn't better than Stupid Backoff. So just having high-order n-grams available in Python with any basic smoothing would be nice; Lidstone or Jelinek-Mercer at each order should work perfectly fine.
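
To make the recursion concrete, here is a minimal sketch of Jelinek-Mercer across orders (the flat count-table layout and the fixed lambda are assumptions of the sketch, not a proposed API):

```python
def jm_prob(word, context, counts, vocab_size, lam=0.5):
    """P(word | context) by Jelinek-Mercer: interpolate the maximum-likelihood
    estimate at this order with the same estimate one order lower.

    `counts` maps ngram tuples to frequencies; counts[()] is the total
    token count, and counts[context] is the context's own frequency.
    """
    if not context:  # base case: unigram ML, interpolated with uniform
        ml = counts.get((word,), 0) / counts[()]
        return lam * ml + (1 - lam) / vocab_size
    denom = counts.get(context, 0)
    ml = counts.get(context + (word,), 0) / denom if denom else 0.0
    return lam * ml + (1 - lam) * jm_prob(word, context[1:], counts, vocab_size, lam)


counts = {(): 4, ("the",): 2, ("cat",): 1, ("sat",): 1, ("the", "cat"): 1}
print(jm_prob("cat", ("the",), counts, vocab_size=3))
```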

Issue 1) One thing that I think would be very useful is a utility for building a vocabulary and censoring OOV tokens. That would prevent many of the silly errors that frustrated users of the old versions. I am attaching some code that does that (feel free to use or copy):
lm.txt
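
In spirit it does something like the following (a rough sketch, not the attachment itself; the count cutoff and the <UNK> symbol are illustrative choices):

```python
from collections import Counter

def build_vocab(sentences, min_count=2, unk="<UNK>"):
    """Keep tokens seen at least `min_count` times; everything else is OOV."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count} | {unk}

def censor(sentence, vocab, unk="<UNK>"):
    """Replace out-of-vocabulary tokens so the model never sees raw OOVs."""
    return [tok if tok in vocab else unk for tok in sentence]
```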

Issue 2a) I think that it's still useful to have Kneser-Ney; it's commonly taught and it's useful to have a reference implementation.
Issue 2b) I worry that coupling ProbDist makes this far more complicated than it needs to be. It might be simpler to keep the probability estimation within the language model for things like KN. For other models, it might be fine to use ProbDist.

@anttttti "_The advanced smoothing methods are simple to implement once you understand that they only differ in how the discounting and interpolation are defined_"

@ezubaric "_Issue 2b) I worry that coupling ProbDist makes this far more complicated than it needs to be_"

Though I haven't looked at this code in a while, my sense is that both of these statements are true.

If I recall correctly, ConditionalProbDist (and more generally ProbDist) are normalized too early for use in smoothed ngram models. E.g., while we know how likely a word is in a given context, we have a hard time reasoning about the contexts themselves (I believe an earlier patch attempted to correct this issue -- despite best efforts, it was a bit kludgy [https://github.com/nltk/nltk/pull/800]).

IMHO, the whole thing is slightly over-engineered.

@afourney

IMHO, the whole thing is slightly over-engineered.

Amen to that! I've been trying to make this work forever now (I submitted #800 and yeah, it wasn't elegant at all), and I'm also starting to think there are just too many moving parts for it to be reasonable.

@ezubaric thanks a bunch for that file, I'm totally borrowing its spirit for the refactoring.

Based on all this feedback, here's my new take on the module structure. We have just one class: model.ngram.NgramModelCounter.

This is basically several FreqDist counters combined behind a clear interface. _Training_ simply consists of recursively counting ngrams of each order as well as keeping track of the vocab size (with the option of "finalizing" some of these counts to prevent updates after training is done). @alvations I know you'd like a trie implementation for this, but I think we can start with an inefficient recursive counter for now and refactor the backend later, since it shouldn't affect the interface much.

Crucially, this class does not deal with probabilities at all. That should make things significantly simpler and at the same time more flexible. All anyone needs to do to add probabilities is use their favorite OOP method (e.g. inheritance or composition) to write a class that uses NgramModelCounter's attributes to construct its own prob() method.
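
Roughly, the counter I have in mind (a sketch only; everything besides the class name is a placeholder):

```python
from collections import Counter
from nltk.util import ngrams

class NgramModelCounter:
    """Counts ngrams of every order up to `order`; knows nothing about probabilities."""

    def __init__(self, order):
        self.order = order
        self.vocab = set()
        self.counts = {n: Counter() for n in range(1, order + 1)}

    def train(self, sentences):
        for sent in sentences:
            self.vocab.update(sent)
            for n in range(1, self.order + 1):
                self.counts[n].update(ngrams(sent, n))
```

A probability class would then, for example, read counts[n][ngram] and counts[n - 1][ngram[:-1]] to build its prob().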

If I have time I'll submit one (or two!) examples of how adding probabilities to NgramModelCounter could work.

What do you folks think?

@Copper-Head having similar interface to KenLM as much as possible would be good for future integration: https://github.com/kpu/kenlm/blob/master/python/example.py

I think after a stable version of NgramModel in NLTK is up, I can try to refactor the kenlm wrapper to use a similar interface to the NLTK one, like what we did for scikit-learn.

This function would help in the padding too: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L381
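
For example (if I'm reading the utility right, nltk.util.ngrams accepts the same padding arguments directly):

```python
from nltk.util import ngrams

sent = ["the", "cat", "sat"]
print(list(ngrams(sent, 2, pad_left=True, pad_right=True,
                  left_pad_symbol="<s>", right_pad_symbol="</s>")))
# [('<s>', 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', '</s>')]
```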

I think what @Copper-Head is suggesting is a class that counts unigrams, bigrams, trigrams, etc. in a coordinated way that is convenient for downstream language models to consume. In that case, I think the kenlm API does not apply yet. (I may be wrong, but from the example posted, it doesn't look like the kenlm API deals in raw frequency counts.)

I think it is also worthwhile considering a minimal language model API that consumes those ngram counts. As @Copper-Head suggests, this would be a subclass, or better yet, a completely separate interface (allowing for vastly different implementations like https://www.projectoxford.ai/weblm). Here, I think it may be reasonable to adopt the kenlm API, but think _any_ ngram LM interface ought to be simple enough that adapters can be easily written.

I think a minimal ngram API really only needs methods to (1) compute the conditional probability of a token given a context or sequence, and (2) report on the size and makeup of the known vocabulary. Most everything else can be computed via helper methods, including computations of joint probability, as well as language generation. These helpers may or may not be part of the interface.
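
Concretely, that minimal interface could be as small as this (a sketch; all names are placeholders):

```python
import math
from abc import ABC, abstractmethod

class NgramLanguageModelI(ABC):
    """Minimal LM interface; the two abstract methods are the whole contract."""

    @abstractmethod
    def prob(self, word, context):
        """Conditional probability P(word | context)."""

    @abstractmethod
    def vocab(self):
        """Return the set of known word types."""

    # Everything else can be a helper built on prob(), e.g. joint log-probability:
    def logprob(self, words, context=()):
        total, ctx = 0.0, tuple(context)
        for w in words:
            total += math.log(self.prob(w, ctx), 2)
            ctx += (w,)
        return total
```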

Hmm, interesting point. I wonder, though, whether keeping track of those counts for Good-Turing might slow training down, unnecessarily so for folks who don't want that particular smoothing. I think it might make more sense to do the minimum in the basic NgramCounter class and then simply extend its training (or __init__) method in a subclass specialized for Good-Turing, or even in an implementation of the ngram API geared towards computing Good-Turing probabilities.
But I'm only just sitting down to write some of this stuff up, so maybe it won't end up being a problem in the end.

Sorry, it looks like I accidentally deleted a post. To fill in the missing context for future readers: I think it would be good to consider common smoothing techniques when designing the NgramModelCounter API. For example, allowing users to query the number of _species_ observed once, twice, or N times is important for Good-Turing smoothing (as well as Witten-Bell smoothing, etc.)

Edit: It looks like the FreqDist class already has some of this (see FreqDist.hapaxes and FreqDist.r_Nr). I wonder if it can be re-purposed, or if FreqDist should be the starting point.
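
For example:

```python
from nltk import FreqDist

fd = FreqDist("abracadabra")  # a:5, b:2, r:2, c:1, d:1
print(fd.hapaxes())  # species seen exactly once: ['c', 'd'] (order may vary)
print(fd.r_Nr())     # maps r -> Nr, e.g. 2 species occur twice, 1 occurs five times
```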

I like the idea of just having a counts object which can then be queried with subclasses that implement specific smoothing methods.

My only concern is that training will have issues if we don't have the ability to fix the vocabulary early: it won't be consistent with standard LM training processes, and tracking the full vocabulary would cause memory to blow up (which was a huge problem with the old LM too).

Noted. I have ideas for how to address this. I'll be posting a PR later today.

PR #1351 is up!! Bring on the questions/nitpicks :)

@Copper-Head – how far are we away from being able to merge this back into the main branch?

Looking at my to-do list, I'd say I need 2-3 days of focused work.
Considering that I'm back to working on this in my free time from school and day job, I'd give it anywhere between 2 weeks and a month before I'm done with all _my_ outstanding issues. This naturally doesn't take into account random stuff that might be brought to my attention in that time.

@Copper-Head @jacobheil and NLTK users/devs who are interested in N-gram language models.

Just wanted to check in on the current state of the model submodule.

  • Do you think it's ready to push it out into the develop/master branch?
  • Is it still a topic that people actively pursue and want to see on NLTK?

I think it's definitely worth having in NLTK; it's a core part of the curriculum when I teach NLP.

Is NLTK supporting deep LMs now? Is this API compatible with that?


Jordan Boyd-Graber

Voice: 920.524.9464
[email protected]

http://boydgraber.org

Hi, I would like to use the "old" language model feature in NLTK. What is the latest version that still has the pre-trained language model (for English)?

For those finding this thread, I have kind of bodged together a submodule containing the old model code.

https://github.com/sleepyfoxen/nltk_model

@stevenbird I think we can close this finally :)

For concrete feedback on the existing implementation, folks can open separate issues.

@Copper-Head yes I agree. Congratulations! :)
