Nltk: Error in NgramModel backoff smoothing calculation?

Created on 7 Mar 2013  ·  18 comments  ·  Source: nltk/nltk

I believe there is an error in how backoff smoothing is calculated in NgramModel.

  • Consider the sequence of "words": aaaababaaccbacb with words ['a','b','c']
  • Build a bigram model (n=2). For simplicity use LidstoneProbDist smoothing
  • Notably, this sequence does not contain every bigram beginning with 'b' or 'c'. Thus backoff is required to get the probabilities of the bigrams 'bb', 'bc', and 'ca'
  • For context = ['a'], model.prob(w,context) looks good for all words and sums to 1
  • For context = ['b'], model.prob(w,context) doesn't look right; the values sum to more than 1

I thought that the backoff calculation should do the following, say for the context 'b':

  • Calculate the total "missing" probability for unseen values in the 'b' context (i.e. 'bb' and 'bc') at the bigram level; call this Beta2
  • Calculate the total unigram probability for those same unseen values (i.e. 'b' and 'c'); call this Beta1
  • return (Beta2 / Beta1) * backoff.prob()

This essentially swaps in the unigram probabilities for those words that were unobserved in the bigram context, scaled appropriately, to fill in the missing probability mass.
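To make this concrete, here is a rough sketch of the proposed calculation for a single context. This is not NLTK code; backoff_prob, bigram_pd, unigram_pd, vocab, and observed are all hypothetical names standing in for the model's pieces:

# Rough sketch of the proposed backoff calculation for one context, e.g. 'b'.
# bigram_pd:  smoothed distribution P(w | context) from the higher-order model (hypothetical)
# unigram_pd: smoothed unigram distribution P(w) used as the backoff           (hypothetical)
def backoff_prob(word, vocab, observed, bigram_pd, unigram_pd):
    unseen = [w for w in vocab if w not in observed]    # e.g. ['b', 'c'] for context 'b'
    beta2 = sum(bigram_pd.prob(w) for w in unseen)      # "missing" mass at the bigram level
    beta1 = sum(unigram_pd.prob(w) for w in unseen)     # unigram mass of those same words
    # only used when (context, word) was never observed
    return (beta2 / beta1) * unigram_pd.prob(word)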

Am I missing something? The code in NgramModel seems to do something rather different, and I couldn't make sense of it.

language-model

All 18 comments

I believe you are correct that this is indeed a bug.

If we assume this setup:

from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist

word_seq = list('aaaababaaccbacb')
words = ['a', 'b', 'c', '']   # '' is the padding symbol the model adds

est = lambda freqdist, bins: LidstoneProbDist(freqdist, 0.2, bins=bins)
model = NgramModel(2, word_seq, True, True, est, 4)   # pad_left, pad_right, estimator; 4 is passed to est as bins

We can see the discrepancies pretty quickly:

sum(model.prob(w, ['b']) for w in words)
Out[150]: 2.4583333333333335
sum(model.prob(w, ['a']) for w in words)
Out[151]: 1.0

[(w, model.prob(w, ['b'])) for w in words]
Out[152]: 
[('a', 0.6666666666666667),
 ('b', 0.875),
 ('c', 0.6666666666666667),
 ('', 0.25)]

[(w, model.prob(w, ['a'])) for w in words]
Out[153]: 
[('a', 0.47727272727272724),
 ('b', 0.25),
 ('c', 0.25),
 ('', 0.022727272727272728)]

When I was working on the NgramModel a while ago, I remember also finding the way that back-off was implemented a little confusing. Now that I haven't looked at it in a long time, I've lost what intuitive understanding I had of how it worked. It seems to me that we claim we're implementing Katz Back-off, but the calculations are a bit different from the ones on Wikipedia.
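For reference, the standard Katz back-off definition (essentially what the Wikipedia article gives, where P* is the discounted higher-order estimate and c' is the context with its first word dropped) can be written as:

P_{bo}(w \mid c) =
\begin{cases}
  P^{*}(w \mid c) & \text{if } C(c, w) > 0 \\
  \alpha(c)\, P_{bo}(w \mid c') & \text{otherwise}
\end{cases}
\qquad
\alpha(c) = \frac{1 - \sum_{w:\, C(c, w) > 0} P^{*}(w \mid c)}{1 - \sum_{w:\, C(c, w) > 0} P_{bo}(w \mid c')}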

I believe it's because the LidstoneProbDist.discount function called from NgramModel._beta already takes the summing into account, but I'll have to look into it more.

def _alpha(self, tokens):
    # the ratio of this model's _beta to the backoff model's _beta for the shortened context
    return self._beta(tokens) / self._backoff._beta(tokens[1:])

def _beta(self, tokens):
    # the ProbDist's discount() if this context was seen, otherwise 1
    return (self[tokens].discount() if tokens in self else 1)

It appears to me that the beta calculations are where things are going wrong, because beta at the bigram level is much larger than beta at the unigram level, which makes the ratio, alpha, much greater than 1.

model._beta(('b',))
Out[154]: 0.16666666666666669
model._backoff._beta(())
Out[155]: 0.05063291139240506
model._alpha(('b',))
Out[155]: 3.291666666666667

I've also ruled out the possibility that the LidstoneProbDist itself is the problem:

[(w, model._model[('b',)].prob(w)) for w in words]
Out[159]: 
[('a', 0.6666666666666667),
 ('b', 0.04166666666666667),
 ('c', 0.04166666666666667),
 ('', 0.25)]

sum([model._model[('b',)].prob(w) for w in words])
Out[161]: 1.0

I'll try to figure out how all of these parts are interconnected again and see if I can fix this. Although if anyone else wants to jump in (like @desilinguist), I'd appreciate another set of eyes on this.

Hi, and thanks for checking into this. Just a few more thoughts:

First, one thing that is confusing is that there are different notions of "discounting". There is the discounting achieved by the various smoothing methods. For example, simple Laplacian (add-one) smoothing discounts the probability of observed words and shifts that mass over to the unobserved words. The discount() function being called in the _beta function belongs to the smoothing done by the ProbDist, and is not (I don't think) relevant to the backoff smoothing. I think the backoff notion of discounting has to do with the probability of the subset of words that are "missing" (unobserved) for a given context in the higher-order model.
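To illustrate the first, smoothing-level notion of discounting in isolation (this toy snippet uses only LidstoneProbDist and an arbitrary count example, nothing from NgramModel):

from nltk.probability import FreqDist, LidstoneProbDist

fd = FreqDist('aaab')                    # counts: a=3, b=1, N=4
pd = LidstoneProbDist(fd, 0.2, bins=3)   # pretend the vocabulary is {a, b, c}

# Add-0.2 smoothing: P(w) = (count(w) + 0.2) / (N + 0.2 * bins)
print(pd.prob('a'))   # 3.2 / 4.6, discounted below the MLE of 3/4
print(pd.prob('c'))   # 0.2 / 4.6, the mass shifted onto the unseen word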

So, I've modified the code for my own purposes to do what I think is the right thing, and I've shared some snippets below. Basically, I identify the subset of words that are missing in the model for a given context, and for that subset I calculate the total probability of these "missing" words along with the corresponding quantity in the backoff model. The ratio is "alpha", and note that this is a function of the context. I think this implementation corresponds to what's in the Wikipedia link you provided. Also, the _beta function is no longer used in my case.

Hope this is useful for the discussion. Thanks again.

    # (Code fragment for calculating backoff)

    # Now, for Katz backoff smoothing we need to calculate the alphas
    if self._backoff is not None:
        self._backoff_alphas = dict()

        # For each condition (or context)
        for ctxt in self._cfd.conditions():
            pd = self._model[ctxt]  # prob dist for this context

            backoff_ctxt = ctxt[1:]
            backoff_total_pr = 0
            total_observed_pr = 0
            for word in self._cfd[ctxt].keys():  # the subset of words that we OBSERVED
                backoff_total_pr += self._backoff.prob(word, backoff_ctxt)
                total_observed_pr += pd.prob(word)

            assert 0 < total_observed_pr <= 1
            assert 0 < backoff_total_pr <= 1

            alpha_ctxt = (1.0 - total_observed_pr) / (1.0 - backoff_total_pr)

            self._backoff_alphas[ctxt] = alpha_ctxt

# Updated _alpha function, discarded the _beta function
def _alpha(self, tokens):
    """Get the backoff alpha value for the given context
    """
    if tokens in self._backoff_alphas:
        return self._backoff_alphas[tokens]
    else:
        return 1

Hi all, I just wanted to chime in on this discussion and point out that the problem is much worse than the probabilities simply failing to sum to 1.0.

Consider the following trigram example:

#!/usr/bin/python
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist

word_seq = ['foo', 'foo', 'foo', 'foo', 'bar', 'baz']

# Set up a trigram model, nothing special  
est = lambda freqdist, bins: LidstoneProbDist(freqdist, 0.2, bins)
model = NgramModel(3, word_seq, True, True, est, 3)

# Consider the ngram ['bar', 'baz', 'foo']
# We've never seen this before, so the trigram model will fall back
context = ('bar', 'baz',)
word = 'foo'
print "P(foo | bar, baz) = " + str(model.prob(word,context))

# Result:
# P(foo | bar, baz) = 2.625

Yup -- this conditional probability is > 1.0

The nasty part is that the more the models fall back, the more the probabilities become inflated.

The problem also becomes worse as we add more training examples!

word_seq = ['foo' for i in range(0,10000)]
word_seq.append('bar')
word_seq.append('baz')

est = lambda freqdist, bins: LidstoneProbDist(freqdist, 0.2, bins)
model = NgramModel(3, word_seq, True, True, est, 3)

# Consider the ngram ['bar', 'baz', 'foo']
# We've never seen this before, so the trigram model will fall back
context = ('bar', 'baz',)
word = 'foo'
print "P(foo | bar, baz) = " + str(model.prob(word,context))

# Result:
# P(foo | bar, baz) = 6250.125

As it stands, the NgramModel cannot be relied upon -- at least not with additive Lidstone smoothing.

@afourney: I believe this is intended (LidstoneProbDist has a SUM_TO_ONE = False attribute).

@afourney I agree that NgramModel cannot really be used until this is fixed. Unfortunately, I just haven't had time to take a stab at this recently.

@kmike SUM_TO_ONE is False for LidstoneProbDist because if you encounter an event that was not in the initial distribution and you didn't set the bins value to be the number of possible events, then it will not sum to one. But if used properly, it will indeed sum to one. The problem here is NgramModel's beta calculation, not LidstoneProbDist itself.
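A quick check of that claim, using the toy sequence from the top of this thread and bins set to the actual vocabulary size:

from nltk.probability import FreqDist, LidstoneProbDist

fd = FreqDist('aaaababaaccbacb')         # counts: a=8, b=4, c=3
pd = LidstoneProbDist(fd, 0.2, bins=3)   # bins = number of possible events

print(sum(pd.prob(w) for w in 'abc'))    # 1.0 -- the distribution itself is fine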

@kmike: Yeah, I noticed that SUM_TO_ONE was false. My concern was that the model was returning individual conditional probabilities (for single events) that were already greater than 1 -- before incorporating them into the summation.

@bcroy I think your solution is the right approach. Simply stated, _alpha performs two important tasks (a rough sketch of how they fit together follows this list):

  1. It renormalizes the backoff model for the given context so as to exclude words that are already accounted for by the current higher-order model.
  2. It scales the renormalized backoff model to "fit" the "missing" / discounted probability of the current _model.
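Here is that sketch: a hypothetical prob() combining the pieces, reusing the attribute names from the snippets above (self._cfd, self._model, self._backoff, self._alpha). This is illustrative, not the actual NLTK source:

# Hypothetical sketch, not the real NgramModel code.
def prob(self, word, context):
    context = tuple(context)
    if context in self._cfd and word in self._cfd[context]:
        # (context, word) was observed: use the smoothed higher-order estimate directly
        return self._model[context].prob(word)
    if self._backoff is not None:
        # unseen ngram: back off to the lower-order model, scaled by this context's alpha
        return self._alpha(context) * self._backoff.prob(word, context[1:])
    # lowest-order (unigram) model with nothing to back off to
    return self._model[context].prob(word)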

That being said, it would be nice if NgramModel also offered an interpolation strategy as an alternative to the backoff strategy. This would enable support for Jelinek-Mercer or Witten-Bell smoothing -- the latter of which I found to be simple and to work quite well. See: http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
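For illustration, here is a generic sketch of Jelinek-Mercer-style linear interpolation, independent of NLTK's API (interp_prob, p_hi, p_lo, and lam are hypothetical names; Witten-Bell would estimate the weight from counts rather than fixing it):

# Linear interpolation: mix the higher-order and lower-order estimates directly,
# instead of only falling back when the higher-order count is zero.
def interp_prob(word, context, p_hi, p_lo, lam=0.7):
    return lam * p_hi(word, context) + (1.0 - lam) * p_lo(word, context[1:])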

Can someone confirm that this is still an open bug?

Yes, I'm still getting P(foo | bar, baz) = 2.625

Hi everybody,

Is there any progress on this issue? Is it still an open bug? I am still getting P(foo | bar, baz) = 2.625, so the problem persists.

I think this is an important problem that should be fixed, because language models are used in almost all NLP applications.

Unfortunately, I have not had any time to look at the numerous issues with NgramModel, and I don't foresee myself being able to do so anytime soon. Until someone tackles these bugs, NgramModel has been removed from nltk.

Dan, thanks for the answer.

Just checking in for an update. I can see that some issues have been closed, but I just want to make sure: is it still far from being usable?

@ZeerakW Unfortunately, there has been little progress on the ngram models, and no one has committed to tackling this yet.

Entering 2016, and the 'ngram model' issue still has not seen any progress.

Folks, we can finally close this :)

Update 2018: I've graduated and started working, and the ngram issue still exists.

Yay!
