nltk.translate.bleu_score gives false result when ngram larger than maximum ngrams of given sentence

Created on 9 Dec 2016  ·  5 Comments  ·  Source: nltk/nltk

Given weights = [0.25, 0.25, 0.25, 0.25] (the default value),
sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c']) = 0,
while sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd']) = 0.7598.
Obviously the former score should be larger than the latter, or both scores should be 0.

All 5 comments

Which version of the code are you using?

$ python
>>> import nltk
>>> nltk.__version__

The BLEU implementation was fixed only recently, when #1330 was resolved. If you're using the develop branch of nltk, this should be the output:

>>> import nltk
>>> from nltk import bleu
>>> ref = hyp = 'abc'
>>> bleu([ref], hyp)
>>> from nltk import bleu
>>> ref, hyp = 'abc', 'abd'
>>> bleu([ref], hyp)

Since a string is a sequence of characters and nltk exposes sentence_bleu() at the top level as bleu, the code above is equivalent to:

>>> from nltk.translate.bleu_score import sentence_bleu
>>> sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c'])
>>> sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd'])
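
The equivalence rests only on how Python iterates over strings. As a minimal illustration that does not need nltk at all (the variable names simply mirror the session above):

```python
# Iterating over a string yields its characters one by one, so a
# function that consumes a sequence of tokens sees the same items
# whether it is given 'abc' or ['a', 'b', 'c'].
ref = hyp = 'abc'
print(list(hyp))        # ['a', 'b', 'c']
print([list(ref)])      # [['a', 'b', 'c']]
```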

To install the latest develop branch, try:

pip install

(Do note that the develop branch is subject to _more unexpected bugs_, and it is recommended that users install the master branch or an official release.)

On a related note, though not directly connected to the current nltk implementation of BLEU: the previous implementation, without the #1330 fix, is subject to the same flaws as the popular multi-bleu.perl. You might find it interesting to know why it returned 0 without the recent fix:

Thanks @alvations. The version of nltk I originally used was 3.2. I have updated it to 3.2.1, and it now raises ZeroDivisionError. I am using Python 3.5.2.

The only stable version of BLEU is in the develop branch. Please wait for it to be released in NLTK 3.2.2, or install the develop branch (but do note that the develop branch might be subject to untested bugs).

OK, I will wait. But in the case you mentioned above, if the weights are [0.25, 0.25, 0.25, 0.25], the results of sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c']) and sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd']) should both be 0, according to the original paper.

The original paper didn't account for the fact that p_n can be 0 if the length of the reference/hypothesis is less than n; see the equation in Section 2.3 of the paper. Because BLEU was meant to be a corpus-level score, the possibility of references/hypotheses shorter than n was not covered in the paper.
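
To make the short-sentence case concrete, here is a rough pure-Python sketch of the clipped n-gram counts behind p_n for the two examples from this issue. The helper names `ngrams` and `modified_precision_counts` are made up for illustration; this is not nltk's code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision_counts(references, hypothesis, n):
    """Return (clipped matches, total hypothesis n-grams) for order n."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    # For each n-gram, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each hypothesis count by the reference maximum.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped, sum(hyp_counts.values())

refs = [['a', 'b', 'c']]
for hyp in (['a', 'b', 'c'], ['a', 'b', 'd']):
    print(hyp, [modified_precision_counts(refs, hyp, n) for n in range(1, 5)])
```

For a three-token hypothesis there are no 4-grams at all, so the n = 4 entry is (0, 0) in both cases, and for ['a', 'b', 'd'] the trigram count is already (0, 1).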

If we look at the formula in Section 2.3, it takes the exp of the weighted sum of log(p_n), and when p_n is 0 this runs into a math domain error, because the logarithm function (i.e. y = log x) has an asymptote at x = 0, so x must be strictly greater than 0.
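
The domain error is easy to reproduce with the standard library alone. The sketch below covers only the exp/log part of the formula (no brevity penalty), and `geometric_mean` is a hypothetical helper name:

```python
import math

def geometric_mean(precisions, weights):
    """exp(sum_n w_n * log p_n): the precision term of the BLEU formula."""
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

weights = [0.25, 0.25, 0.25, 0.25]

# All precisions are positive, so the value is well defined.
print(geometric_mean([1.0, 1.0, 1.0, 1.0], weights))

# As soon as one p_n is 0, math.log raises ValueError.
try:
    geometric_mean([2 / 3, 1 / 2, 0.0, 0.0], weights)
except ValueError as err:
    print("BLEU can't be computed:", err)
```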

So if we were to implement the original BLEU, the user should receive a warning along the lines of "BLEU can't be computed" whenever there is a math domain error. Later versions of BLEU try to fix this with several different hacks; the history of the versions can be found on

Please note that the latest rendition of BLEU, which comes with the smoothing functions from the Chen and Cherry (2014) paper, is not in the Moses version of
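
As a rough idea of what such a smoothing hack can look like, the sketch below replaces a zero match count with a small epsilon before taking the logarithm. The function name, the epsilon value of 0.1, and the choice to skip n-gram orders that have no n-grams at all are assumptions made for this illustration; it is not the nltk or Moses implementation, and it leaves out the brevity penalty.

```python
import math

def smoothed_geometric_mean(counts, weights, epsilon=0.1):
    """counts holds one (matches, total) pair per n-gram order."""
    log_sum = 0.0
    for (matches, total), w in zip(counts, weights):
        if total == 0:
            # Assumption of this sketch: an order with no n-grams in the
            # hypothesis contributes nothing instead of forcing a 0 score.
            continue
        if matches == 0:
            # Swap the zero count for a small positive value so that
            # log() stays defined.
            matches = epsilon
        log_sum += w * math.log(matches / total)
    return math.exp(log_sum)

weights = [0.25, 0.25, 0.25, 0.25]
identical = [(3, 3), (2, 2), (1, 1), (0, 0)]   # hypothesis ['a', 'b', 'c']
one_error = [(2, 3), (1, 2), (0, 1), (0, 0)]   # hypothesis ['a', 'b', 'd']

print(smoothed_geometric_mean(identical, weights))
print(smoothed_geometric_mean(one_error, weights))
```

With these choices the identical hypothesis scores higher than the one containing an error, which is the ordering the original report expected.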

I hope the explanation helps.
