Given weights = [0.25, 0.25, 0.25, 0.25] (the default value),
sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c']) = 0
while sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd']) = 0.7598
Obviously the first score should be larger than the second, or both scores should be 0.
Which version of the code are you using?
$ python
>>> import nltk
>>> nltk.__version__
'3.2.1'
The BLEU implementation was fixed just recently, when #1330 was resolved. If you're using the develop branch of nltk, this should be the output:
>>> import nltk
>>> from nltk import bleu
>>> ref = hyp = 'abc'
>>> bleu([ref], hyp)
1.0
>>> from nltk import bleu
>>> ref, hyp = 'abc', 'abd'
>>> bleu([ref], hyp)
0.7598356856515925
Since a string is a sequence of characters and nltk imports sentence_bleu() into its top-level namespace (as bleu), the code above is the same as:
>>> from nltk.translate.bleu_score import sentence_bleu
>>> sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c'])
1.0
>>> sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd'])
0.7598356856515925
To install the latest develop branch, try:
pip install https://github.com/nltk/nltk/archive/develop.zip
(Do note that the develop branch is subject to _more unexpected bugs_, and it is recommended that users install the master branch or an official release.)
On a related note, though not directly involving the current nltk implementation of bleu: the previous implementation, without the #1330 fix, is subject to the same flaws as the popular multi-bleu.perl. You might find it interesting to know why it returned 0 without the recent fix: https://gist.github.com/alvations/e5922afa8c91472d25c58b2d712a93e7
Thanks @alvations. The version of nltk I originally used was 3.2. I have updated it to 3.2.1, and it now raises ZeroDivisionError. I am using Python 3.5.2.
The only stable version of BLEU is in the develop branch. Please wait for it to be released in NLTK 3.2.2, or install the develop branch (but do note that the development branch might contain untested bugs).
OK. I will wait. But in the case you mentioned above, if the weight is [0.25, 0.25, 0.25, 0.25], the results of sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'c']) and sentence_bleu([['a', 'b', 'c']], ['a', 'b', 'd']) should both be 0, according to the original paper
The original paper didn't account for the fact that p_n can be 0 if the length of the reference/hypothesis is less than n; see the equation in Section 2.3 of http://www.aclweb.org/anthology/P02-1040.pdf. Because BLEU was meant to be a corpus-level score, the possibility of references/hypotheses shorter than n was not covered in the paper.
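To make the short-sentence case concrete, here is a minimal sketch in plain Python that counts clipped n-gram matches for the two examples from this thread. It is not NLTK's implementation; the helper names `ngrams` and `modified_precision_counts` are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision_counts(references, hypothesis, n):
    """Return (clipped matches, total hypothesis n-grams) for order n."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    # For each n-gram, keep the highest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each hypothesis count by the reference maximum.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped, sum(hyp_counts.values())

refs = [['a', 'b', 'c']]
for hyp in (['a', 'b', 'c'], ['a', 'b', 'd']):
    print(hyp, [modified_precision_counts(refs, hyp, n) for n in range(1, 5)])
```

For both hypotheses the 4-gram entry is (0, 0): a three-token sentence contains no 4-grams at all, so p_4 is a 0/0 ratio rather than a well-defined precision.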
If we look at the formula in Section 2.3, it takes exp of the weighted sum of log(p_n), and when p_n is 0 this runs into a math domain error, because the logarithm function (i.e. y = log x) has a vertical asymptote at x = 0, so x must be greater than 0.
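The domain error is easy to reproduce. The sketch below applies the Section 2.3 formula literally, with no smoothing; `unsmoothed_bleu` is a hypothetical helper written for this illustration, not an NLTK function, and the brevity penalty is simply passed in as a number.

```python
import math

def unsmoothed_bleu(precisions, weights, brevity_penalty=1.0):
    """Compute BP * exp(sum_n w_n * log p_n) with no smoothing at all."""
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty * math.exp(log_sum)

weights = [0.25, 0.25, 0.25, 0.25]

# All four precisions are non-zero, so the score is well defined.
print(unsmoothed_bleu([1.0, 1.0, 1.0, 1.0], weights))

# A single zero precision makes math.log raise ValueError.
try:
    unsmoothed_bleu([1.0, 1.0, 1.0, 0.0], weights)
except ValueError as err:
    print("BLEU can't be computed:", err)
```

Any implementation that returns a number in the second case has made a choice the paper does not specify, which is why different tools disagree on short sentences.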
So if we were to implement the original BLEU, the user should receive a warning that says something like "BLEU can't be computed" whenever there is a math domain error. Later versions of BLEU try to fix this with several different hacks; the history of the versions can be found at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L17
Please note that the latest rendition of BLEU comes with the smoothing functions from the Chen and Cherry (2014) paper, which are not in the Moses version of mteval.pl.
I hope the explanation helps.