I understand how difficult it is to split sentences that contain abbreviations, and that adding abbreviations has pitfalls of its own, as is nicely explained in #2154. However, I have stumbled upon some corner cases that I would like to ask about. It looks like using any of the following abbreviations in a sentence causes it to be split incorrectly.
Example for i.e. and e.g.:
>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
...             "(and therefore italicized), no need to put e.g. or i.e. in "
...             "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
'or i.e.',
'in italics when they’re in abbreviated form.']
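For what it's worth, the bad split is not specific to Punkt: any splitter that breaks on sentence-final punctuation without consulting an abbreviation list does the same. Here is a minimal, self-contained sketch of the protect-then-split idea (not NLTK code; the abbreviation list and regex are illustrative assumptions):

```python
import re

# Illustrative abbreviation list -- an assumption for this sketch,
# not NLTK's actual data.
ABBREVIATIONS = ["e.g.", "i.e.", "et al.", "etc."]

def split_sentences(text):
    """Split on sentence-final punctuation, but never after a known abbreviation."""
    protected = text
    # Mask the periods inside known abbreviations so they cannot trigger a split.
    for abbr in ABBREVIATIONS:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    # Break after ., !, or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", protected)
    # Restore the masked periods.
    return [part.replace("<DOT>", ".") for part in parts]

sentence = ("Even though exempli gratia and id est are both Latin "
            "(and therefore italicized), no need to put e.g. or i.e. in "
            "italics when they're in abbreviated form.")
print(split_sentences(sentence))  # one sentence, no break at e.g. or i.e.
```

This is essentially what the abbrev_types hack below does inside Punkt: tell the splitter which period-bearing tokens are not sentence boundaries.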
Example for et al.:
>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
...             "with the prototype. However, this is very unlikely because "
...             "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
'get the financing, we can move forward with the prototype.',
'However, this is very unlikely because they did not publish sufficiently last year.']
On my laptop, nltk.__version__ is 3.4.5.
As I see it, this issue is different from #2154, because these are well-known and commonly used abbreviations (especially in academic circles).
Quick hack, following #2154
>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
'However, this is very unlikely because they did not publish sufficiently last year.']
But perhaps it's a good idea to have an improved sentence tokenizer (#2008, #1214), like what we did with the word tokenizer (#2355). E.g., as a first step, we could easily add all the nltk.corpus.nonbreaking_prefixes to punkt._params.abbrev_types.
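As a rough sketch of that first step (the prefix list below is a hardcoded stand-in for the English entries of nltk.corpus.nonbreaking_prefixes, and the normalization is based on Punkt storing abbreviations lowercased without the trailing period, as the 'al' hack above suggests):

```python
# Sketch: seed a Punkt-style abbreviation set from a nonbreaking-prefix list.
# Real code would load the prefixes from nltk.corpus.nonbreaking_prefixes
# and mutate punkt._params.abbrev_types directly.

def seed_abbreviations(abbrev_types, prefixes):
    """Add each prefix to the abbreviation set, normalized the way Punkt
    stores them: lowercased, with no trailing period."""
    for prefix in prefixes:
        abbrev_types.add(prefix.lower().rstrip("."))
    return abbrev_types

# punkt._params.abbrev_types is a plain set, so a set stands in for it here.
abbrev_types = set()
prefixes = ["al", "e.g", "i.e", "etc", "Mr", "Dr", "Prof"]
seed_abbreviations(abbrev_types, prefixes)
print(sorted(abbrev_types))
```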
I was about to add an issue along these lines, but saw that you've beaten me to it. If/when folks get around to fixing this, I'd suggest reviewing the list of Latin abbreviations here:
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations
Technically speaking, those abbreviations in their fully written, non-abbreviated form represent _multi-word expressions_ (_MWEs_), no? I mean, okay, technically they (also) represent fixed _phrasal templates_, but that doesn't change their MWE status, so I wonder whether #2202 could help with this problem (though I suspect the answer will come down to "No") 🤔
(P.S., fun facts: '& al.' is a legal shortening of 'et al.', as '&c.' is a legal shortening of 'etc.')