I understand how difficult it is to split sentences that contain abbreviations, and that adding abbreviations has pitfalls of its own, as is nicely explained in #2154. However, I have stumbled upon some corner cases that I would like to ask about. It looks like using any of the following abbreviations in a sentence causes it to be split incorrectly.
Example for i.e. and e.g.:
>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
...             "(and therefore italicized), no need to put e.g. or i.e. in "
...             "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
'or i.e.',
'in italics when they’re in abbreviated form.']
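For what it's worth, the bad split is not specific to Punkt: any splitter that breaks on sentence-final punctuation without consulting an abbreviation list does the same. Here is a minimal, self-contained sketch of the protect-then-split idea (not NLTK code; the abbreviation list and regex are illustrative assumptions):

```python
import re

# Illustrative abbreviation list -- an assumption for this sketch,
# not NLTK's actual data.
ABBREVIATIONS = ["e.g.", "i.e.", "et al.", "etc."]

def split_sentences(text):
    """Split on sentence-final punctuation, but never after a known abbreviation."""
    protected = text
    # Mask the periods inside known abbreviations so they cannot trigger a split.
    for abbr in ABBREVIATIONS:
        protected = protected.replace(abbr, abbr.replace(".", "<DOT>"))
    # Break after ., !, or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", protected)
    # Restore the masked periods.
    return [part.replace("<DOT>", ".") for part in parts]

sentence = ("Even though exempli gratia and id est are both Latin "
            "(and therefore italicized), no need to put e.g. or i.e. in "
            "italics when they're in abbreviated form.")
print(split_sentences(sentence))  # one sentence, no break at e.g. or i.e.
```

This is essentially what the abbrev_types hack below does inside Punkt: tell the splitter which period-bearing tokens are not sentence boundaries.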
Example for et al.:
>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
...             "with the prototype. However, this is very unlikely because "
...             "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
'get the financing, we can move forward with the prototype.',
'However, this is very unlikely because they did not publish sufficiently last year.']
On my laptop, nltk.__version__ is 3.4.5.
As I see it, this issue is different from #2154, because these are well-known and commonly used abbreviations (especially in academic circles).
Quick hack, following #2154
>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
'However, this is very unlikely because they did not publish sufficiently last year.']
But perhaps it's a good idea to have an improved sentence tokenizer (#2008, #1214), like what we did with the word tokenizer (#2355). E.g., as a first step, we could easily add all the nltk.corpus.nonbreaking_prefixes to punkt._params.abbrev_types.
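As a rough sketch of that first step (the prefix list below is a hardcoded stand-in for the English entries of nltk.corpus.nonbreaking_prefixes, and the normalization is based on Punkt storing abbreviations lowercased without the trailing period, as the 'al' hack above suggests):

```python
# Sketch: seed a Punkt-style abbreviation set from a nonbreaking-prefix list.
# Real code would load the prefixes from nltk.corpus.nonbreaking_prefixes
# and mutate punkt._params.abbrev_types directly.

def seed_abbreviations(abbrev_types, prefixes):
    """Add each prefix to the abbreviation set, normalized the way Punkt
    stores them: lowercased, with no trailing period."""
    for prefix in prefixes:
        abbrev_types.add(prefix.lower().rstrip("."))
    return abbrev_types

# punkt._params.abbrev_types is a plain set, so a set stands in for it here.
abbrev_types = set()
prefixes = ["al", "e.g", "i.e", "etc", "Mr", "Dr", "Prof"]
seed_abbreviations(abbrev_types, prefixes)
print(sorted(abbrev_types))
```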
I was about to add an issue along these lines, but saw that you've beaten me to it. If/when folks get around to fixing this, I'd suggest reviewing the list of Latin abbreviations here:
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations
Technically speaking, those abbreviations in their fully written, non-abbreviated form represent _multi-word expressions_ (_MWEs_), no? I mean, okay, technically they (also) represent fixed _phrasal templates_, but that doesn't change their MWE status, so I wonder whether #2202 could help with this problem (though I suspect the answer will come down to "No") 🤔
(P.S., fun facts: '& al.' is a legal shortening of 'et al.', as '&c.' is a legal shortening of 'etc.')