Nltk: Sentence tokenizer not splitting correctly

Created on 23 Nov 2015  ·  5 comments  ·  Source: nltk/nltk

I think there is a bug in the standard sentence tokenizer sent_tokenize. The problem is that it does not split text into sentences in certain cases. Here is a case where the tokenizer fails to split the text into two sentences:

[sent for sent in nltk.sent_tokenize('Model wears size S. Fits size.')]

This returns ['Model wears size S. Fits size.'] instead of ['Model wears size S.', 'Fits size.']. The problem seems to appear when the last word before the . contains only one character; if it contains two or more characters, the text is split correctly.
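For reference, sent_tokenize delegates by default to a pretrained English punkt model, which can be loaded and run directly. A minimal sketch for reproducing and poking at the behavior (the pickle path assumes NLTK's punkt data is installed, and _params is a private attribute, so this is for inspection only):

import nltk

# Load the pretrained English punkt model that sent_tokenize uses by default.
punkt = nltk.data.load('tokenizers/punkt/english.pickle')
print(punkt.tokenize('Model wears size S. Fits size.'))

# Peek at some of the abbreviations the model learned during training; single
# letters followed by a period are additionally treated as potential initials.
print(sorted(punkt._params.abbrev_types)[:20])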

Labels: inactive, tokenizer

All 5 comments

This looks very hard to fix in the sentence tokenizer if you consider that "S. Fits" may be the first initial and last name of a person.

I think the way to go is to subclass or copy-paste the default NLTK sentence tokenizer and modify it to fit your application. E.g., if you don't expect such person names in your text, remove the rules that handle person names. Another option is a workaround: replace size <X> with size_<X> before tokenization and replace it back after the text is split into sentences, as in the sketch below.
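A minimal sketch of that second workaround, assuming the only problematic pattern is a single letter after the word size (the helper name and regexes are illustrative, and whether the mask actually changes punkt's decision depends on the trained model):

import re
import nltk

def sent_tokenize_with_size_mask(text):
    # Mask 'size X' so the period after the single letter is less likely to be
    # read as an initial by the punkt model.
    masked = re.sub(r'size ([A-Za-z])\b', r'size_\1', text)
    sentences = nltk.sent_tokenize(masked)
    # Undo the mask in every resulting sentence.
    return [re.sub(r'size_([A-Za-z])\b', r'size \1', s) for s in sentences]

print(sent_tokenize_with_size_mask('Model wears size S. Fits size.'))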

Hmm, just tried again. The first case I presented is still not split correctly, but if I use different characters, it sometimes splits! That is why I wrote this quick test:

import nltk
import pprint

pp = pprint.PrettyPrinter(indent=4)
s = 'Test {}. Test {}.'
# Tokenize the same two-sentence template for every lowercase letter.
for char in 'abcdefghijklmnopqrstuvwxyz':
    pp.pprint(nltk.sent_tokenize(s.format(char, char)))

Output:

['Test a.', 'Test a.']
['Test b.', 'Test b.']
['Test c. Test c.']
['Test d. Test d.']
['Test e. Test e.']
['Test f. Test f.']
['Test g. Test g.']
['Test h. Test h.']
['Test i.', 'Test i.']
['Test j.', 'Test j.']
['Test k. Test k.']
['Test l. Test l.']
['Test m. Test m.']
['Test n. Test n.']
['Test o.', 'Test o.']
['Test p. Test p.']
['Test q.', 'Test q.']
['Test r. Test r.']
['Test s. Test s.']
['Test t. Test t.']
['Test u.', 'Test u.']
['Test v. Test v.']
['Test w. Test w.']
['Test x.', 'Test x.']
['Test y.', 'Test y.']
['Test z.', 'Test z.']

@kmike, as you can see, it is very inconsistent.

@JernejJerin It's not a rule-based tokenizer, so the "rules" of splitting can't be controlled or explained with a regex-like description.

The algorithm used to train the sent_tokenizer is the Kiss and Strunk (2006) Punkt algorithm. It is a statistical system that tries to learn sentence boundaries, so it's not perfect, but it is consistent with the probabilities generated from the model (which are not necessarily human-like rules).
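If the pretrained model keeps making the wrong call on your data, one option is to train punkt on text from your own domain. A minimal sketch, assuming you have a plain-text corpus file (my_corpus.txt is a placeholder name):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Train a punkt model on domain text instead of using the pretrained English model.
with open('my_corpus.txt') as f:   # placeholder corpus file
    domain_text = f.read()

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # consider all word pairs ending in a period as potential collocations
trainer.train(domain_text)

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize('Model wears size S. Fits size.'))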

Just want to add a real-world example from BookCorpus, extracted from "Three Plays", published by Mike Suttons at Smashwords.

sent_tokenize('The weather is terrible, and my day was ok. You are supposed to take your medicine.')

Output

['The weather is terrible, and my day was ok. You are supposed to take your medicine.']

This confirms that nltk did not recognize the period after "ok" as a sentence boundary.
