NLTK: Sentence splitting fails in some edge cases

Created on 2019-08-26 · 3 comments · Source: nltk/nltk

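I understand how difficult it is to split sentences that contain abbreviations, and that adding abbreviations can have pitfalls, as nicely explained in #2154. However, I have stumbled upon some edge cases that I would like to ask about. It looks like using any of the following

e.g.
i.e.
et al.

in a sentence causes it to be split in the wrong place.

Example with i.e. and e.g.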

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
                "(and therefore italicized), no need to put e.g. or i.e. in "
                "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

Example with et al.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
                "with the prototype. However, this is very unlikely because "
                "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

On my laptop, nltk.__version__ is 3.4.5.

In my opinion, this issue differs from #2154 because these are well-known and commonly used abbreviations (especially in academia).

nice idea tokenizer

vezeli

👍2

All 3 comments

A quick hack, following #2154:

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

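But maybe an improved sentence tokenizer (#2008, #1214) would be a good idea, like what we did for the word tokenizer (#2355).

For example, as a first step we could easily convert all of nltk.corpus.nonbreaking_prefixes into punkt._params.abbrev_types.

alvations on 2019-08-28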

👍2
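
The suggestion above, seeding `abbrev_types` from a list of non-breaking prefixes, might look roughly like the sketch below. It builds an untrained `PunktSentenceTokenizer` from a `PunktParameters` object instead of loading the English model, so its output will not match `sent_tokenize` in general. `SAMPLE_PREFIXES` and `build_tokenizer` are hypothetical names, and the hard-coded list is only a stand-in for the real prefix corpus, which would have to be downloaded first.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Assumption: a tiny stand-in for the real prefix list. With the corpus
# downloaded, nltk.corpus.nonbreaking_prefixes.words('en') could supply it.
SAMPLE_PREFIXES = ["Mr", "Dr", "e.g", "i.e", "al", "etc"]


def build_tokenizer(prefixes):
    """Return an untrained Punkt tokenizer that treats `prefixes` as abbreviations."""
    params = PunktParameters()
    # Punkt looks up the lowercased token without its final period,
    # so "al." is stored as "al" and "e.g." as "e.g".
    params.abbrev_types.update(p.lower().rstrip(".") for p in prefixes)
    return PunktSentenceTokenizer(params)


tokenizer = build_tokenizer(SAMPLE_PREFIXES)
text = ("If David et al. get the financing, we can move forward with the "
        "prototype. However, this is very unlikely.")
sentences = tokenizer.tokenize(text)
print(sentences)
```

Because this tokenizer has no trained collocations or sentence starters, it only illustrates the abbreviation mechanism and is not a drop-in replacement for the pretrained model.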

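I was about to open an issue along these lines, but I see you beat me to it. If and when someone addresses this, I suggest looking at the list of Latin abbreviations here: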
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations

jrtuenge on 2020-02-20

👍1
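
If a list like that were used, each entry would first need to be reduced to the form the quick hack above stores in `abbrev_types`: lowercase, without the trailing period. A minimal sketch, assuming only the last whitespace-separated token of a multi-word abbreviation matters (as with `'al'` for "et al."); `to_punkt_abbrev` and `LATIN_SAMPLE` are hypothetical names, and the sample is a handful of entries, not the full Wikipedia list.

```python
def to_punkt_abbrev(abbreviation):
    """Reduce a written abbreviation to a lowercase key without its final period."""
    # Only the last token carries the period in forms like "et al."
    last_token = abbreviation.split()[-1].lower()
    if last_token.endswith("."):
        last_token = last_token[:-1]
    return last_token


# Assumption: a small illustrative sample of Latin abbreviations.
LATIN_SAMPLE = ["e.g.", "i.e.", "et al.", "etc.", "cf.", "viz."]

abbrev_keys = sorted({to_punkt_abbrev(a) for a in LATIN_SAMPLE})
print(abbrev_keys)  # ['al', 'cf', 'e.g', 'etc', 'i.e', 'viz']
```

The resulting keys could then be passed to `punkt._params.abbrev_types.update(...)`, with the same pitfalls that #2154 describes for any added abbreviation.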

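Technically, the fully written-out, non-abbreviated forms of those abbreviations are _multi-word expressions_ (_MWEs_), aren't they? I mean, okay, technically they (also) represent fixed _phrase templates_, but that doesn't change their MWE status, so I wonder whether #2202 could help with this (though I have a feeling the answer will boil down to "no") 🤔

(P.S., fun fact: "& al." is a legitimate abbreviation of "et al.", just as "&c." is a legitimate abbreviation of "etc.")

no-identd on 2021-01-03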
