NLTK: Sentence splitting fails in some corner cases

Created on 2019-08-26  ·  3 comments  ·  Source: nltk/nltk

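I understand how difficult it is to split sentences that contain abbreviations, and that adding abbreviations can have pitfalls, as nicely explained in #2154. However, I have stumbled upon some corner cases that I would like to ask about. It looks like using any of

  • e.g.
  • i.e.
  • et al.

in a sentence causes the sentence to be split in the wrong place.

Example for i.e. and e.g.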

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
                "(and therefore italicized), no need to put e.g. or i.e. in "
                "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)

>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

Example for et al.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
                "with the prototype. However, this is very unlikely be cause "
                "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely be cause they did not publish sufficiently last year.']

On my laptop, nltk.__version__ is 3.4.5.

In my opinion, this issue is different from #2154, because these are well-known and commonly used abbreviations (especially in academia).

nice idea tokenizer

All 3 comments

A quick hack, following #2154:

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely be cause they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.', 
'However, this is very unlikely be cause they did not publish sufficiently last year.']
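
The same hack could be extended to the other abbreviations from the report. The following is only a minimal sketch: it assumes that Punkt stores abbreviation types lowercased and without the final period (as the 'al' entry above suggests), and the helper names `punkt_abbrev_types` and `add_abbreviations` are hypothetical, not part of NLTK.

```python
def punkt_abbrev_types(abbreviations):
    """Convert abbreviations such as 'et al.' into Punkt-style types.

    Assumption: Punkt compares the lowercased token minus its final
    period against `abbrev_types`, so 'e.g.' becomes 'e.g'.
    """
    types = set()
    for abbreviation in abbreviations:
        for token in abbreviation.split():
            # Only period-final tokens can be mistaken for a sentence end,
            # so 'et' in 'et al.' is ignored and only 'al' is kept.
            if token.endswith("."):
                types.add(token[:-1].lower())
    return types


def add_abbreviations(tokenizer, abbreviations):
    """Register extra abbreviations on an already loaded Punkt tokenizer.

    `_params` is a private attribute; it is used here only because the
    quick hack above relies on it as well.
    """
    tokenizer._params.abbrev_types.update(punkt_abbrev_types(abbreviations))
    return tokenizer
```

With `punkt` loaded as in the hack above, `add_abbreviations(punkt, ["e.g.", "i.e.", "et al."])` would register `e.g`, `i.e` and `al`. Only the `al` case is confirmed by the output shown in this thread; the effect on the e.g./i.e. example has not been verified here.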

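But perhaps it would be a good idea to have an improved sentence tokenizer (#2008, #1214), like what we did for the word tokenizer (#2355).

For example, as a first step we could easily convert all of nltk.corpus.nonbreaking_prefixes into punkt._params.abbrev_types.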
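
A rough sketch of what that first step might look like. It assumes the `nonbreaking_prefixes` corpus has been downloaded (`nltk.download('nonbreaking_prefixes')`), that its entries follow the Moses convention of marking some prefixes with `#NUMERIC_ONLY#`, and that Punkt expects lowercased types without a trailing period. The function names are hypothetical.

```python
NUMERIC_ONLY = "#NUMERIC_ONLY#"


def prefixes_to_abbrev_types(prefixes):
    """Turn nonbreaking-prefix entries into Punkt-style abbreviation types."""
    types = set()
    for entry in prefixes:
        entry = entry.strip()
        # Skip blanks, comment lines, and prefixes that are only
        # nonbreaking in front of a number (assumed Moses convention).
        if not entry or entry.startswith("#") or NUMERIC_ONLY in entry:
            continue
        types.add(entry.rstrip(".").lower())
    return types


def extend_punkt_with_prefixes(tokenizer, lang="en"):
    """Add the nonbreaking prefixes for `lang` to a loaded Punkt tokenizer."""
    # Imported lazily so the pure helper above works without NLTK data.
    from nltk.corpus import nonbreaking_prefixes

    tokenizer._params.abbrev_types.update(
        prefixes_to_abbrev_types(nonbreaking_prefixes.words(lang))
    )
    return tokenizer
```

One caveat with this design: the prefix lists include single letters, so a blanket import may suppress legitimate sentence breaks, which is the kind of pitfall described in #2154.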

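I was about to open an issue along these lines, but I see you have beaten me to it. If/when people work on this, I suggest looking at the list of Latin abbreviations here: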
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations

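Technically, the fully written-out, unabbreviated forms of those abbreviations are _multi-word expressions_ (_MWEs_), aren't they? I mean, okay, technically they (also) represent fixed _phrase templates_, but that doesn't change their MWE status, so I wonder whether #2202 could help with this (although I have a feeling the answer will boil down to "no") 🤔

(PS, fun fact: "& al." is a legitimate abbreviation of "et al.", just as "&c." is a legitimate abbreviation of "etc.")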
