NLTK: Sentence splitting fails in some corner cases

Created on 26 Aug 2019 · 3 comments · Source: nltk/nltk

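As #2154 explains well, I understand how hard it is to split sentences that contain abbreviations, and that adding abbreviations can introduce pitfalls of its own. However, I have come across a few corner cases that I would like to ask about. It seems that when any of the following appears in a sentence:

e.g.
i.e.
et al.

the sentence is split in the wrong place.

Example for i.e. and e.g.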

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
...             "(and therefore italicized), no need to put e.g. or i.e. in "
...             "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

Example for et al.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
...             "with the prototype. However, this is very unlikely because "
...             "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

On my laptop, nltk.__version__ is 3.4.5.

As I see it, this issue is different from #2154, because these are well-known and commonly used abbreviations (especially in academia).

Labels: nice idea, tokenizer

vezeli

👍2

Most helpful comment

Quick hack, following #2154:

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')  # register "al." as a known abbreviation
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

But perhaps it would be better to have an improved sentence tokenizer (#2008, #1214), as was done for the word tokenizer (#2355).

For example, as a first step, it is easy to add everything in nltk.corpus.nonbreaking_prefixes to punkt._params.abbrev_types.

alvations on 28 Aug 2019

👍2
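
The "first step" suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions: it builds an untrained `PunktSentenceTokenizer` from empty `PunktParameters` (so no model download is needed, but none of the statistics learned by `english.pickle` are present), and `normalize_abbreviation` and `EXTRA_ABBREVIATIONS` are hypothetical names introduced here, not part of NLTK. Punkt appears to store abbreviations lowercased and without the final period, which is what the helper emulates.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer


def normalize_abbreviation(abbrev):
    """Hypothetical helper: lowercase, keep the last word, drop the final period."""
    return abbrev.split()[-1].lower().rstrip(".")


# Hypothetical hand-picked list; a larger resource could be plugged in instead.
EXTRA_ABBREVIATIONS = ["e.g.", "i.e.", "et al."]

params = PunktParameters()  # empty, untrained parameters (assumption: no pretrained model)
params.abbrev_types.update(normalize_abbreviation(a) for a in EXTRA_ABBREVIATIONS)
tokenizer = PunktSentenceTokenizer(params)

text = ("If David et al. get the financing, we can move forward with the "
        "prototype. However, this is very unlikely because they did not "
        "publish sufficiently last year.")
sentences = tokenizer.tokenize(text)
for sentence in sentences:
    print(sentence)
```

The same `abbrev_types.update(...)` call should apply to a tokenizer loaded from `english.pickle`, as in the quick hack above, but since `_params` is a private attribute this remains a workaround rather than a supported API.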

All 3 comments

I was about to file an issue along these lines, but I see you beat me to it. If people get around to fixing this, I would suggest going through the list of Latin abbreviations here:
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations

jrtuenge on 20 Feb 2020

👍1
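
One way to triage such a list is to probe the tokenizer with each candidate and keep only the abbreviations that still trigger a split. The sketch below is a rough illustration, not a proposed fix: `splits_after` and `candidates` are hypothetical names, the probe sentence is arbitrary, and an untrained `PunktSentenceTokenizer` stands in for the pretrained English model so that the example runs without downloaded data.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer


def splits_after(tokenizer, abbrev):
    """Return True if the tokenizer breaks the probe sentence right after `abbrev`."""
    probe = f"The authors mention this {abbrev} in the second chapter."
    return len(tokenizer.tokenize(probe)) > 1


# Hypothetical sample; the Wikipedia list linked above is much longer.
candidates = ["cf.", "e.g.", "et al.", "i.e.", "viz."]

tokenizer = PunktSentenceTokenizer()  # untrained stand-in (assumption)
tokenizer._params.abbrev_types.add("cf")  # pretend "cf." is already known

# Abbreviations that still cause a wrong split and would need to be added.
missing = [a for a in candidates if splits_after(tokenizer, a)]
print(missing)
```

Run against the pretrained model instead, the same loop would show which entries of the list are actually missing there. That result depends on the model version, so no expected output is given for it.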

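Technically speaking, in their fully written-out, unabbreviated form, these abbreviations represent _multi-word expressions_ (_MWEs_). That is, technically they (also) represent fixed _phrase templates_, but that does not change their MWE status, so I wonder whether #2202 could help with this issue (the answer may be "no") 🤔

(P.S., fun fact: just as "&c." is a valid shortening of "etc.", "&al." is a valid shortening of "et al.")

no-identd on 3 Jan 2021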
