NLTK: Sentence splitting fails in some tricky cases

Created on 26 Aug 2019 · 3 Comments · Source: nltk/nltk

I understand how hard it is to split sentences that contain abbreviations, and that adding abbreviations can have pitfalls, as is well explained in #2154. However, I came across a few corner cases that I would like to ask about. It seems that using any of the following

e.g.
i.e.
et al.

in a sentence will split the sentence incorrectly.

Example for i.e. and e.g.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
...             "(and therefore italicized), no need to put e.g. or i.e. in "
...             "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

Example for et al.

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
...             "with the prototype. However, this is very unlikely because "
...             "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

On my laptop, I am using nltk.__version__ 3.4.5.

In my view, this problem is different from #2154 because these are well-known and commonly used abbreviations (especially in academic circles).

Labels: nice idea, tokenizer

vezeli

👍2

All 3 comments

Quick hack, following #2154

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']
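The same idea can be sketched for all three abbreviations from the report without touching the pretrained pickle. This is only a rough illustration: it builds an untrained Punkt tokenizer from a hand-written abbreviation set, so it lacks everything the pretrained English model has learned, and the names `LATIN_ABBREVIATIONS` and `build_tokenizer` are made up for this example.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Punkt stores abbreviations lowercased and without the final period,
# so "e.g." is registered as "e.g" and "et al." as "al".
# Illustrative set only, not an exhaustive list.
LATIN_ABBREVIATIONS = {"al", "e.g", "i.e"}


def build_tokenizer(abbreviations=LATIN_ABBREVIATIONS):
    """Return an untrained Punkt tokenizer that knows the given abbreviations."""
    params = PunktParameters()
    params.abbrev_types.update(abbreviations)
    return PunktSentenceTokenizer(params)


tokenizer = build_tokenizer()

eg_text = ("Even though exempli gratia and id est are both Latin "
           "(and therefore italicized), no need to put e.g. or i.e. in "
           "italics when they're in abbreviated form.")
et_al_text = ("If David et al. get the financing, we can move forward "
              "with the prototype. However, this is very unlikely because "
              "they did not publish sufficiently last year.")

eg_sentences = tokenizer.tokenize(eg_text)
et_al_sentences = tokenizer.tokenize(et_al_text)

for sentence in eg_sentences + et_al_sentences:
    print(repr(sentence))
```

Because Punkt still allows a break after a known abbreviation when other evidence points to a sentence start, registering an abbreviation lowers the chance of a wrong split rather than ruling it out.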

But maybe it's a good idea to have an improved sentence tokenizer (#2008, #1214), like what we did with the word tokenizer (#2355).

For example, we can easily convert nltk.corpus.nonbreaking_prefixes into punkt._params.abbrev_types as a first step.
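A minimal sketch of what that first step might look like, under some assumptions: that the prefix entries are Moses-style lines (one prefix per line, with an optional `#NUMERIC_ONLY#` marker), and that Punkt expects lowercased entries without the trailing period. The helper names `to_abbrev_types` and `extend_abbreviations` are hypothetical, and a stand-in object replaces the real tokenizer so the snippet runs without any NLTK data.

```python
from types import SimpleNamespace


def to_abbrev_types(prefixes):
    """Normalize nonbreaking-prefix entries to the form Punkt stores."""
    abbrevs = set()
    for entry in prefixes:
        fields = entry.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        # Assumption: prefixes that only apply before numbers (e.g. "No")
        # are left out, since Punkt has no numeric-only notion.
        if "#NUMERIC_ONLY#" in fields:
            continue
        abbrevs.add(fields[0].lower().rstrip("."))
    return abbrevs


def extend_abbreviations(punkt, prefixes):
    """Add the normalized prefixes to a tokenizer's abbreviation set."""
    punkt._params.abbrev_types.update(to_abbrev_types(prefixes))
    return punkt


# Stand-in for a loaded Punkt tokenizer; with NLTK installed, the object
# returned by nltk.data.load('tokenizers/punkt/english.pickle') would be
# passed instead, and the entries would come from
# nltk.corpus.nonbreaking_prefixes.words('en').
fake_punkt = SimpleNamespace(_params=SimpleNamespace(abbrev_types={"al"}))
sample_prefixes = ["Dr", "e.g", "i.e", "No #NUMERIC_ONLY#", "# comment"]

extend_abbreviations(fake_punkt, sample_prefixes)
print(sorted(fake_punkt._params.abbrev_types))
```

Dropping the numeric-only entries is a conservative choice for this sketch; treating them as unconditional abbreviations would suppress real sentence breaks after words such as "No.".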

alvations on 28 Aug 2019

👍2

I was about to open an issue along these lines, but I see you beat me to it. If/when people get around to fixing this, I suggest reviewing the list of Latin abbreviations here:
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations

jrtuenge on 20 Feb 2020

👍1

Technically speaking, these abbreviations, in their fully written-out form, represent _multi-word expressions_ (_MWEs_), no? I mean, OK, technically they (also) represent fixed _phrasal templates_, but that does not change their MWE status, so I wonder whether #2202 could help with this problem (although I have a feeling the answer will be "No") 🤔

(PS, trivia: '& al.' is a valid abbreviation of 'et al.', just as '&c.' is a valid abbreviation of 'etc.')