Nltk: Sentence splitting fails in some corner cases

Created on 26 Aug 2019 · 3 comments · Source: nltk/nltk

I understand how hard it is to split sentences that contain abbreviations, and that adding abbreviations can have pitfalls, as explained very well in #2154. However, I have run into some corner cases that I would like to ask about. It seems that using any of the following

e.g.
i.e.
et al.

in a sentence causes the sentence to be split incorrectly.

Example with i.e. and e.g.:

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("Even though exempli gratia and id est are both Latin "
                "(and therefore italicized), no need to put e.g. or i.e. in "
                "italics when they’re in abbreviated form.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['Even though exempli gratia and id est are both Latin (and therefore italicized), no need to put e.g.',
 'or i.e.',
 'in italics when they’re in abbreviated form.']

Example with et al.:

>>> from nltk.tokenize import sent_tokenize
>>> sentence = ("If David et al. get the financing, we can move forward "
                "with the prototype. However, this is very unlikely because "
                "they did not publish sufficiently last year.")
>>> sent_tokenize_list = sent_tokenize(sentence)
>>> sent_tokenize_list
['If David et al.',
 'get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

On my laptop I am using nltk.__version__ 3.4.5.

As I see it, this issue is different from #2154 because these are well-known, commonly used abbreviations (especially in academic circles).

nice idea tokenizer

vezeli

👍2

All 3 comments

Quick hack, following #2154:

>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the prototype. However, this is very unlikely because they did not publish sufficiently last year.'
>>> punkt.tokenize(text)
['If David et al. get the financing, we can move forward with the prototype.',
 'However, this is very unlikely because they did not publish sufficiently last year.']

But perhaps it's a good idea to have an improved sentence tokenizer (#2008, #1214), like what we did with the word tokenizer (#2355).

For example, as a first step we can easily add all of the nltk.corpus.nonbreaking_prefixes to punkt._params.abbrev_types.
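
A minimal sketch of that first step, under some assumptions: the small `extra_abbrevs` set below is a hypothetical stand-in for the full prefix list (which would presumably come from `nltk.corpus.nonbreaking_prefixes`), and the tokenizer is built from empty `PunktParameters` rather than the pretrained English model, so the example needs no downloaded data. Entries are lowercase and carry no final period, matching the `'al'` entry in the quick hack.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Hypothetical stand-in for a full list of non-breaking prefixes.
# Each entry is lowercase, without the final period.
extra_abbrevs = {"al", "e.g", "i.e"}

params = PunktParameters()
params.abbrev_types.update(extra_abbrevs)

# Passing a PunktParameters object instead of training text makes the
# tokenizer use those parameters directly (nothing is trained here).
tokenizer = PunktSentenceTokenizer(params)

et_al_text = (
    "If David et al. get the financing, we can move forward "
    "with the prototype. However, this is very unlikely because "
    "they did not publish sufficiently last year."
)
eg_ie_text = (
    "Even though exempli gratia and id est are both Latin "
    "(and therefore italicized), no need to put e.g. or i.e. in "
    "italics when they're in abbreviated form."
)

et_al_sentences = tokenizer.tokenize(et_al_text)
eg_ie_sentences = tokenizer.tokenize(eg_ie_text)

for sentence in et_al_sentences + eg_ie_sentences:
    print(repr(sentence))
```

With the pretrained model, the same `update` call would go on `punkt._params.abbrev_types`. Since `_params` is a private attribute, this kind of patch may well break between NLTK releases.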

alvations on 28 Aug 2019

👍2

I was about to open an issue along these lines, but I see you beat me to it. If/when people fix this, I suggest going through the list of Latin abbreviations here:
https://en.wikipedia.org/wiki/List_of_Latin_abbreviations
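
If such a list were used, each entry would first need to be put into the form Punkt appears to expect in `abbrev_types`. Judging by the `'al'` entry used for 'et al.' earlier in this thread, that form is the last whitespace-separated token, lowercased, without its final period. The helper below is a hypothetical sketch of that normalization: the function names are made up here, and the sample entries are just a few well-known abbreviations, not data taken from the linked page.

```python
def to_punkt_abbrev_type(abbreviation: str) -> str:
    """Normalize one written abbreviation, e.g. 'et al.' -> 'al'."""
    # Only the final token ends in the period that triggers a split.
    last_token = abbreviation.split()[-1].lower()
    # Drop exactly one trailing period, keeping inner ones ('e.g').
    if last_token.endswith("."):
        last_token = last_token[:-1]
    return last_token


def to_punkt_abbrev_types(abbreviations):
    """Normalize an iterable of abbreviations into a set of types."""
    return {to_punkt_abbrev_type(a) for a in abbreviations if a.strip()}


sample = ["et al.", "e.g.", "i.e.", "etc.", "N.B."]
abbrev_types = to_punkt_abbrev_types(sample)
print(sorted(abbrev_types))
```

The resulting set could then be passed to `abbrev_types.update(...)` on the tokenizer's parameters.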

jrtuenge on 20 Feb 2020

👍1

Technically speaking, those abbreviations in their unabbreviated, fully written-out form represent _multi-word expressions_ (_MWEs_), don't they? I mean, okay, technically they (also) represent _fixed phrase templates_, but that doesn't change their MWE status, so I wonder whether #2202 could help with this issue (although I feel the answer will boil down to "No") 🤔

(PS, fun fact: '&al.' is a legitimate abbreviation of 'et al.', just as '&c.' is a legitimate abbreviation of 'etc.')