Nltk: Language parameter is not passed in nltk.tag.__init__.pos_tag_sents()

Created on 20 Nov 2018 · 5 Comments · Source: nltk/nltk

The lang parameter of pos_tag_sents() in nltk/tag/__init__.py is not being passed on.

Together with the change to the exception ordering in commit 69583ceaaaff7e51dd9f07f4f226d3a2b75bea69 (lines 110-116 of nltk/tag/__init__.py), this now results in a NotImplementedError("Currently, NLTK pos_tag only supports English and Russian (i.e. lang='eng' or lang='rus')") when tagging a sentence.
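The failure mode described above can be sketched without NLTK installed. Everything in this block is a stand-in written for illustration: `dummy_tagger`, `pos_tag_sents_buggy` and `pos_tag_sents_fixed` are hypothetical names, and `_pos_tag` here is a simplified imitation that assumes the real helper defaults `lang` to `None` and checks the language before tagging.

```python
# Stand-ins for illustration only; these are NOT the NLTK functions themselves.
SUPPORTED_LANGS = ('eng', 'rus')


def dummy_tagger(token):
    # Hypothetical tagger: labels every token 'NN'.
    return 'NN'


def _pos_tag(tokens, tagset=None, tagger=None, lang=None):
    # The language check runs first, so a missing lang fails immediately.
    if lang not in SUPPORTED_LANGS:
        raise NotImplementedError(
            "Currently, NLTK pos_tag only supports English and Russian "
            "(i.e. lang='eng' or lang='rus')"
        )
    return [(token, tagger(token)) for token in tokens]


def pos_tag_sents_buggy(sentences, tagset=None, lang='eng'):
    # Bug: lang is accepted but never forwarded, so _pos_tag sees lang=None.
    return [_pos_tag(sent, tagset, dummy_tagger) for sent in sentences]


def pos_tag_sents_fixed(sentences, tagset=None, lang='eng'):
    # Fix: forward lang explicitly to the helper.
    return [_pos_tag(sent, tagset, dummy_tagger, lang) for sent in sentences]
```

With these stand-ins, `pos_tag_sents_buggy([['a']])` raises NotImplementedError even though the caller's default is `lang='eng'`, while `pos_tag_sents_fixed([['a']])` returns the tagged sentence.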

edjzhang

All 5 comments

Pull request: https://github.com/nltk/nltk/pull/2186

AndrewOwenMartin on 20 Nov 2018

Thanks @ezhangsfl

stevenbird on 21 Nov 2018

I am still getting this error even though I have updated to the latest files and have also tried adding the lang='eng' parameter manually, but that did not work either. @ezhangsfll @stevenbird

carnesca on 7 Feb 2019

The latest release is from the 17th, whereas this was merged after that.

AllanWang on 14 Feb 2019

👍3

> I am still getting this error even though I have updated to the latest files and have also tried adding the lang='eng' parameter manually, but that did not work either. @ezhangsfll @stevenbird

Replace the contents of the file (__init__.py) with the following:

# -*- coding: utf-8 -*-
# Natural Language Toolkit: Taggers
#
# Copyright (C) 2001-2019 NLTK Project
# Author: Edward Loper <[email protected]>
#         Steven Bird <[email protected]> (minor additions)
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT

"""
NLTK Taggers

This package contains classes and interfaces for part-of-speech
tagging, or simply "tagging".

A "tag" is a case-sensitive string that specifies some property of a token,
such as its part of speech.  Tagged tokens are encoded as tuples
``(token, tag)``.  For example, the following tagged token combines
the word ``'fly'`` with a noun part of speech tag (``'NN'``):

>>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

A Russian tagger is also available if you specify lang="rus". It uses
the Russian National Corpus tagset:

>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')    # doctest: +SKIP
[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),
('бумажку', 'S'), ('.', 'NONLEX')]

This package defines several taggers, which take a list of tokens,
assign a tag to each one, and return the resulting list of tagged tokens.
Most of the taggers are built automatically based on a training corpus.
For example, the unigram tagger tags each word *w* by checking what
the most frequent tag for *w* was in a training corpus:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

Note that words that the tagger has not seen during training receive a tag
of ``None``.

We evaluate a tagger on data that was not seen during training:

>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.73...

For more information, please consult chapter 5 of the NLTK Book.
"""
from __future__ import print_function

from nltk.tag.api import TaggerI
from nltk.tag.util import str2tuple, tuple2str, untag
from nltk.tag.sequential import (
    SequentialBackoffTagger,
    ContextTagger,
    DefaultTagger,
    NgramTagger,
    UnigramTagger,
    BigramTagger,
    TrigramTagger,
    AffixTagger,
    RegexpTagger,
    ClassifierBasedTagger,
    ClassifierBasedPOSTagger,
)
from nltk.tag.brill import BrillTagger
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.tnt import TnT
from nltk.tag.hunpos import HunposTagger
from nltk.tag.stanford import StanfordTagger, StanfordPOSTagger, StanfordNERTagger
from nltk.tag.hmm import HiddenMarkovModelTagger, HiddenMarkovModelTrainer
from nltk.tag.senna import SennaTagger, SennaChunkTagger, SennaNERTagger
from nltk.tag.mapping import tagset_mapping, map_tag
from nltk.tag.crf import CRFTagger
from nltk.tag.perceptron import PerceptronTagger

from nltk.data import load, find

RUS_PICKLE = (
    'taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle'
)

def _get_tagger(lang=None):
    if lang == 'rus':
        tagger = PerceptronTagger(False)
        ap_russian_model_loc = 'file:' + str(find(RUS_PICKLE))
        tagger.load(ap_russian_model_loc)
    else:
        tagger = PerceptronTagger()
    return tagger

def _pos_tag(tokens, tagset=None, tagger=None, lang=None):
    # Currently only supports English and Russian.
    if lang not in ['eng', 'rus']:
        raise NotImplementedError(
            "Currently, NLTK pos_tag only supports English and Russian "
            "(i.e. lang='eng' or lang='rus')"
        )
    else:
        tagged_tokens = tagger.tag(tokens)
        if tagset:  # Maps to the specified tagset.
            if lang == 'eng':
                tagged_tokens = [
                    (token, map_tag('en-ptb', tagset, tag))
                    for (token, tag) in tagged_tokens
                ]
            elif lang == 'rus':
                # Note that the new Russian pos tags from the model contain suffixes,
                # see https://github.com/nltk/nltk/issues/2151#issuecomment-430709018
                tagged_tokens = [
                    (token, map_tag('ru-rnc-new', tagset, tag.partition('=')[0]))
                    for (token, tag) in tagged_tokens
                ]
        return tagged_tokens

def pos_tag(tokens, tagset=None, lang='eng'):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.

    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be used, e.g. universal, wsj, brown
    :type tagset: str
    :param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
    :type lang: str
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = _get_tagger(lang)
    return _pos_tag(tokens, tagset, tagger, lang)

def pos_tag_sents(sentences, tagset=None, lang='eng'):
    """
    Use NLTK's currently recommended part of speech tagger to tag the
    given list of sentences, each consisting of a list of tokens.

    :param sentences: List of sentences to be tagged
    :type sentences: list(list(str))
    :param tagset: the tagset to be used, e.g. universal, wsj, brown
    :type tagset: str
    :param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
    :type lang: str
    :return: The list of tagged sentences
    :rtype: list(list(tuple(str, str)))
    """
    tagger = _get_tagger(lang)
    return [_pos_tag(sent, tagset, tagger, lang) for sent in sentences]
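As a side note on the Russian branch of `_pos_tag` in the file above: before mapping, it keeps only the part of each tag that precedes the first `=`. A minimal sketch of that one step in plain Python follows. The helper name `strip_suffix` and the sample tags are made up for illustration and are not taken from the model's actual output.

```python
def strip_suffix(tag):
    # str.partition returns (head, separator, tail); the head is the text
    # before the first '=' or the whole string when there is no '='.
    return tag.partition('=')[0]


# Hypothetical sample tags, for illustration only.
sample_tags = ['S=hypothetical-suffix', 'V=another=suffix', 'CONJ']
stripped = [strip_suffix(tag) for tag in sample_tags]
print(stripped)
```

Tags without a `=` pass through unchanged, so the same expression is safe for both suffixed and plain tags.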