Nltk: The lang parameter is not being passed in nltk.tag.__init__.pos_tag_sents()

Created on Nov 20, 2018 · 5 comments · Source: nltk/nltk

The lang parameter of pos_tag_sents() in nltk/tag/__init__.py is not being passed on.

Together with the change in the order of the exception check in commit 69583ceaaaff7e51dd9f07f4f226d3a2b75bea69 (lines 110-116 of nltk/tag/__init__.py), this now results in the error NotImplementedError("Currently, NLTK pos_tag only supports English and Russian (i.e. lang='eng' or lang='rus')") when tagging a sentence.
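
The failure described above can be reproduced without NLTK installed. The following is a minimal sketch, not the library's code: `fake_tagger`, `pos_tag_sents_buggy`, and `pos_tag_sents_fixed` are hypothetical stand-ins, and it assumes the pre-fix pos_tag_sents() simply omitted lang when calling _pos_tag(), whose own default is lang=None.

```python
def fake_tagger(tokens):
    """Hypothetical stand-in for a real tagger: tags every token as 'NN'."""
    return [(token, "NN") for token in tokens]


def _pos_tag(tokens, tagset=None, tagger=None, lang=None):
    # Same shape as the check in nltk/tag/__init__.py: the language is
    # validated first, so a missing lang (None) fails before any tagging.
    if lang not in ["eng", "rus"]:
        raise NotImplementedError(
            "Currently, NLTK pos_tag only supports English and Russian "
            "(i.e. lang='eng' or lang='rus')"
        )
    return tagger(tokens)


def pos_tag_sents_buggy(sentences, tagset=None, lang="eng"):
    # Assumed pre-fix behaviour: lang is accepted but never forwarded.
    return [_pos_tag(sent, tagset, fake_tagger) for sent in sentences]


def pos_tag_sents_fixed(sentences, tagset=None, lang="eng"):
    # The fix: forward lang so the check sees 'eng' instead of None.
    return [_pos_tag(sent, tagset, fake_tagger, lang) for sent in sentences]


sentences = [["John", "runs"], ["Mary", "sleeps"]]

try:
    pos_tag_sents_buggy(sentences)
except NotImplementedError as err:
    print("buggy version raised:", err)

print(pos_tag_sents_fixed(sentences))
```

The buggy variant fails even though the caller's default is lang='eng', because the default that matters is the one on _pos_tag().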

edjzhang

All 5 comments

Pull request: https://github.com/nltk/nltk/pull/2186

AndrewOwenMartin on Nov 20, 2018

Thanks @ezhangsfl

stevenbird on Nov 21, 2018

I am still getting this error even though I updated to the latest files. I also tried adding the lang='eng' parameter manually, but that did not work either. @ezhangsfll @stevenbird

carnesca on Feb 7, 2019

The latest release was on the 17th, but this fix was merged after that.

AllanWang on Feb 14, 2019

👍3

I am still getting this error even though I updated to the latest files. I also tried adding the lang='eng' parameter manually, but that did not work either. @ezhangsfll @stevenbird

Replace the contents of the file (__init__.py) with the following:

# -*- coding: utf-8 -*-
# Natural Language Toolkit: Taggers
#
# Copyright (C) 2001-2019 NLTK Project
# Author: Edward Loper [email protected]
#         Steven Bird [email protected] (minor additions)
# URL: http://nltk.org/
# For license information, see LICENSE.TXT

"""
NLTK Taggers

This package contains classes and interfaces for part-of-speech
tagging, or simply "tagging".

A "tag" is a case-sensitive string that specifies some property of a token,
such as its part of speech. Tagged tokens are encoded as tuples
(token, tag). For example, the following tagged token combines
the word 'fly' with a noun part-of-speech tag ('NN'):

>>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

A Russian tagger is also available if you specify lang="rus". It uses
the Russian National Corpus tagset:

>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')    # doctest: +SKIP
[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),
('бумажку', 'S'), ('.', 'NONLEX')]

This package defines several taggers, which take a list of tokens,
assign a tag to each one, and return the resulting list of tagged tokens.
Most of the taggers are built automatically based on a training corpus.
For example, the unigram tagger tags each word w by checking what
the most frequent tag for w was in a training corpus:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

Note that words that the tagger has not seen during training receive a tag
of None.

We evaluate a tagger on data that was not seen during training:

>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.73...

For more information, please consult chapter 5 of the NLTK Book.
"""
from __future__ import print_function

from nltk.tag.api import TaggerI
from nltk.tag.util import str2tuple, tuple2str, untag
from nltk.tag.sequential import (
    SequentialBackoffTagger,
    ContextTagger,
    DefaultTagger,
    NgramTagger,
    UnigramTagger,
    BigramTagger,
    TrigramTagger,
    AffixTagger,
    RegexpTagger,
    ClassifierBasedTagger,
    ClassifierBasedPOSTagger,
)
from nltk.tag.brill import BrillTagger
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.tnt import TnT
from nltk.tag.hunpos import HunposTagger
from nltk.tag.stanford import StanfordTagger, StanfordPOSTagger, StanfordNERTagger
from nltk.tag.hmm import HiddenMarkovModelTagger, HiddenMarkovModelTrainer
from nltk.tag.senna import SennaTagger, SennaChunkTagger, SennaNERTagger
from nltk.tag.mapping import tagset_mapping, map_tag
from nltk.tag.crf import CRFTagger
from nltk.tag.perceptron import PerceptronTagger

from nltk.data import load, find

RUS_PICKLE = (
    'taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle'
)

def _get_tagger(lang=None):
    if lang == 'rus':
        tagger = PerceptronTagger(False)
        ap_russian_model_loc = 'file:' + str(find(RUS_PICKLE))
        tagger.load(ap_russian_model_loc)
    else:
        tagger = PerceptronTagger()
    return tagger

def _pos_tag(tokens, tagset=None, tagger=None, lang=None):
    # Currently only supports English and Russian.
    if lang not in ['eng', 'rus']:
        raise NotImplementedError(
            "Currently, NLTK pos_tag only supports English and Russian "
            "(i.e. lang='eng' or lang='rus')"
        )
    else:
        tagged_tokens = tagger.tag(tokens)
        if tagset:  # Maps to the specified tagset.
            if lang == 'eng':
                tagged_tokens = [
                    (token, map_tag('en-ptb', tagset, tag))
                    for (token, tag) in tagged_tokens
                ]
            elif lang == 'rus':
                # Note that the new Russian POS tags from the model contain suffixes,
                # see https://github.com/nltk/nltk/issues/2151#issuecomment-430709018
                tagged_tokens = [
                    (token, map_tag('ru-rnc-new', tagset, tag.partition('=')[0]))
                    for (token, tag) in tagged_tokens
                ]
        return tagged_tokens

def pos_tag(tokens, tagset=None, lang='eng'):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.

    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be used, e.g. universal, wsj, brown
    :type tagset: str
    :param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
    :type lang: str
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = _get_tagger(lang)
    return _pos_tag(tokens, tagset, tagger, lang)

def pos_tag_sents(sentences, tagset=None, lang='eng'):
    """
    Use NLTK's currently recommended part of speech tagger to tag the
    given list of sentences, each consisting of a list of tokens.

    :param sentences: List of sentences to be tagged
    :type sentences: list(list(str))
    :param tagset: the tagset to be used, e.g. universal, wsj, brown
    :type tagset: str
    :param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
    :type lang: str
    :return: The list of tagged sentences
    :rtype: list(list(tuple(str, str)))
    """
    tagger = _get_tagger(lang)
    # lang is forwarded here; omitting it leaves _pos_tag() with lang=None.
    return [_pos_tag(sent, tagset, tagger, lang) for sent in sentences]
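
For anyone who cannot replace the installed file, one possible workaround is to call pos_tag() once per sentence, since pos_tag() in the file above forwards lang to _pos_tag(). This is only a sketch under that assumption: `pos_tag_sents_workaround`, its `tag_fn` parameter, and the `echo_lang` stub are hypothetical names introduced here so the wrapper can be exercised without NLTK's model data.

```python
def pos_tag_sents_workaround(sentences, tagset=None, lang="eng", tag_fn=None):
    """Tag each sentence through a single-sentence tagging function.

    tag_fn is a hypothetical hook; when omitted, nltk.tag.pos_tag is used.
    """
    if tag_fn is None:
        # Imported lazily so the wrapper can be tried without NLTK installed.
        from nltk.tag import pos_tag as tag_fn
    # lang is passed explicitly on every call, so it cannot be dropped.
    return [tag_fn(sent, tagset=tagset, lang=lang) for sent in sentences]


def echo_lang(tokens, tagset=None, lang=None):
    # Stub tagger that records which lang value it received.
    return [(token, lang) for token in tokens]


print(pos_tag_sents_workaround([["Hello", "world"]], tag_fn=echo_lang))
```

Because pos_tag() calls _get_tagger(lang) on every invocation, this reloads the tagger for each sentence and will be slower than a corrected pos_tag_sents().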