Nltk: language parameter is not passed through in nltk.tag.__init__.pos_tag_sents()

Created on 20 Nov 2018  ·  5 comments  ·  Source: nltk/nltk

The lang parameter of pos_tag_sents() in nltk/tag/__init__.py is not being passed through.

Combined with the exception reordering in commit 69583ceaaaff7e51dd9f07f4f226d3a2b75bea69 (lines 110-116 of nltk/tag/__init__.py), tagging a sentence now raises NotImplementedError("Currently, NLTK pos_tag only supports English and Russian (i.e. lang='eng' or lang='rus')").
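The failure can be reproduced in miniature without NLTK installed. The sketch below is a hedged stand-in, not the real library code: `_pos_tag` mirrors the NLTK function's behavior of validating `lang` before tagging, while the dummy tagging and the `_buggy`/`_fixed` function names are mine, for illustration only.

```python
def _pos_tag(tokens, tagset=None, tagger=None, lang=None):
    # Stand-in for nltk.tag._pos_tag: after the commit mentioned above,
    # the language check runs first, so a missing `lang` raises
    # before any tagging is attempted.
    if lang not in ['eng', 'rus']:
        raise NotImplementedError(
            "Currently, NLTK pos_tag only supports English and Russian "
            "(i.e. lang='eng' or lang='rus')"
        )
    return [(tok, 'NN') for tok in tokens]  # dummy tags, illustration only

def pos_tag_sents_buggy(sentences, tagset=None, lang='eng'):
    # The reported bug: `lang` is accepted but never forwarded,
    # so _pos_tag sees its default lang=None and raises.
    return [_pos_tag(sent, tagset, None) for sent in sentences]

def pos_tag_sents_fixed(sentences, tagset=None, lang='eng'):
    # The fix merged in PR #2186: forward `lang` to _pos_tag.
    return [_pos_tag(sent, tagset, None, lang) for sent in sentences]

try:
    pos_tag_sents_buggy([['hello', 'world']])
except NotImplementedError as exc:
    print('buggy version raises:', type(exc).__name__)

print(pos_tag_sents_fixed([['hello', 'world']]))
# [[('hello', 'NN'), ('world', 'NN')]]
```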

Most helpful comment

The last release was on the 17th, but this was merged later.

All 5 comments

Pull request: https://github.com/nltk/nltk/pull/2186

@ezhangsfl ๊ฐ์‚ฌ

์ตœ์‹  ํŒŒ์ผ๋กœ ์—…๋ฐ์ดํŠธํ•˜๊ณ  lang='eng' ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ˆ˜๋™์œผ๋กœ ์ถ”๊ฐ€ํ•˜๋ ค๊ณ  ์‹œ๋„ํ–ˆ์ง€๋งŒ ์ด ์˜ค๋ฅ˜๊ฐ€ ๊ณ„์† ํ‘œ์‹œ๋˜์ง€๋งŒ ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. @ezhangsflll @stevenbird

Replace the contents of the __init__.py file with the following:

```python
# -*- coding: utf-8 -*-
# Natural Language Toolkit: Taggers
#
# Copyright (C) 2001-2019 NLTK Project
# Author: Edward Loper <[email protected]>
#         Steven Bird <[email protected]> (minor additions)
# URL: <http://nltk.org/>
# For license information, see LICENSE.TXT
"""
NLTK Taggers

This package contains classes and interfaces for part-of-speech
tagging, or simply "tagging".

A "tag" is a case-sensitive string that specifies some property of a
token, such as its part of speech.  Tagged tokens are encoded as tuples
``(tag, token)``.  For example, the following tagged token combines
the word ``'fly'`` with a noun part of speech tag (``'NN'``):

    >>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

    >>> from nltk import pos_tag, word_tokenize
    >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
    [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
    ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

A Russian tagger is also available if you specify lang="rus". It uses
the Russian National Corpus tagset:

    >>> pos_tag(word_tokenize("ะ˜ะปัŒั ะพั‚ะพั€ะพะฟะตะป ะธ ะดะฒะฐะถะดั‹ ะฟะตั€ะตั‡ะธั‚ะฐะป ะฑัƒะผะฐะถะบัƒ."), lang='rus')    # doctest: +SKIP
    [('ะ˜ะปัŒั', 'S'), ('ะพั‚ะพั€ะพะฟะตะป', 'V'), ('ะธ', 'CONJ'), ('ะดะฒะฐะถะดั‹', 'ADV'), ('ะฟะตั€ะตั‡ะธั‚ะฐะป', 'V'),
    ('ะฑัƒะผะฐะถะบัƒ', 'S'), ('.', 'NONLEX')]

This package defines several taggers, which take a list of tokens,
assign a tag to each one, and return the resulting list of tagged tokens.
Most of the taggers are built automatically based on a training corpus.
For example, the unigram tagger tags each word w by checking what
the most frequent tag for w was in a training corpus:

    >>> from nltk.corpus import brown
    >>> from nltk.tag import UnigramTagger
    >>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
    >>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
    >>> for word, tag in tagger.tag(sent):
    ...     print(word, '->', tag)
    Mitchell -> NP
    decried -> None
    the -> AT
    high -> JJ
    rate -> NN
    of -> IN
    unemployment -> None

Note that words that the tagger has not seen during training receive a
tag of None.

We evaluate a tagger on data that was not seen during training:

    >>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
    0.73...

For more information, please consult chapter 5 of the NLTK Book.
"""
from __future__ import print_function

from nltk.tag.api import TaggerI
from nltk.tag.util import str2tuple, tuple2str, untag
from nltk.tag.sequential import (
    SequentialBackoffTagger,
    ContextTagger,
    DefaultTagger,
    NgramTagger,
    UnigramTagger,
    BigramTagger,
    TrigramTagger,
    AffixTagger,
    RegexpTagger,
    ClassifierBasedTagger,
    ClassifierBasedPOSTagger,
)
from nltk.tag.brill import BrillTagger
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.tnt import TnT
from nltk.tag.hunpos import HunposTagger
from nltk.tag.stanford import StanfordTagger, StanfordPOSTagger, StanfordNERTagger
from nltk.tag.hmm import HiddenMarkovModelTagger, HiddenMarkovModelTrainer
from nltk.tag.senna import SennaTagger, SennaChunkTagger, SennaNERTagger
from nltk.tag.mapping import tagset_mapping, map_tag
from nltk.tag.crf import CRFTagger
from nltk.tag.perceptron import PerceptronTagger

from nltk.data import load, find

RUS_PICKLE = (
    'taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle'
)


def _get_tagger(lang=None):
    if lang == 'rus':
        tagger = PerceptronTagger(False)
        ap_russian_model_loc = 'file:' + str(find(RUS_PICKLE))
        tagger.load(ap_russian_model_loc)
    else:
        tagger = PerceptronTagger()
    return tagger


def _pos_tag(tokens, tagset=None, tagger=None, lang=None):
    # Currently only supports English and Russian.
    if lang not in ['eng', 'rus']:
        raise NotImplementedError(
            "Currently, NLTK pos_tag only supports English and Russian "
            "(i.e. lang='eng' or lang='rus')"
        )
    else:
        tagged_tokens = tagger.tag(tokens)
        if tagset:  # Map to the specified tagset.
            if lang == 'eng':
                tagged_tokens = [
                    (token, map_tag('en-ptb', tagset, tag))
                    for (token, tag) in tagged_tokens
                ]
            elif lang == 'rus':
                # Note that the new Russian pos tags from the model contain suffixes,
                # see https://github.com/nltk/nltk/issues/2151#issuecomment-430709018
                tagged_tokens = [
                    (token, map_tag('ru-rnc-new', tagset, tag.partition('=')[0]))
                    for (token, tag) in tagged_tokens
                ]
        return tagged_tokens


def pos_tag(tokens, tagset=None, lang='eng'):
    """
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.

        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.

    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be used, e.g. universal, wsj, brown
    :type tagset: str
    :param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
    :type lang: str
    :return: The tagged tokens
    :rtype: list(tuple(str, str))
    """
    tagger = _get_tagger(lang)
    return _pos_tag(tokens, tagset, tagger, lang)


def pos_tag_sents(sentences, tagset=None, lang='eng'):
    """
    Use NLTK's currently recommended part of speech tagger to tag the
    given list of sentences, each consisting of a list of tokens.

    :param sentences: List of sentences to be tagged
    :type sentences: list(list(str))
    :param tagset: the tagset to be used, e.g. universal, wsj, brown
    :type tagset: str
    :param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
    :type lang: str
    :return: The list of tagged sentences
    :rtype: list(list(tuple(str, str)))
    """
    tagger = _get_tagger(lang)
    return [_pos_tag(sent, tagset, tagger, lang) for sent in sentences]
```
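One detail worth noting in the Russian branch: tags from the new Russian model can carry feature suffixes after an '=' sign (per the linked issue comment), so the code keeps only the part before '=' via `tag.partition('=')[0]`. A small stand-alone sketch of that normalization; the helper name is mine, not NLTK's, and the example tag is illustrative:

```python
def strip_rnc_suffix(tag):
    # str.partition('=') splits at the first '=' and [0] keeps the
    # part-of-speech prefix; tags without '=' come back unchanged.
    return tag.partition('=')[0]

print(strip_rnc_suffix('S=ะผัƒะถ'))   # S
print(strip_rnc_suffix('V'))       # V
print(strip_rnc_suffix('NONLEX'))  # NONLEX
```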
์ด ํŽ˜์ด์ง€๊ฐ€ ๋„์›€์ด ๋˜์—ˆ๋‚˜์š”?
0 / 5 - 0 ๋“ฑ๊ธ‰