Nltk: Porter stemmer returns a capital instead of lowercase

Created on 25 Feb 2020  ·  5 Comments  ·  Source: nltk/nltk

This output is unexpected. PorterStemmer returns the input In still capitalized, as In, instead of lowercasing it.

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'

More details at https://stackoverflow.com/q/60387288/610569

Labels: goodfirstbug, stem/lemma

All 5 comments

For stemming, shouldn't you first convert to lowercase as part of normalization?

Another example of capitalized output, using Oh:

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('Oh')
'Oh'

I think this happens because, when a word is two characters or shorter (a state abbreviation, for example), the code tries to keep the original abbreviated form.

def stem(self, word):
    stem = word.lower()

    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]

    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word

    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)

    return stem

Neither 'In' nor 'Oh' is in self.pool, and both satisfy len(word) <= 2. They therefore hit the early return before any stemming step runs, and the original word, not the lowercased stem, is returned unchanged.
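The early return described above can be reproduced without NLTK. The following is only a minimal sketch: `short_word_guard` is a hypothetical stand-in for the first few lines of the quoted `stem` method, with the mode checks and the stemming steps left out.

```python
def short_word_guard(word, pool=None):
    """Hypothetical, simplified stand-in for the start of the quoted stem()."""
    pool = pool or {}      # assumption: an empty table of irregular forms
    stem = word.lower()    # the lowercased copy is computed first...

    if word in pool:
        return pool[word]

    if len(word) <= 2:
        return word        # ...but the original casing is what gets returned

    return stem            # the real method would run the stemming steps here


print(short_word_guard("In"))    # In
print(short_word_guard("Oh"))    # Oh
print(short_word_guard("Into"))  # into
```

Only inputs of one or two characters keep their capitalization; anything longer continues with the lowercased copy.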

Can I work on this if it's unassigned? This would be my first contribution to the project.

@PhanatosZou Feel free to make the change and create a pull request. The main code that needs to change is this:

if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return stem

But it would also be good to check all the keys of self.pool and, if none of them are capitalized, change this too:

if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[stem]
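Taken together, the two suggested changes could look roughly like the sketch below. It is an illustration under assumptions, not the actual patch: `stem_guard_fixed` and `EXAMPLE_POOL` are hypothetical names, the pool entry is made up for the example, and the membership test uses the lowercased form (the snippet above tests `word`) so that the lookup and the index use the same key.

```python
# Hypothetical, simplified stand-in for the opening of PorterStemmer.stem();
# the mode checks and the actual stemming steps are omitted.
EXAMPLE_POOL = {"dying": "die"}  # illustrative entry only, assumed lowercase keys


def stem_guard_fixed(word, pool=EXAMPLE_POOL):
    stem = word.lower()

    # Assumption: test and index with the same lowercased key.
    if stem in pool:
        return pool[stem]

    if len(word) <= 2:
        return stem  # short words now come back lowercased

    return stem      # the real method would run the stemming steps here


print(stem_guard_fixed("In"))     # in
print(stem_guard_fixed("Oh"))     # oh
print(stem_guard_fixed("Dying"))  # die
```

With this shape, short words and irregular forms are treated the same way regardless of the caller's capitalization.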

Sounds good! I'll work on it.
