Nltk: Porter stemmer returns uppercase instead of lowercase

Created on Feb 25, 2020 · 5 comments · Source: nltk/nltk

This output is unexpected. For the input 'In', PorterStemmer returns the capitalized 'In' instead of a lowercased result.

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'

More details at https://stackoverflow.com/q/60387288/610569

goodfirstbug stelemma


alvations

All 5 comments

For stemming, isn't the input supposed to be converted to lowercase first as part of normalization?

brlrb on Feb 25, 2020

Another example of capitalized output, this time with 'Oh':

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('Oh')
'Oh'

brlrb on Feb 25, 2020

👍1

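I think the original reason is that when a word is two characters or shorter, like a state abbreviation, they wanted to keep the abbreviation in its original form.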

def stem(self, word):
    stem = word.lower()

    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]

    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word

    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)

    return stem

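Neither 'In' nor 'Oh' is in self.pool, and both satisfy len(word) <= 2. The method therefore returns the original word rather than the lowercased stem, and none of the stemming steps run, which is why the words come back unchanged.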
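The control flow described above can be modeled in a few lines. This is only a simplified sketch, not NLTK's code: the function name stem_before_fix, the pool argument, and the long_word_stemmer placeholder (which stands in for steps 1a to 5b and defaults to the identity) are all assumptions made for illustration.

```python
def stem_before_fix(word, pool=None, long_word_stemmer=str):
    """Simplified model of the guard logic quoted above (hypothetical, not NLTK's code)."""
    pool = pool or {}
    stem = word.lower()

    # Irregular forms are looked up using the original, un-lowercased word.
    if word in pool:
        return pool[word]

    if len(word) <= 2:
        # The early return hands back `word`, so the lowercased `stem` is discarded.
        return word

    # Placeholder for steps 1a to 5b; only the lowercased form reaches them.
    return long_word_stemmer(stem)


print(stem_before_fix("In"))   # In
print(stem_before_fix("Oh"))   # Oh
print(stem_before_fix("The"))  # the
```

Only words of three or more characters ever leave the function in lowercase, which matches the behavior reported in this issue.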

If this is not assigned to anyone, can I work on it? This is my first time contributing to a project.

PhanatosZou on Feb 27, 2020

👍1

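@PhanatosZou, feel free to make the changes and create a pull request. The main code that needs to change is the short-word check, which should return stem instead of word: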

if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return stem

However, I would also suggest checking all the keys in self.pool and, if none of them are capitalized, making this change as well:

if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[stem]
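A rough sketch of how the two suggested changes fit together, under the same simplifying assumptions as the snippets above. The names stem_after_fix and pool_keys_are_lowercase, the long_word_stemmer placeholder, and the sample pool are hypothetical and used for illustration only; this is not the patch that was merged.

```python
def pool_keys_are_lowercase(pool):
    """Return True if no key in the pool contains an uppercase character."""
    return all(key == key.lower() for key in pool)


def stem_after_fix(word, pool=None, long_word_stemmer=str):
    """Simplified model of the suggested change (hypothetical, not NLTK's code)."""
    pool = pool or {}
    stem = word.lower()

    # Safe only when every pool key is lowercase: then `word in pool`
    # implies `word == stem`, so pool[stem] cannot raise KeyError.
    if word in pool:
        return pool[stem]

    if len(word) <= 2:
        # Short words still skip the stemming steps, but are now lowercased.
        return stem

    # Placeholder for steps 1a to 5b.
    return long_word_stemmer(stem)


sample_pool = {"sky": "sky"}  # illustrative entry, not the real self.pool
print(pool_keys_are_lowercase(sample_pool))  # True
print(stem_after_fix("In", sample_pool))     # in
print(stem_after_fix("Oh", sample_pool))     # oh
```

The key check is the reason for the caveat about self.pool: if a capitalized key existed, a lookup by the lowercased stem could miss it.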

alvations on Feb 27, 2020

Sounds good! I will work on it.

PhanatosZou on Feb 27, 2020
