Nltk: Porter stemmer returns a capitalized form instead of lowercase

Created on 2020-02-25 · 5 comments · Source: nltk/nltk

This output is unexpected. For the input 'In', PorterStemmer returns the capitalized 'In' rather than a lowercase stem.

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'

More details at https://stackoverflow.com/q/60387288/610569

goodfirstbug stelemma


alvations

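All 5 comments

For any stemming, shouldn't we first convert the words to lowercase as part of normalization?

brlrb on 2020-02-25

Another example of capitalized output, this time with 'Oh':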

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('Oh')
'Oh'

brlrb on 2020-02-25

👍1

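I think it's because the code was originally meant to keep abbreviations in their original form when the word length is 2 or less, like the abbreviations of states.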

def stem(self, word):
    stem = word.lower()

    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]

    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word

    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)

    return stem

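Neither 'In' nor 'Oh' is in self.pool, and both satisfy len(word) <= 2, so they hit the early return, skip the stemming steps, and come back unchanged, capital letter included.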
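
To make that path easier to follow, here is a minimal sketch that mirrors only the branches of the stem() method quoted above. The function name stem_control_flow, its parameters, and the mode strings are hypothetical stand-ins for illustration, not NLTK's API, and steps 1a to 5b are not implemented (the lowercased word stands in for their result).

```python
def stem_control_flow(word, pool=None, mode="NLTK_EXTENSIONS"):
    """Hypothetical stand-in for the branches of the quoted stem() method.

    Returns (result, branch) so the path taken is visible. The real
    stemming steps are elided: the lowercased word stands in for them.
    """
    pool = {} if pool is None else pool
    stem = word.lower()

    # Branch 1: irregular forms are looked up with the ORIGINAL casing.
    if mode == "NLTK_EXTENSIONS" and word in pool:
        return pool[word], "pool lookup"

    # Branch 2: short words return `word`, not the lowercased `stem`.
    if mode != "ORIGINAL_ALGORITHM" and len(word) <= 2:
        return word, "short-word early return"

    # Branch 3: everything else would go through steps 1a-5b on `stem`.
    return stem, "full stemming (elided)"


for w in ["In", "Oh", "Running"]:
    print(w, "->", stem_control_flow(w))
```

In this sketch, 'In' and 'Oh' take the short-word branch and keep their capital letter, while a longer word such as 'Running' is lowercased before the (elided) stemming steps.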

If this hasn't been assigned, can I work on it? This would be my first contribution to a project.

PhanatosZou on 2020-02-27

👍1

@PhanatosZou Feel free to make the change and create a pull request. The main code that needs to change is:

if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return stem

But it would be best to check all the keys of self.pool, and if they are all lowercase, then change this too:

if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[stem]
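
A rough sketch of that check, under stated assumptions: non_lowercase_keys and patched_short_circuit are hypothetical helpers, example_pool is an illustrative dictionary rather than NLTK's actual table of irregular forms, and the stemming steps are again elided. It shows why the key check matters: looking up pool[stem] after testing `word in pool` is only safe when every key is already lowercase.

```python
def non_lowercase_keys(pool):
    """Return the keys of `pool` that differ from their lowercased form."""
    return [key for key in pool if key != key.lower()]


def patched_short_circuit(word, pool):
    """Hypothetical stand-in for the two patched branches shown above."""
    stem = word.lower()
    if word in pool:
        # Safe only if every key is lowercase, so that word == stem here.
        return pool[stem]
    if len(word) <= 2:
        return stem  # short words now come back lowercased
    return stem  # the real method would continue with steps 1a-5b


# Illustrative pool, NOT NLTK's actual irregular-forms table.
example_pool = {"skies": "sky", "dying": "die"}

print(non_lowercase_keys(example_pool))             # no offending keys
print(patched_short_circuit("In", example_pool))    # lowercased short word
print(patched_short_circuit("skies", example_pool)) # pool lookup
```

If the check reported a capitalized key such as 'Sky', a call with 'Sky' would pass the membership test and then raise KeyError on pool['sky'], which is why the keys should be verified before making the second change.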

alvations on 2020-02-27

Sounds good! I'll work on it.

PhanatosZou on 2020-02-27
