Nltk: Porter stemmer returns a capital instead of lowercase

Created on 25 Feb 2020 · 5 Comments · Source: nltk/nltk

This output is unexpected: stemming 'In' with PorterStemmer returns the capitalized 'In' instead of the lowercase 'in'.

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'

More details on https://stackoverflow.com/q/60387288/610569

goodfirstbug stem/lemma

alvations

All 5 comments

For any stemming, aren't we supposed to convert the words to lowercase first, as part of normalization?

brlrb on 25 Feb 2020

Another example of capitalized output, using 'Oh':

>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('Oh')
'Oh'

brlrb on 25 Feb 2020

I think it's because the code was originally meant to preserve the original form of abbreviations when the word length is 2 or less, such as abbreviations for states.

def stem(self, word):
    stem = word.lower()

    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]

    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word

    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)

    return stem

Neither 'In' nor 'Oh' is in self.pool, and len(word) <= 2 holds for both, so (unless the mode is ORIGINAL_ALGORITHM) the method returns the original word before any stemming step runs. That is why the capitalization is kept.
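
The control flow described above can be reproduced without NLTK. The following is only a minimal sketch, not the library's code: `short_word_branch` and `EXAMPLE_POOL` are hypothetical names, the mode checks are assumed to pass, the pool content is made up, and the stemming steps are left out.

```python
# Hypothetical stand-in for the early-return logic quoted above.
# It is not NLTK code: the mode checks are assumed to pass, the pool
# content is made up, and the _step* calls are omitted.
EXAMPLE_POOL = {"skies": "sky"}  # assumed sample entry with a lowercase key


def short_word_branch(word, pool=EXAMPLE_POOL):
    stem = word.lower()

    if word in pool:
        return pool[word]

    if len(word) <= 2:
        # Returns the caller's original string, so its case is kept.
        return word

    # The real method would run the stemming steps on `stem` here.
    return stem


print(short_word_branch("In"))    # In
print(short_word_branch("Oh"))    # Oh
print(short_word_branch("Into"))  # into
```

Only the two-letter inputs keep their capital letter, because they leave through the branch that returns `word` rather than `stem`.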

If it's not assigned, may I work on this? It's my first time contributing to a project.

PhanatosZou on 27 Feb 2020

@PhanatosZou feel free to make the changes and create a pull request. The main change needed is to return stem instead of word in this branch:

if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return stem

It would also be good to check all the keys in self.pool and, if they are all lowercase, make the same change here:

if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[stem]
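
As a rough illustration of the two suggestions above, the sketch below checks that a pool's keys are all lowercase and then applies both changes. It is a sketch under stated assumptions and not the actual patch: `pool_keys_are_lowercase`, `patched_short_word_branch` and `EXAMPLE_POOL` are hypothetical names, the mode checks are assumed to pass, and the stemming steps are omitted.

```python
# Hypothetical sketch of the changes suggested above; not NLTK code.
# Mode checks are assumed to pass and the _step* calls are omitted.
EXAMPLE_POOL = {"skies": "sky"}  # assumed sample entry


def pool_keys_are_lowercase(pool):
    """Return True if every key equals its own lowercase form."""
    return all(key == key.lower() for key in pool)


def patched_short_word_branch(word, pool=EXAMPLE_POOL):
    stem = word.lower()

    if word in pool:
        # Indexing with `stem` is safe only when all keys are lowercase:
        # any `word` that matched a key is then already equal to `stem`.
        return pool[stem]

    if len(word) <= 2:
        # Return the lowercased form instead of the original string.
        return stem

    # The real method would run the stemming steps on `stem` first.
    return stem


assert pool_keys_are_lowercase(EXAMPLE_POOL)
print(patched_short_word_branch("In"))  # in
print(patched_short_word_branch("Oh"))  # oh
```

Note that the membership test still uses `word`, so a capitalized irregular form would skip the pool lookup; whether the test should use `stem` instead is not discussed in this thread.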

alvations on 27 Feb 2020

Sounds good! I'll work on it.

PhanatosZou on 27 Feb 2020
