This output is unexpected: stemming 'In' returns the capitalized 'In' from PorterStemmer.
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('In')
'In'
More details on https://stackoverflow.com/q/60387288/610569
Before any stemming, aren't we supposed to convert the words to lowercase as part of normalization?
Another example of capitalized output, using 'Oh':
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem('Oh')
'Oh'
I think this is because the code was originally meant to preserve the original form of words of length 2 or less, such as state abbreviations.
def stem(self, word):
    stem = word.lower()
    if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
        return self.pool[word]
    if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
        # With this line, strings of length 1 or 2 don't go through
        # the stemming process, although no mention is made of this
        # in the published algorithm.
        return word
    stem = self._step1a(stem)
    stem = self._step1b(stem)
    stem = self._step1c(stem)
    stem = self._step2(stem)
    stem = self._step3(stem)
    stem = self._step4(stem)
    stem = self._step5a(stem)
    stem = self._step5b(stem)
    return stem
Neither 'In' nor 'Oh' is in self.pool, and both satisfy len(word) <= 2, so stem() hits the early return and hands back the original word, not the lowercased copy. None of the stemming steps run, which is why the capitalization is preserved.
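To make the control flow concrete, here is a minimal toy model of that early-return branch. It is not the NLTK implementation: POOL and toy_stem are hypothetical names, the pool contents are made up, and the real stemming steps are replaced by a placeholder.

```python
# Toy model of the early return in stem(); NOT the NLTK implementation.
# POOL is a hypothetical stand-in for self.pool (irregular forms).
POOL = {"skies": "sky", "dying": "die"}


def toy_stem(word):
    stem = word.lower()
    if word in POOL:
        return POOL[word]
    if len(word) <= 2:
        # The original string is returned here, not the lowercased copy,
        # so short words keep whatever case the caller passed in.
        return word
    # Placeholder for steps 1a-5b, which operate on the lowercased copy.
    return stem


print(toy_stem("In"))   # In
print(toy_stem("Oh"))   # Oh
print(toy_stem("The"))  # the
```

Only inputs of length 1 or 2 come back with their case intact; anything longer is lowercased before the (omitted) steps run.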
If it's not assigned, may I work on this? It's my first time contributing to a project.
@PhanatosZou feel free to make the changes and create a pull request. The main code that needs to be changed is:
if self.mode != self.ORIGINAL_ALGORITHM and len(word) <= 2:
    # With this line, strings of length 1 or 2 don't go through
    # the stemming process, although no mention is made of this
    # in the published algorithm.
    return stem
But it would be good to check all the keys in self.pool first, and if they are all lowercase, change this too:
if self.mode == self.NLTK_EXTENSIONS and word in self.pool:
    return self.pool[stem]
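A rough sketch of how the proposed change could behave, again on a toy model and not on the actual NLTK class. POOL and toy_stem_fixed are hypothetical names, and testing membership with the lowercased key (stem in POOL) is my own assumption, not something settled in this thread.

```python
# Toy model of the proposed change; NOT the NLTK implementation.
# POOL is a hypothetical stand-in for self.pool.
POOL = {"skies": "sky", "dying": "die"}

# The check suggested above: are all pool keys lowercase?
all_lowercase = all(key == key.lower() for key in POOL)


def toy_stem_fixed(word):
    stem = word.lower()
    # Assumption: look up the lowercased key, which is only safe
    # if every key in the pool is lowercase.
    if stem in POOL:
        return POOL[stem]
    if len(word) <= 2:
        # Return the lowercased copy instead of the original word.
        return stem
    # Placeholder for steps 1a-5b.
    return stem


print(all_lowercase)            # True
print(toy_stem_fixed("In"))     # in
print(toy_stem_fixed("Oh"))     # oh
print(toy_stem_fixed("Skies"))  # sky
```

With both changes, short words and capitalized irregular forms come back lowercased, consistent with longer words.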
Sounds good! I'll work on it.