Nltk: Porter stemmer's output inconsistent with that of reference implementations

Created on 17 Jan 2012  ·  8 Comments  ·  Source: nltk/nltk

I recently used the NLTK's Porter stemmer and discovered some discrepancies between its output and that of another version of the Porter stemmer that I have used. Following up on these discrepancies, I discovered that there may be some problems with the NLTK's implementation.

There are various reference implementations of the Porter stemmer collected by Martin Porter himself here:

http://tartarus.org/~martin/PorterStemmer/

I tried the one for Ruby to sanity-check the NLTK (it takes a word from standard input and immediately prints its stemmed form to standard output):

scripts $ ruby porter_stemmer.rb
shiny
shini

I checked this against the NLTK:

scripts $ python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import nltk
>>> nltk.stem.porter.PorterStemmer().stem_word('shiny')
'shini'

So far, so good. And if I compare these results to the expected results (output.txt) provided on the same page for a sample user dict file (voc.txt), they look right:

scripts $ egrep -n '^shiny$' voc.txt
18333:shiny
scripts $ egrep -n '^shini$' output.txt
18333:shini

But if I systematically compare the NLTK results against those in the expected results file, I see a lot of discrepancies (format: word, proper expected stemming, bad NLTK stemming marked by asterisk):

scripts $ ./show_bad_stemming_in_nltk.py voc.txt output.txt
abbey abbei *abbey
abbeys abbei *abbey
abed ab *abe
absey absei *absey
[...]
wrying wry *wri
yesterday yesterdai *yesterday
yesterdays yesterdai *yesterday
yongrey yongrei *yongrey
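
The show_bad_stemming_in_nltk.py script itself is not included in this report. As a rough sketch of how such a comparison could be done, assuming voc.txt and output.txt are line-aligned (word N in one file, expected stem N in the other), something like the following would work. The helper names `read_lines`, `find_mismatches`, and `format_mismatch` are made up for illustration, and the stemmer is passed in as a plain callable so the sketch does not depend on any particular NLTK version.

```python
from typing import Callable, Iterable, List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a file into a list of stripped lines (one word or stem per line)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


def find_mismatches(
    words: Iterable[str],
    expected_stems: Iterable[str],
    stem: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected, actual) for every word whose stem differs."""
    mismatches = []
    for word, expected in zip(words, expected_stems):
        actual = stem(word)
        if actual != expected:
            mismatches.append((word, expected, actual))
    return mismatches


def format_mismatch(word: str, expected: str, actual: str) -> str:
    """Format one row as: word, expected stem, unexpected stem marked by '*'."""
    return f"{word} {expected} *{actual}"
```

To compare against the NLTK, one would pass `read_lines("voc.txt")`, `read_lines("output.txt")`, and the stemming method of a `PorterStemmer` instance (`stem_word` in the version reported here, `stem` in later releases), then print each row with `format_mismatch`.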

Spot checking against the Ruby reference implementation confirms that the NLTK results are in fact the problem:

scripts $ ruby porter_stemmer.rb
abbey
abbei
abbeys
abbei
abed
ab
absey
absei
wrying
wry
yesterday
yesterdai
yesterdays
yesterdai
yongrey
yongrei

I've attached the full list of words for which the NLTK's Porter stemmer provides unexpected results.

(This bug pertains to version 2.0b9. I suspect it exists in all previous versions, but I haven't confirmed that.)

Migrated from http://code.google.com/p/nltk/issues/detail?id=625


earlier comments

gregg.lind said, at 2011-02-09T20:22:57.000Z:

I would be happy to take this one on (by friday), if no one else wants it. I have used this module a lot. Gregg Lind

StevenBird1 said, at 2011-02-14T07:14:49.000Z:

Attached are Stuart Robinson's fixes, sent to nltk-dev. These seem to be further edits to the previous version rather than a fresh port of the Ruby version as originally discussed. Before this new version can be incorporated we need a set of test cases added to test/stem.doctest.

All 8 comments

I just put the "goodfirstbug" label on this one. The "good first bug" would be adding a bunch of test cases to stem.doctest (https://github.com/nltk/nltk/blob/master/nltk/test/stem.doctest) -- if you feel like merging in Stuart's fixes too, that would be great!

Hey all, especially @alexrudnick , should this be marked as resolved now?

Upon a skim, looks like no, it's not yet resolved. But would like to hear from a maintainer.

@paulproteus – finally resolved

Note that the _default_ stemmer behavior as of my PR that Steven just merged is unchanged; you need to explicitly pass mode=PorterStemmer.MARTIN_EXTENSIONS to the PorterStemmer constructor to get behavior that's consistent with Martin's reference implementations (which are themselves inconsistent with Martin's original algorithm).
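
As an illustration of the difference described above, here is a small sketch that stems the same words in the default mode and in MARTIN_EXTENSIONS mode. It assumes an NLTK release that includes the `mode` argument; `compare_modes` is a name made up for this example, and no particular output is claimed here.

```python
def compare_modes(words):
    """Return (word, default_stem, martin_stem) for each word."""
    # Imported inside the function so that defining it does not require NLTK.
    # Assumes an NLTK release whose PorterStemmer accepts the `mode` argument.
    from nltk.stem.porter import PorterStemmer

    # No mode given: the default behavior, which the comment above says is unchanged.
    default_stemmer = PorterStemmer()
    # Explicitly request behavior matching Martin's reference implementations.
    martin_stemmer = PorterStemmer(mode=PorterStemmer.MARTIN_EXTENSIONS)

    return [
        (word, default_stemmer.stem(word), martin_stemmer.stem(word))
        for word in words
    ]
```

Calling `compare_modes(["abbey", "abed", "wrying", "yesterday"])` on words from the original report should make it easy to see which stems depend on the mode.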

Arguably, having MARTIN_EXTENSIONS as the default mode (for consistency with the reference implementations) would be better, since users are going to expect something called PorterStemmer to behave out of the box like Martin's reference implementations. The trouble is that that would be a backwards compatibility break, and a nastily non-obvious one; somebody using NLTK's previous implementation who upgrades NLTK could fail to notice for a long time that just a few of their stems have changed, possibly introducing subtle bugs depending upon their use case. Another option is to not have a default value at all, and require every user to read the docs on the different modes and explicitly choose which one to use. This would break backwards-compatibility too, but would do so in an _obvious_ way (just trying to instantiate the stemmer the old way would blow up) so people doing upgrades without reading the release notes wouldn't be caught out. That approach would avoid either of the bad scenarios above, but at the cost of demanding more up-front work from every new user of the stemmer.

None of the options is perfect. I've opted for having NLTK_EXTENSIONS as the default, but there's room to disagree. Anybody have an opinion?
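
To make the "no default" option concrete, here is a minimal sketch of a constructor that refuses to guess a mode. `ExplicitModeStemmer` is a hypothetical stand-in written for this illustration, not NLTK's actual class, and it only models the constructor, not the stemming logic.

```python
class ExplicitModeStemmer:
    """Hypothetical stand-in for a stemmer with no default mode."""

    # Mode names taken from this thread; the string values are illustrative.
    NLTK_EXTENSIONS = "NLTK_EXTENSIONS"
    MARTIN_EXTENSIONS = "MARTIN_EXTENSIONS"
    _MODES = (NLTK_EXTENSIONS, MARTIN_EXTENSIONS)

    def __init__(self, mode):
        # `mode` has no default value, so ExplicitModeStemmer() raises
        # TypeError immediately instead of silently picking a behavior.
        if mode not in self._MODES:
            raise ValueError(f"mode must be one of {self._MODES}, got {mode!r}")
        self.mode = mode
```

With this shape, old-style code that instantiates the stemmer with no arguments fails loudly on upgrade, which is the obvious kind of breakage described above, while new users have to pick a mode on purpose.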

@ExplodingCabbage: I favour your last option, no default, and doing that with the next major release (not the minor release where your work will first appear), with suitable warnings in the release notes. I think it would be good for people to be forced to read the docs and benefit from your work. I'm curious to know what others think.

Hi guys,

I've been writing a search engine backend with the default PorterStemmer in nltk, not knowing that it doesn't behave the same way as many other Porter stemmer implementations. Now that I'm working on the front-end using JavaScript, I'm running into bugs where the frontend and my backend stem words differently. I was wondering what I should look at if I need to recreate nltk's default PorterStemmer behavior in JavaScript so I can run it in the browser. I was hoping maybe someone here (maybe @ExplodingCabbage ?) could point me in the right direction.

I really don't have time to re-index everything with the MARTIN_EXTENSIONS mode as it would take weeks to do it...

@josephcc porter.py has basically no dependencies and isn't doing anything profound or magical, just a long series of string manipulations. You'll just want to port the PorterStemmer class to JavaScript, preserving only the if self.mode == self.NLTK_EXTENSIONS branches and throwing away the logic from the others.

Not a 5-minute task or a foolproof one, admittedly, but doable. Also check out PorterTest in https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_stem.py and consider running the test cases there against your JavaScript implementation to check the correctness of your work.
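
In addition to the existing test cases, one way to check a port is to export word-to-stem pairs from the Python side and replay them against the JavaScript code. The sketch below assumes the port can load a JSON file; `build_fixture` is a name invented for this example, and the stemmer is passed in as a callable.

```python
import json


def build_fixture(words, stem):
    """Return a JSON object mapping each word to the stem produced by `stem`."""
    # `stem` is any callable taking a word and returning its stem, for example
    # the stem method of a default-mode PorterStemmer instance.
    pairs = {word: stem(word) for word in words}
    return json.dumps(pairs, indent=2, sort_keys=True)
```

Writing the result of `build_fixture(vocabulary, PorterStemmer().stem)` to a file gives the JavaScript tests a reference to compare against, using the same default mode as the existing index.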
