Nltk: Porter stemmer: string index out of range

Created on 2017-01-07 · 12 comments · Source: nltk/nltk

Please see the following stackoverflow post

pleaseverify

Source

jkarimi91

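All 12 comments

For future reference, I am copying/pasting your question here:

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside of a django app view.

However, when stemming the documents inside the django view, I receive an IndexError: string index out of range exception from PorterStemmer().stem() for the string 'oed'. As a result, running the following: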

# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')  # raises IndexError under nltk 3.2.2
    return render(request, 'list.html')

raises the aforementioned error:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

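Now what is really odd is that running the same stemmer on the same string outside of django (be it in a separate python file or an interactive python console) produces no error. In other words: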

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

followed by:

python test.py
# successfully prints 'o'

What is causing this issue?

fievelk on 2017-01-07

👍2

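I have discovered that this issue is specific to nltk version 3.2.2. Originally, I ran test.py, as described above, using ipython rather than python. Somehow, I was able to access the ipython installation in my root environment, //anaconda/bin/ipython, even though I had not installed ipython in my django project's (activated) virtual environment, //anaconda/envs/xkcd/bin/. Consequently, ipython must have been using the nltk installation from my root environment, which was running version 3.2.0.

To clarify, I have discovered that PorterStemmer fails to stem the string 'oed' in nltk version 3.2.2.

As a side note, I was using python 2 in both cases. My root environment uses python 2.7.11 and my django project environment uses python 2.7.13.

jkarimi91 on 2017-01-07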

👍1
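
The mix-up described above, where an ipython from the root environment silently picked up a different nltk install, can be checked by asking the interpreter which executable is running and where a module was imported from. A minimal sketch, assuming a hypothetical helper named describe_module (not part of nltk) and assuming the module exposes a __version__ attribute, as nltk does:

```python
import importlib
import sys


def describe_module(name):
    """Return the interpreter path plus the version and location of a module.

    describe_module is a hypothetical helper written for illustration only.
    """
    module = importlib.import_module(name)
    return {
        "python": sys.executable,
        "module": name,
        # Not every package defines __version__, so fall back to None.
        "version": getattr(module, "__version__", None),
        "file": getattr(module, "__file__", None),
    }
```

Calling describe_module("nltk") once from python and once from ipython would show whether both report the same executable, version, and install path.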

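Hey,
Sorry for this (issue). I mean, I never use github and it happened accidentally. I don't know what I have just triggered!

On Jan 7, 2017 at 11:47 PM, "jkarimi91" notifications@github.com wrote:

I have discovered that this issue is specific to nltk version 3.2.2. Originally, I ran test.py, as described above, using ipython rather than python. Somehow, I was able to access the ipython installation in my root environment //anaconda/bin/ipython even though I did not specify ipython in my currently activated virtual environment //anaconda/envs/xkcd/bin/. Consequently, ipython must have been using the nltk installation defined in my root environment, which was running version 3.2.0.
To clarify, I have discovered that PorterStemmer fails to stem the string 'oed' in nltk version 3.2.2 but not in nltk version 3.2.0. Why, I do not know.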
View it on GitHub: https://github.com/nltk/nltk/issues/1581#issuecomment-271100268

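Ric13RK on 2017-01-07

@ExplodingCabbage could you please look into this issue? The only commit to porter.py that I can see since the 3.2 release is d8402e3f43ce3b7a3c7ecb45c3b8b1f75c7124e2.

fievelk on 2017-01-07

This is the code used in the example provided by @jkarimi91.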

from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

Debugging the code above with pdb from inside _apply_rule_list() in porter.py, after a few iterations you get:

>>> rule
(u'at', u'ate', None)
>>> word
u'o'

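At this point, the _ends_double_consonant() method tries to evaluate word[-1] == word[-2] and fails, because word is the one-character string u'o' and has no second-to-last character.

If I am not mistaken, in NLTK 3.2 the corresponding method was the following: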

def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):      
        return False        
    return self._cons(word, len(word)-1)

As far as I can see, the len(word) < 2 check is missing in the new version.

Change _ends_double_consonant() to:

def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )

fievelk on 2017-01-07
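
The failure and the proposed guard can be reproduced without NLTK. What follows is a simplified, standalone sketch: is_consonant, ends_double_consonant_unguarded and ends_double_consonant_guarded are hypothetical free functions written for illustration, not the methods in porter.py, and the consonant rule is an assumed rendering of the usual Porter definition (y counts as a consonant only when it is the first letter or the preceding letter is not a consonant).

```python
VOWELS = frozenset("aeiou")


def is_consonant(word, i):
    """Porter-style consonant test for word[i] (simplified sketch)."""
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        # 'y' is a consonant at position 0 or when the previous letter
        # is not itself a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant_unguarded(word):
    """Mirrors the failing expression: indexes word[-2] unconditionally."""
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


def ends_double_consonant_guarded(word):
    """Same test, but words shorter than two characters return False."""
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)
```

With the one-character word 'o' seen in the pdb session, the unguarded version raises IndexError on word[-2], while the guarded version returns False before any indexing takes place.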

Whoops. Yes, it looks like I broke this in https://github.com/nltk/nltk/commit/d8402e3f43ce3b7a3c7ecb45c3b8b1f75c7124e2 :(

Will test and fix with a PR tonight.

ExplodingCabbage on 2017-01-07

👍3

Thanks @jkarimi91, @fievelk, @ExplodingCabbage

stevenbird on 2017-01-07

👍1

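Hi, I ran into the exact same issue today. Could you please suggest how I can resolve it? Should I update any packages?

santoshbs on 2017-02-10

Hi @santoshbs. You can use the master version of NLTK or release 3.2.1 to get rid of the bug; it exists only in 3.2.2.

ExplodingCabbage on 2017-02-10

@ExplodingCabbage I think you mean the develop branch (not master). I guess it is easy to get confused :)

fievelk on 2017-02-10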

👍2

@fievelk You are right. Sorry, yes: you can use the develop branch or 3.2.1 to get rid of the bug.

ExplodingCabbage on 2017-02-10

👍3
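
Where switching to 3.2.1 or the develop branch is not immediately possible, one stopgap is to catch the exception around the call and fall back to the unstemmed word. This is only a sketch of a workaround and was not proposed in the thread: safe_stem is a hypothetical helper, and it assumes nothing more than a stemmer object with a stem(word) method, which PorterStemmer provides.

```python
def safe_stem(stemmer, word):
    """Stem word, returning it unchanged if the stemmer raises IndexError.

    Hypothetical helper; stemmer is any object with a stem(word) method.
    """
    try:
        return stemmer.stem(word)
    except IndexError:
        # Affected releases can raise on some short inputs such as 'oed'.
        return word
```

A call such as safe_stem(PorterStemmer(), 'oed') would then return 'oed' on an affected install instead of raising. Note that this differs from the 'o' that unaffected versions produce, so moving to a fixed version remains the better option.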

Thanks a lot for the pointer.

santoshbs on 2017-02-10
