Nltk: multiprocessing and nltk don't play nicely together

Created on 2015-04-16 · 22 comments · Source: nltk/nltk

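Honestly, this issue is not so much critical as it is curious. I found that when NLTK is imported, any Python child process terminates prematurely on a network call. Example code: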

from multiprocessing import Process
import nltk
import time


def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"


while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)

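Run it with NLTK imported, and you'll see that the urlopen() call never gets executed. Comment out the import nltk line, and it executes fine.

Why?

*edit: This is for Python 2. I have not tested it on 3 yet.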

admin bug inactive multithread / multiprocessing pythonic

Source

oxymor0n

👍1

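Most helpful comment

I'm not intimately familiar with nltk, but I poked around to see what would make the test pass or fail. Here's what I had to do to the package __init__.py to get the test to pass: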

###########################################################
# TOP-LEVEL MODULES
###########################################################

# Import top-level functionality into top-level namespace

from nltk.collocations import *
from nltk.decorators import decorator, memoize
# from nltk.featstruct import *
# from nltk.grammar import *
from nltk.probability import *
from nltk.text import *
# from nltk.tree import *
from nltk.util import *
from nltk.jsontags import *

# ###########################################################
# # PACKAGES
# ###########################################################

# from nltk.chunk import *
# from nltk.classify import *
# from nltk.inference import *
from nltk.metrics import *
# from nltk.parse import *
# from nltk.tag import *
from nltk.tokenize import *
from nltk.translate import *
# from nltk.sem import *
# from nltk.stem import *

# Packages which can be lazily imported
# (a) we don't import *
# (b) they're slow to import or have run-time dependencies
#     that can safely fail at run time

from nltk import lazyimport
app = lazyimport.LazyModule('nltk.app', locals(), globals())
chat = lazyimport.LazyModule('nltk.chat', locals(), globals())
corpus = lazyimport.LazyModule('nltk.corpus', locals(), globals())
draw = lazyimport.LazyModule('nltk.draw', locals(), globals())
toolbox = lazyimport.LazyModule('nltk.toolbox', locals(), globals())

# Optional loading

try:
    import numpy
except ImportError:
    pass
else:
    from nltk import cluster

# from nltk.downloader import download, download_shell
# try:
#     from six.moves import tkinter
# except ImportError:
#     pass
# else:
#     try:
#         from nltk.downloader import download_gui
#     except RuntimeError as e:
#         import warnings
#         warnings.warn("Corpus downloader GUI not loaded "
#                       "(RuntimeError during import: %s)" % str(e))

# explicitly import all top-level modules (ensuring
# they override the same names inadvertently imported
# from a subpackage)

# from nltk import ccg, chunk, classify, collocations
# from nltk import data, featstruct, grammar, help, inference, metrics
# from nltk import misc, parse, probability, sem, stem, wsd
# from nltk import tag, tbl, text, tokenize, translate, tree, treetransforms, util

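Interestingly, all of the disabled imports ultimately lead back to importing tkinter, which I think is the root cause. If I replace import nltk with import tkinter in the test script, I get very similar crash reports, both referencing tkinter.

From what I can tell, these packages import tkinter directly: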

nltk.app
nltk.draw
nltk.sem

From the above changes to the main package __init__, these are the problematic imports, and how they trace back to importing tkinter:

nltk.featstruct (sem)
nltk.grammar (featstruct)
nltk.tree (grammar)
nltk.chunk (chunk.named_entity > tree)
nltk.parse (parse.bllip > tree)
nltk.tag (tag.stanford > parse)
nltk.classify (classify.senna > tag)
nltk.inference (inference.discourse > sem, tag)
nltk.stem (stem.snowball > corpus > corpus.reader.timit > tree)

rpkilby on 2018-05-12

👍3
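
One way to check the tkinter theory on a given installation is to test, in a fresh interpreter, whether importing a package also loads tkinter. The sketch below is not from the thread: the helper name import_pulls_in is hypothetical, it assumes Python 3, and it is demonstrated on the standard-library json package so that it runs without NLTK installed.

```python
import subprocess
import sys


def import_pulls_in(package, needle):
    """Return True if importing `package` in a fresh interpreter also loads `needle`."""
    # The probe runs in a child interpreter so that modules already loaded
    # in the current process cannot influence the answer.
    probe = (
        "import importlib, sys; "
        "importlib.import_module(sys.argv[1]); "
        "print(sys.argv[2] in sys.modules)"
    )
    output = subprocess.check_output([sys.executable, "-c", probe, package, needle])
    return output.decode().strip() == "True"


# Demonstrated on the standard library; json is only a stand-in for nltk.
print(import_pulls_in("json", "json.decoder"))
print(import_pulls_in("json", "tkinter"))
```

Calling import_pulls_in("nltk", "tkinter") would apply the same check to NLTK itself; the result presumably depends on the NLTK version and on whether tkinter is available.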

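All 22 comments

Do you get any exceptions?

dimazest on 2015-04-16

Nope. I put a try .. except: clause around the import urllib2; print... lines, but got nothing.

oxymor0n on 2015-04-16

I've got the exact same issue. I just opened an SO question that might be worth linking here: http :

The child process is indeed crashing silently without further notice.

I disagree with you, @oxymor0n; to me this is a serious issue. It basically means that whenever nltk is imported, there is no way to make a request from a child process, which is really annoying when working with APIs.

Hiestaa on 2015-06-10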

👍1
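
Because the child dies without a traceback, the exit code recorded by multiprocessing is one of the few clues available. The following is a minimal sketch, not code from the thread, assuming Python 3 on a POSIX system: Process.exitcode is negative when the child was killed by a signal, so -11 would point to SIGSEGV. The helper describe_exit and the deliberately crashing child are hypothetical stand-ins for the real failing network call.

```python
from multiprocessing import Process
import os
import signal


def crashing_child():
    # Stand-in for the failing network call: terminate this process with SIGSEGV.
    os.kill(os.getpid(), signal.SIGSEGV)


def describe_exit(exitcode):
    """Turn a multiprocessing exit code into a readable description."""
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "exited cleanly"
    if exitcode < 0:
        # multiprocessing reports death by signal N as exit code -N.
        return "killed by signal %d" % -exitcode
    return "exited with status %d" % exitcode


if __name__ == "__main__":
    child = Process(target=crashing_child)
    child.start()
    child.join()
    print(describe_exit(child.exitcode))
```

Printing the exit code after every join() in the reproduction script would make the otherwise silent crash visible.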

The child process is indeed crashing silently without further notice.

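We are also hitting this with the combination of nltk, gunicorn (with nltk loaded via prefork), and flask.

Remove the nltk import, and everything works. Except nltk.

/cc @escherba

ninowalker on 2015-08-05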

👍2

@ninowalker, @oxymor0n Strange, the code runs fine for me. I got:

Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned

That's the expected output, right?

It doesn't break my requests either:

alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing import Process
>>> import requests
>>> from pprint import pprint
>>> Process(target=lambda: pprint(
...         requests.get('https://api.github.com'))).start()
>>> <Response [200]>

>>> import nltk
>>> Process(target=lambda: pprint(
...         requests.get('https://api.github.com'))).start()
>>> <Response [200]>

I'm using:

python 2.7.6
nltk 3.0.3
Ubuntu 14.04

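alvations on 2015-08-06

I ran into the same issue that @Hiestaa had. I have a helper file, string_util.python, that imports nltk, but it is not used in the main python file, which starts a multi-process crawler with the multiprocessing module. The symptom is that the child process just gets stuck, with no error message (not even an exception message).
After commenting out the nltk-related imports and functions, the problem went away.

Details:
OS: Yosemite 10.10.5
Python: 2.7.10
Retrieving page content: I used urllib2 at first, then later switched to requests.

This is a very serious bug, and I hope someone can step in and fix it. Thanks!

wenbowang on 2016-03-05

I think this is a serious issue if you are doing production-level NLP. We are using Rq (http://python-rq.org/) workers to run multiple NLP pipelines, and they get killed silently when making network calls. I hope it gets fixed soon. Thanks!

sasinda on 2016-10-13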

👍2

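@sasinda: you might want to post to the nltk-dev mailing list to see if you can get some attention on this issue.

stevenbird on 2016-10-13

@sasinda I'm not sure how Rq works, but in my production-level NLP project I managed to work around the issue by starting each process in a separate, isolated python interpreter, using a shell script to spawn them at startup. That way python never has to fork, and the silent crash with nltk never happens. Maybe this helps.

Hiestaa on 2016-10-13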
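
The separate-interpreter workaround described above could look roughly like the following. This is an illustrative sketch, not Hiestaa's actual setup: it assumes Python 3, it uses a Python launcher in place of a shell script, and run_isolated and WORKER_CODE are hypothetical names. The launcher itself never imports nltk; in a real project the worker code would be a script that does.

```python
import subprocess
import sys

# Placeholder for the real worker; a real worker script would import nltk here.
WORKER_CODE = "print('worker started in a fresh interpreter')"


def run_isolated(code):
    """Run `code` in a brand-new Python interpreter; return (returncode, stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
    )
    return result.returncode, result.stdout


returncode, output = run_isolated(WORKER_CODE)
print(returncode, output.strip())
```

Each worker starts from a clean interpreter, so nothing loaded by the launcher is carried over into it.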

I found that performing the import at function level avoids the problem.

In other words, this works:

def split(words):
    import nltk
    return nltk.word_tokenize(words)

This doesn't:

import nltk
def split(words):
    return nltk.word_tokenize(words)

mpenkov on 2016-12-27

👍2

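Thanks @mpenkov. Does this resolve the issue?

stevenbird on 2017-01-04

@stevenbird I don't think so. It's a workaround, but it's not a fix.

IMHO, if importing a third-party library breaks a Python standard library component, something unholy is happening somewhere, and it needs to be fixed.

mpenkov on 2017-01-05

👍1

@mpenkov I'm not entirely sure why this works, but here's another workaround I found. Building an opener in the parent process seems to fix the problem. Modifying @oxymor0n's original code: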

from multiprocessing import Process
import nltk
import time
import urllib2

# HACK
urllib2.build_opener(urllib2.HTTPHandler())

def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"


while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)

sudowork on 2017-01-05

@mpenkov @ninowalker, @oxymor0n @sasinda @wenbowang Are you all still facing the same issue?

I couldn't replicate the problem on my machine:

from multiprocessing import Process
import nltk
import time

def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"


while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)

gives me:

Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-SG"><head><meta cont
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-SG"><head><meta cont
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-SG"><head><meta cont
Done

I'm on:

Python 2.7.12
NLTK 3.2.5
Ubuntu 16.04.3

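alvations on 2017-10-14

@alvations It has been a long time since I ran into this issue.
I have even forgotten which project repo had it, so I can't tell you whether the problem still exists.
Sorry!

wenbowang on 2017-10-14

👍1

@alvations I also don't remember which of my projects suffered from this particular problem.

I ran your code on my machine and could not replicate the problem.

Python 2.7.12
nltk 3.2.1
macOS 10.12.6

mpenkov on 2017-10-14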

👍1

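@alvations I am no longer working on that project either, but I used one of these workarounds.
I tried your code, and for me the child process still exits with a segfault (exit code 11), at the line: urllib2.urlopen("https://www.google.com").read()[:100]

It works with urllib3 (https://urllib3.readthedocs.io/en/latest/).

nltk (3.2.5)
urllib3 (1.22)
Mac OSX 10.12.16
Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin

sasinda on 2017-10-18

As far as I can tell, this issue seems to affect macOS. So far, using Python 3.6:

macOS 10.13 (fails)
Centos 7.2 (succeeds)
Ubuntu 16.04 (succeeds)

The OP's script, modified for python3: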

from multiprocessing import Process
import nltk
import time


def child_fn():
    from urllib.request import urlopen
    print("Fetch URL")
    print(urlopen("https://www.google.com").read()[:100])
    print("Done")


child_process = Process(target=child_fn)
child_process.start()
child_process.join()
print("Child process returned")
time.sleep(1)

Output:

Fetch URL
Child process returned

The child process exits unexpectedly, with output similar to what is shown in this Stack Overflow post.

rpkilby on 2018-05-12

👍1
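
If the crash is tied to fork(), as the macOS-only failures suggest, one thing that could be tried on Python 3 is the "spawn" start method, which starts each child as a fresh interpreter instead of forking the parent (since Python 3.8 this is the default on macOS). This is an untested assumption rather than something proposed in the thread, and the sketch below leaves out the nltk import and the network call so that it runs anywhere.

```python
import multiprocessing


def child_fn():
    # In the real script, the urlopen() call would go here.
    print("child running in a freshly spawned interpreter")


if __name__ == "__main__":
    # Request the "spawn" start method explicitly instead of relying on the
    # platform default; the child re-imports this module rather than
    # inheriting the parent's memory through fork().
    ctx = multiprocessing.get_context("spawn")
    child = ctx.Process(target=child_fn)
    child.start()
    child.join()
    print("Child process returned, exit code:", child.exitcode)
```

The if __name__ == "__main__" guard is required with spawn, because the child imports the main module again on startup.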

I think this is rather perplexing. It might have something to do with how threads are handled on macOS.

alvations on 2018-05-12

👍1

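Thanks @rpkilby, this is very helpful!

It looks like this issue: https://stackoverflow.com/questions/16745507/tkinter-how-to-use-threads-to-preventing-main-event-loop-from-freezing

I think tkinter has been a pain point for us for a long time. It would be good if we could find an alternative to it.

alvations on 2018-05-13

👍1

I agree. A shorter-term solution is to bury the tkinter imports inside the classes and methods that need tkinter, and to avoid importing it in programs that don't need it. We've already done something similar for numpy.

stevenbird on 2018-05-13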

👍2
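
The idea of burying the tkinter import can be sketched as follows. The TreeView class is a hypothetical illustration, not NLTK code, and the sketch assumes Python 3: tkinter is imported only inside the one method that draws, so constructing the object and using its non-GUI methods never loads it.

```python
class TreeView:
    """Hypothetical example class; not part of NLTK."""

    def __init__(self, label):
        # Constructing the object needs no GUI dependency.
        self.label = label

    def describe(self):
        # Pure-Python functionality; tkinter is never touched here.
        return "Tree(%s)" % self.label

    def draw(self):
        # Deferred import: tkinter is loaded only when drawing is requested.
        import tkinter

        root = tkinter.Tk()
        tkinter.Label(root, text=self.label).pack()
        root.mainloop()


view = TreeView("S")
print(view.describe())
```

A program that only calls describe() would then never import tkinter, which is the behaviour the function-level import workaround earlier in the thread relies on as well.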
