Nltk: multiprocessing and nltk don't play nicely together

Created on 16 Apr 2015  ·  22 Comments  ·  Source: nltk/nltk

Honestly, this issue is not so much serious as it is curious. I've discovered that when NLTK is imported, it causes any Python subprocess to terminate prematurely on a network call. Example code:

from multiprocessing import Process
import nltk
import time


def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"


while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)

Run it with NLTK imported, and you'll see that the urlopen() call never gets executed. Comment out the import nltk line, and it executes fine.

Why?

*edit: this is for Python 2. I haven't tested it on 3 yet.

Labels: admin, bug, inactive, multithread / multiprocessing, pythonic

All 22 comments

Do you get any exceptions?

No. I put a try..except clause around the import urllib2; print... lines, but got nothing from it.

I'm running into the exact same problem. I just opened a SO question that may be useful to link here: http://stackoverflow.com/questions/30766419/python-child-process-silently-crashes-when-issuing-an-http-request

The child process is indeed crashing silently without further notice.

I disagree with you @oxymor0n, this seems like quite a serious issue to me. It basically means that whenever nltk is imported, there is no way to issue a request from a child process, which can be really annoying when working with APIs.

The child process is indeed crashing silently without further notice.

We are also experiencing this issue with the combination of: nltk, gunicorn (with nltk loaded via prefork), and flask.

Remove the nltk import, and everything works. Except nltk.

/cc @escherba

@ninowalker, @oxymor0n It's strange, my process runs fine with the code. I get:

Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned

That's the expected output, right?

It doesn't break my requests either:

alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing import Process
>>> import requests
>>> from pprint import pprint
>>> Process(target=lambda: pprint(
...         requests.get('https://api.github.com'))).start()
>>> <Response [200]>

>>> import nltk
>>> Process(target=lambda: pprint(
...         requests.get('https://api.github.com'))).start()
>>> <Response [200]>

I'm using:

  • python 2.7.6
  • nltk 3.0.3
  • ubuntu 14.04

I ran into the same problem that @Hiestaa had. I have a helper file string_util.py that imports nltk, but it is not used in the main Python file, which uses the multiprocessing module to start a multi-process crawler. The symptom is that the child process just gets stuck, with no error message (not even an exception).
After commenting out the nltk-related imports and functions, the problem was resolved.

Details:
OS: Yosemite 10.10.5
Python: 2.7.10
Retrieve page content: I used urllib2 initially, then switched to requests later.

This is a very serious bug, and I hope somebody can step in and fix it. Thanks!

I think this is a serious problem if you are doing production-level NLP. We are using RQ (http://python-rq.org/) workers to run multiple NLP pipelines, which get silently killed when making network calls. Hope there will be a fix soon. Thanks!

@sasinda: You might like to put out a call on the nltk-dev mailing list to see if you can get some attention to this issue.

@sasinda I'm not sure how RQ works exactly, but in my production-level NLP project I managed to work around this issue by starting each process in a separate, isolated Python interpreter, using a shell script to spawn them at start-up. In this case Python never has to fork, and the silent crash from nltk never happens. Maybe this can help in the meantime.
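
In the same spirit, here is a minimal Python 3 sketch (my own untested suggestion, not something confirmed in this thread) that avoids fork() entirely by opting into the standard 'spawn' start method, which launches a fresh interpreter for each child:

from multiprocessing import get_context


def child_fn():
    from urllib.request import urlopen
    print(urlopen("https://www.google.com").read()[:100])


if __name__ == "__main__":
    # "spawn" starts a brand-new interpreter per child instead of forking,
    # so whatever state nltk (or tkinter) set up in the parent is never inherited.
    ctx = get_context("spawn")
    child_process = ctx.Process(target=child_fn)
    child_process.start()
    child_process.join()
    print("Child process returned")

Note that with "spawn" the child re-imports the main module, so the __main__ guard is required to avoid re-spawning recursively.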

I've found that performing the import at function level avoids the issue.

In other words, this works:

def split(words):
    import nltk
    return nltk.word_tokenize(words)

and this doesn't:

import nltk
def split(words):
    return nltk.word_tokenize(words)

Thanks @mpenkov. Does this resolve the issue?

@stevenbird I don't think so. It's a workaround, but it isn't a fix.

IMHO, if importing a third-party library breaks a Python standard library component, something unholy is happening somewhere, and needs to be fixed.

@mpenkov I'm not entirely sure why this works, but here's another workaround I found: building an opener in the parent process appears to fix it. Modifying @oxymor0n's original code:

from multiprocessing import Process
import nltk
import time
import urllib2

# HACK: build an opener in the parent process before any forking happens;
# this appears to prevent the child crash (reason unknown)
urllib2.build_opener(urllib2.HTTPHandler())

def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"


while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)

@mpenkov @ninowalker, @oxymor0n @sasinda @wenbowang Do you all still face the same issue?

I couldn't replicate the problem on my machine:

from multiprocessing import Process
import nltk
import time

def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"


while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)

gives me:

Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-SG"><head><meta cont
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-SG"><head><meta cont
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-SG"><head><meta cont
Done

I'm on:

  • Python 2.7.12
  • NLTK 3.2.5
  • Ubuntu 16.04.3

@alvations It has been a long time since I found this issue.
I've even forgotten which project had it, so I can't tell you whether I still have the problem or not.
Sorry!

@alvations I too don't remember which of my projects suffered from this specific issue.

I ran your code on my machine and couldn't replicate the problem.

Python 2.7.12
nltk 3.2.1
macOS 10.12.6

@alvations I too am not working on that project anymore, but I used one of those workarounds.
I tried your code, and the child process still exits with a segmentation fault (exit code 11) for me (it exits at the line urllib2.urlopen("https://www.google.com").read()[:100]).

It worked with urllib3 (https://urllib3.readthedocs.io/en/latest/) though; see the sketch after the version list below.

  • nltk (3.2.5)
  • urllib3 (1.22)
  • Mac OSX 10.12.16
  • Python 2.7.13 |Continuum Analytics, Inc.| (default, Dec 20 2016, 23:05:08)
    [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
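
For reference, a minimal sketch of that urllib3 variant (my reconstruction, not the commenter's actual code; PoolManager is urllib3's standard entry point):

from multiprocessing import Process
import nltk  # still imported in the parent, as in the original repro
import urllib3


def child_fn():
    # urllib3 manages its own connection pool rather than going through
    # urllib2's global opener machinery.
    http = urllib3.PoolManager()
    print(http.request("GET", "https://www.google.com").data[:100])


child_process = Process(target=child_fn)
child_process.start()
child_process.join()
print("Child process returned")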

As far as I can tell, this issue seems to affect macOS. Testing with Python 3.6 so far:

  • macOS 10.13 (fails)
  • Centos 7.2 (succeeds)
  • Ubuntu 16.04 (succeeds)

Modified the OP's script for Python 3:

from multiprocessing import Process
import nltk
import time


def child_fn():
    from urllib.request import urlopen
    print("Fetch URL")
    print(urlopen("https://www.google.com").read()[:100])
    print("Done")


child_process = Process(target=child_fn)
child_process.start()
child_process.join()
print("Child process returned")
time.sleep(1)

Output:

Fetch URL
Child process returned

The subprocess quits unexpectedly, producing output similar to what's seen in this Stack Overflow post.

I think this is quite mind-boggling. It might have something to do with thread handling on macOS.

I'm not incredibly familiar with nltk, but I did a little blind poking around to see what caused the test to pass/fail. Here's what I had to do to the package __init__.py in order to make the test pass:


###########################################################
# TOP-LEVEL MODULES
###########################################################

# Import top-level functionality into top-level namespace

from nltk.collocations import *
from nltk.decorators import decorator, memoize
# from nltk.featstruct import *
# from nltk.grammar import *
from nltk.probability import *
from nltk.text import *
# from nltk.tree import *
from nltk.util import *
from nltk.jsontags import *

# ###########################################################
# # PACKAGES
# ###########################################################

# from nltk.chunk import *
# from nltk.classify import *
# from nltk.inference import *
from nltk.metrics import *
# from nltk.parse import *
# from nltk.tag import *
from nltk.tokenize import *
from nltk.translate import *
# from nltk.sem import *
# from nltk.stem import *

# Packages which can be lazily imported
# (a) we don't import *
# (b) they're slow to import or have run-time dependencies
#     that can safely fail at run time

from nltk import lazyimport
app = lazyimport.LazyModule('nltk.app', locals(), globals())
chat = lazyimport.LazyModule('nltk.chat', locals(), globals())
corpus = lazyimport.LazyModule('nltk.corpus', locals(), globals())
draw = lazyimport.LazyModule('nltk.draw', locals(), globals())
toolbox = lazyimport.LazyModule('nltk.toolbox', locals(), globals())

# Optional loading

try:
    import numpy
except ImportError:
    pass
else:
    from nltk import cluster

# from nltk.downloader import download, download_shell
# try:
#     from six.moves import tkinter
# except ImportError:
#     pass
# else:
#     try:
#         from nltk.downloader import download_gui
#     except RuntimeError as e:
#         import warnings
#         warnings.warn("Corpus downloader GUI not loaded "
#                       "(RuntimeError during import: %s)" % str(e))

# explicitly import all top-level modules (ensuring
# they override the same names inadvertently imported
# from a subpackage)

# from nltk import ccg, chunk, classify, collocations
# from nltk import data, featstruct, grammar, help, inference, metrics
# from nltk import misc, parse, probability, sem, stem, wsd
# from nltk import tag, tbl, text, tokenize, translate, tree, treetransforms, util


Interestingly, all of the disabled imports ultimately lead back to importing tkinter, which I think is the root cause. If I replace import nltk with import tkinter in the test script, I get a very similar crash report, with both reports referencing tkinter.
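
For concreteness, the substitution looks like this when applied to the Python 3 repro above (a sketch, assuming macOS; nothing from tkinter is actually used, the import alone is enough):

from multiprocessing import Process
import tkinter  # stands in for `import nltk`; only the import matters


def child_fn():
    from urllib.request import urlopen
    print("Fetch URL")
    print(urlopen("https://www.google.com").read()[:100])
    print("Done")


child_process = Process(target=child_fn)
child_process.start()
child_process.join()
print("Child process returned")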

From what I can tell, these packages directly import tkinter:

  • nltk.app
  • nltk.draw
  • nltk.sem

From the above changes to the main package __init__, these are the problematic imports, and how they trace back to importing tkinter:

  • nltk.featstruct (sem)
  • nltk.grammar (featstruct)
  • nltk.tree (grammar)
  • nltk.chunk (chunk.named_entity > tree)
  • nltk.parse (parse.bllip > tree)
  • nltk.tag (tag.stanford > parse)
  • nltk.classify (classify.senna > tag)
  • nltk.inference (inference.discourse > sem, tag)
  • nltk.stem (stem.snowball > corpus > corpus.reader.timit > tree)

Thanks @rpkilby, that's very helpful!

It looks like this problem: https://stackoverflow.com/questions/16745507/tkinter-how-to-use-threads-to-preventing-main-event-loop-from-freezing

I think tkinter has been a pain point for us for quite some time. Perhaps it would be good if we could find an alternative to it.

I agree. A shorter-term solution would be to bury the tkinter imports inside the classes and methods that need tkinter, so that programs that don't need it never import it. We've already done something similar for numpy.
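
As a rough illustration of that shorter-term idea (the function below is hypothetical, not actual nltk API):

def demo_gui():
    # Deferred import: only code paths that actually open a GUI pay for
    # (and risk) pulling in tkinter; a bare `import nltk` never loads it.
    import tkinter
    root = tkinter.Tk()
    root.title("hypothetical nltk GUI entry point")
    root.mainloop()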
