I'm running into an issue when combining multiprocessing, requests (or urllib2), and nltk. Here is a very simple piece of code:
>>> from multiprocessing import Process
>>> import requests
>>> from pprint import pprint
>>> Process(target=lambda: pprint(
... requests.get('https://api.github.com'))).start()
>>> <Response [200]> # this is the response displayed by the call to `pprint`.
A bit more detail on what this piece of code does:
- Import a few required modules
- Start a child process
- Issue an HTTP GET request to 'api.github.com' from the child process
- Display the result
This is working great. The problem comes when importing nltk:
>>> import nltk
>>> Process(target=lambda: pprint(
... requests.get('https://api.github.com'))).start()
>>> # nothing happens!
After importing NLTK, the call to requests.get actually silently crashes the child process (if you try with a named function instead of the lambda, adding a few print statements before and after the call, you'll see that execution stops right at the call to requests.get).
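For example, something like this debugging version (the fetch function name is mine) makes the silent crash visible:
from multiprocessing import Process
import requests
import nltk   # importing nltk is what triggers the problem

def fetch():
    print "before requests.get"               # this gets printed
    response = requests.get('https://api.github.com')
    print "after requests.get:", response     # this never shows up

Process(target=fetch).start()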
Does anybody have any idea what in NLTK could explain this behavior, and how to overcome the issue?
Here are the versions I'm using:
$> python --version
Python 2.7.5
$> pip freeze | grep nltk
nltk==2.0.5
$> pip freeze | grep requests
requests==2.2.1
I'm running Mac OS X v. 10.9.5.
Thanks!
Updating your Python libraries and Python itself should resolve the problem.
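For example (the exact command depends on your setup, e.g. virtualenv vs. system Python):
$ pip install --upgrade nltk requests
Here are the versions on which it works for me: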
alvas@ubi:~$ pip freeze | grep nltk
nltk==3.0.3
alvas@ubi:~$ pip freeze | grep requests
requests==2.7.0
alvas@ubi:~$ python --version
Python 2.7.6
alvas@ubi:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty
Running this script:
from multiprocessing import Process
import nltk
import time
def child_fn():
    print "Fetch URL"
    import urllib2
    print urllib2.urlopen("https://www.google.com").read()[:100]
    print "Done"

while True:
    child_process = Process(target=child_fn)
    child_process.start()
    child_process.join()
    print "Child process returned"
    time.sleep(1)
[out]:
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
Fetch URL
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content
Done
Child process returned
And from the interactive interpreter:
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing import Process
>>> import requests
>>> from pprint import pprint
>>> Process(target=lambda: pprint(
... requests.get('https://api.github.com'))).start()
>>> <Response [200]>
>>> import nltk
>>> Process(target=lambda: pprint(
... requests.get('https://api.github.com'))).start()
>>> <Response [200]>
It should work with Python 3 too:
alvas@ubi:~$ python3
Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from multiprocessing import Process
>>> import requests
>>> Process(target=lambda: print(requests.get('https://api.github.com'))).start()
>>>
>>> <Response [200]>
>>> import nltk
>>> Process(target=lambda: print(requests.get('https://api.github.com'))).start()
>>> <Response [200]>
It seems that using NLTK and Python Requests together in a child process is an uncommon combination. Try using Thread instead of Process; I was having exactly the same issue with another library together with Requests, and replacing Process with Thread worked for me.
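As a sketch of that substitution, here is the snippet from the question run in a Thread instead of a Process (threads do not fork the interpreter, which appears to be where the nltk import interferes):
from threading import Thread
import requests
from pprint import pprint
import nltk

t = Thread(target=lambda: pprint(requests.get('https://api.github.com')))
t.start()
t.join()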
This issue still does not seem to be solved:
https://github.com/nltk/nltk/issues/947
I think this is a serious issue (unless you are just playing with NLTK, doing POCs and trying out models, rather than building actual apps).
I am running the NLP pipelines in RQ workers (http://python-rq.org/)
nltk==3.2.1
requests==2.9.1
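For context, a minimal sketch of how one of those pipeline steps gets enqueued as an RQ job (the tokenize_text function and the plain Redis() connection are illustrative assumptions, not my actual pipeline):
# tasks.py -- job function executed inside the RQ worker
import nltk

def tokenize_text(text):
    # the NLTK call runs in the worker process
    return nltk.word_tokenize(text)

# enqueueing side
from redis import Redis
from rq import Queue

q = Queue(connection=Redis())
job = q.enqueue(tokenize_text, "Some text to process.")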