Nltk: panlex_lite installation via nltk.download() appears to fail

Created on 22 Mar 2016  ·  32 Comments  ·  Source: nltk/nltk

Platform: Python 3.5 on Mac OS X 10.11.2
Steps to reproduce:

  1. $ python3
  2. >>> import nltk; nltk.download('all', halt_on_error=False)

Symptoms:
Partial console output:
[nltk_data] | Downloading package panlex_lite to
[nltk_data] | /Users/beng/nltk_data...
[nltk_data] | Unzipping corpora/panlex_lite.zip.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 543, in incr_download
for msg in self.incr_download(info.children, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 529, in incr_download
for msg in self._download_list(info_or_id, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 572, in _download_list
for msg in self.incr_download(item, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
for msg in self._download_package(info, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
for msg in _unzip_iter(filepath, zipdir, verbose=False):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
outfile.write(contents)
OSError: [Errno 22] Invalid argument

All 32 comments

@grayben – would you please install the current version of NLTK and report if you still have this issue?

@stevenbird sorry for my delay in replying - you know how uni assignments can be!
I was experiencing the issue on v3.2. I just upgraded to v3.2.1 and am having the same issue.

@grayben How did you install NLTK? Do you get an error when downloading a single corpus, e.g. nltk.download('brown')? Do you get an error when using Python 2.7?

@alvations

  1. I installed NLTK for Python 2 and Python 3 via pip and pip3, respectively.
  2. I do not get the error when downloading a single corpus other than panlex_lite.
  3. The error occurs with both Python 2.7 and Python 3.5.

Additional information: a number of my classmates have reported what appears to be the same problem, though I can't comment on their configurations or exactly what they did to encounter the issue.

@grayben could you run the following lines of code and see whether you get the same [0, 448887900, 85839474] output?

>>> import zipfile
>>> plzip = '/Users/beng/nltk_data/corpora/panlex_lite.zip'
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]

On the command line, outside Python, what is the output of the following?

$ ls -lah /Users/beng//nltk_data/corpora/
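For anyone following along, the CRC check above can be wrapped so that a file that is not a readable zip archive is reported instead of raising. This is only a minimal sketch using the standard zipfile module; the function name describe_zip is made up for illustration and is not part of NLTK.

```python
import zipfile


def describe_zip(zip_path):
    """Return (name, uncompressed size, CRC) for each member, or None if unreadable.

    describe_zip is a hypothetical helper name, not an NLTK function.
    """
    if not zipfile.is_zipfile(zip_path):
        # A missing, truncated, or otherwise damaged file may end up here.
        return None
    with zipfile.ZipFile(zip_path) as archive:
        return [(info.filename, info.file_size, info.CRC)
                for info in archive.infolist()]
```

A None result corresponds to the "File is not a zip file" situation; a list means the archive directory could at least be read.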

Your code -> my output:

>>> import zipfile
>>> plzip = ' /Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1009, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: ' /Users/beng//nltk_data/corpora/panlex_lite.zip'

I then changed ' /Users/beng//nltk_data/corpora/panlex_lite.zip' to '/Users/beng//nltk_data/corpora/panlex_lite.zip' (no space before root):

>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1026, in __init__
    self._RealGetContents()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1093, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Bens-MacBook-Pro:10K-Extractor beng$ ls -lah /Users/beng//nltk_data/corpora/
total 966608
drwxr-xr-x   152 beng  staff   5.0K 19 Apr 16:26 .
drwxr-xr-x    11 beng  staff   374B  3 Mar 14:41 ..
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:32 abc
-rw-r--r--     1 beng  staff   1.4M  3 Mar 14:32 abc.zip
drwxr-xr-x     4 beng  staff   136B  3 Mar 14:32 alpino
-rw-r--r--     1 beng  staff   2.7M  3 Mar 14:32 alpino.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:32 biocreative_ppi
-rw-r--r--     1 beng  staff   218K  3 Mar 14:32 biocreative_ppi.zip
drwxr-xr-x   505 beng  staff    17K  3 Mar 14:32 brown
-rw-r--r--     1 beng  staff   3.2M  3 Mar 14:32 brown.zip
drwxr-xr-x   509 beng  staff    17K  3 Mar 14:32 brown_tei
-rw-r--r--     1 beng  staff   8.3M  3 Mar 14:32 brown_tei.zip
drwxr-xr-x  1389 beng  staff    46K  3 Mar 14:33 cess_cat
-rw-r--r--     1 beng  staff   5.1M  3 Mar 14:33 cess_cat.zip
drwxr-xr-x   612 beng  staff    20K  3 Mar 14:33 cess_esp
-rw-r--r--     1 beng  staff   2.1M  3 Mar 14:33 cess_esp.zip
drwxr-xr-x    10 beng  staff   340B  3 Mar 14:33 chat80
-rw-r--r--     1 beng  staff    19K  3 Mar 14:33 chat80.zip
drwxr-xr-x     3 beng  staff   102B  3 Mar 14:33 city_database
-rw-r--r--     1 beng  staff   1.7K  3 Mar 14:33 city_database.zip
drwxr-xr-x     4 beng  staff   136B  3 Mar 14:33 cmudict
-rw-r--r--     1 beng  staff   875K  3 Mar 14:33 cmudict.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:33 comparative_sentences
-rw-r--r--     1 beng  staff   273K  3 Mar 14:33 comparative_sentences.zip
-rw-r--r--     1 beng  staff    11M  3 Mar 14:33 comtrans.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:33 conll2000
-rw-r--r--     1 beng  staff   739K  3 Mar 14:33 conll2000.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:33 conll2002
-rw-r--r--     1 beng  staff   1.8M  3 Mar 14:33 conll2002.zip
-rw-r--r--     1 beng  staff   1.2M  3 Mar 14:33 conll2007.zip
drwxr-xr-x   453 beng  staff    15K  3 Mar 14:33 crubadan
-rw-r--r--     1 beng  staff   5.0M  3 Mar 14:33 crubadan.zip
drwxr-xr-x   201 beng  staff   6.7K  3 Mar 14:33 dependency_treebank
-rw-r--r--     1 beng  staff   447K  3 Mar 14:33 dependency_treebank.zip
drwxr-xr-x    14 beng  staff   476B  3 Mar 14:33 europarl_raw
-rw-r--r--     1 beng  staff    12M  3 Mar 14:33 europarl_raw.zip
drwxr-xr-x     4 beng  staff   136B  3 Mar 14:33 floresta
-rw-r--r--     1 beng  staff   1.8M  3 Mar 14:33 floresta.zip
drwxr-xr-x    16 beng  staff   544B  3 Mar 14:34 framenet_v15
-rw-r--r--     1 beng  staff    66M  3 Mar 14:33 framenet_v15.zip
drwxr-xr-x    11 beng  staff   374B  3 Mar 14:34 gazetteers
-rw-r--r--     1 beng  staff   8.1K  3 Mar 14:34 gazetteers.zip
drwxr-xr-x    11 beng  staff   374B  3 Mar 14:34 genesis
-rw-r--r--     1 beng  staff   462K  3 Mar 14:34 genesis.zip
drwxr-xr-x    21 beng  staff   714B  3 Mar 14:34 gutenberg
-rw-r--r--     1 beng  staff   4.1M  3 Mar 14:34 gutenberg.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:34 ieer
-rw-r--r--     1 beng  staff   162K  3 Mar 14:34 ieer.zip
drwxr-xr-x    59 beng  staff   2.0K  3 Mar 14:34 inaugural
-rw-r--r--     1 beng  staff   314K  3 Mar 14:34 inaugural.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:34 indian
-rw-r--r--     1 beng  staff   195K  3 Mar 14:34 indian.zip
-rw-r--r--     1 beng  staff    16M  3 Mar 14:34 jeita.zip
drwxr-xr-x    22 beng  staff   748B  3 Mar 14:34 kimmo
-rw-r--r--     1 beng  staff   183K  3 Mar 14:34 kimmo.zip
-rw-r--r--     1 beng  staff   8.4M  3 Mar 14:34 knbc.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:34 lin_thesaurus
-rw-r--r--     1 beng  staff    85M  3 Mar 14:34 lin_thesaurus.zip
drwxr-xr-x   112 beng  staff   3.7K  3 Mar 14:34 mac_morpho
-rw-r--r--     1 beng  staff   2.9M  3 Mar 14:34 mac_morpho.zip
-rw-r--r--     1 beng  staff   5.9M  3 Mar 14:34 machado.zip
-rw-r--r--     1 beng  staff   1.5M  3 Mar 14:34 masc_tagged.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:34 movie_reviews
-rw-r--r--     1 beng  staff   3.8M  3 Mar 14:34 movie_reviews.zip
drwxr-xr-x    56 beng  staff   1.9K  3 Mar 14:38 mte_teip5
-rw-r--r--     1 beng  staff    14M  3 Mar 14:38 mte_teip5.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:34 names
-rw-r--r--     1 beng  staff    21K  3 Mar 14:34 names.zip
-rw-r--r--     1 beng  staff   6.4M  3 Mar 14:35 nombank.1.0.zip
drwxr-xr-x    19 beng  staff   646B  3 Mar 14:35 nps_chat
-rw-r--r--     1 beng  staff   294K  3 Mar 14:35 nps_chat.zip
drwxr-xr-x    32 beng  staff   1.1K  3 Mar 14:35 omw
-rw-r--r--     1 beng  staff    11M  3 Mar 14:35 omw.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 opinion_lexicon
-rw-r--r--     1 beng  staff    24K  3 Mar 14:35 opinion_lexicon.zip
drwxr-xr-x     4 beng  staff   136B 21 Mar 17:54 panlex_lite
-rw-r--r--     1 beng  staff    58M 19 Apr 16:28 panlex_lite.zip
-rw-r--r--     1 beng  staff   2.6M  3 Mar 14:37 panlex_swadesh.zip
drwxr-xr-x    21 beng  staff   714B  3 Mar 14:35 paradigms
-rw-r--r--     1 beng  staff    24K  3 Mar 14:35 paradigms.zip
drwxr-xr-x   475 beng  staff    16K  3 Mar 14:35 pil
-rw-r--r--     1 beng  staff   1.4M  3 Mar 14:35 pil.zip
drwxr-xr-x    16 beng  staff   544B  3 Mar 14:35 pl196x
-rw-r--r--     1 beng  staff   6.7M  3 Mar 14:35 pl196x.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:35 ppattach
-rw-r--r--     1 beng  staff   763K  3 Mar 14:35 ppattach.zip
drwxr-xr-x     8 beng  staff   272B  3 Mar 14:35 problem_reports
-rw-r--r--     1 beng  staff   1.0M  3 Mar 14:35 problem_reports.zip
drwxr-xr-x     8 beng  staff   272B  3 Mar 14:35 product_reviews_1
-rw-r--r--     1 beng  staff   138K  3 Mar 14:35 product_reviews_1.zip
drwxr-xr-x    12 beng  staff   408B  3 Mar 14:35 product_reviews_2
-rw-r--r--     1 beng  staff   167K  3 Mar 14:35 product_reviews_2.zip
-rw-r--r--     1 beng  staff   5.1M  3 Mar 14:35 propbank.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 pros_cons
-rw-r--r--     1 beng  staff   729K  3 Mar 14:35 pros_cons.zip
drwxr-xr-x     3 beng  staff   102B  3 Mar 14:35 ptb
-rw-r--r--     1 beng  staff   6.1K  3 Mar 14:35 ptb.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 qc
-rw-r--r--     1 beng  staff   123K  3 Mar 14:35 qc.zip
-rw-r--r--     1 beng  staff   6.1M  3 Mar 14:35 reuters.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:35 rte
-rw-r--r--     1 beng  staff   377K  3 Mar 14:35 rte.zip
-rw-r--r--     1 beng  staff   4.2M  3 Mar 14:35 semcor.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:35 senseval
-rw-r--r--     1 beng  staff   2.1M  3 Mar 14:35 senseval.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 sentence_polarity
-rw-r--r--     1 beng  staff   479K  3 Mar 14:35 sentence_polarity.zip
drwxr-xr-x     4 beng  staff   136B  3 Mar 14:35 sentiwordnet
-rw-r--r--     1 beng  staff   4.5M  3 Mar 14:35 sentiwordnet.zip
drwxr-xr-x    13 beng  staff   442B  3 Mar 14:35 shakespeare
-rw-r--r--     1 beng  staff   464K  3 Mar 14:35 shakespeare.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 sinica_treebank
-rw-r--r--     1 beng  staff   878K  3 Mar 14:35 sinica_treebank.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:35 smultron
-rw-r--r--     1 beng  staff   162K  3 Mar 14:35 smultron.zip
drwxr-xr-x    68 beng  staff   2.3K  3 Mar 14:35 state_union
-rw-r--r--     1 beng  staff   790K  3 Mar 14:35 state_union.zip
drwxr-xr-x    17 beng  staff   578B  3 Mar 14:35 stopwords
-rw-r--r--     1 beng  staff   8.9K  3 Mar 14:35 stopwords.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 subjectivity
-rw-r--r--     1 beng  staff   509K  3 Mar 14:35 subjectivity.zip
drwxr-xr-x    27 beng  staff   918B  3 Mar 14:35 swadesh
-rw-r--r--     1 beng  staff    22K  3 Mar 14:35 swadesh.zip
drwxr-xr-x     8 beng  staff   272B  3 Mar 14:35 switchboard
-rw-r--r--     1 beng  staff   773K  3 Mar 14:35 switchboard.zip
drwxr-xr-x    39 beng  staff   1.3K  3 Mar 14:35 timit
-rw-r--r--     1 beng  staff    21M  3 Mar 14:35 timit.zip
drwxr-xr-x     8 beng  staff   272B  3 Mar 14:35 toolbox
-rw-r--r--     1 beng  staff   245K  3 Mar 14:35 toolbox.zip
drwxr-xr-x    12 beng  staff   408B  3 Mar 14:36 treebank
-rw-r--r--     1 beng  staff   1.6M  3 Mar 14:36 treebank.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:36 twitter_samples
-rw-r--r--     1 beng  staff    15M  3 Mar 14:36 twitter_samples.zip
drwxr-xr-x   337 beng  staff    11K  3 Mar 14:36 udhr
-rw-r--r--     1 beng  staff   1.1M  3 Mar 14:36 udhr.zip
drwxr-xr-x   390 beng  staff    13K  3 Mar 14:36 udhr2
-rw-r--r--     1 beng  staff   1.6M  3 Mar 14:36 udhr2.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:36 unicode_samples
-rw-r--r--     1 beng  staff   1.2K  3 Mar 14:36 unicode_samples.zip
-rw-r--r--     1 beng  staff    25M  3 Mar 14:36 universal_treebanks_v20.zip
drwxr-xr-x   242 beng  staff   8.0K  3 Mar 14:36 verbnet
-rw-r--r--     1 beng  staff   316K  3 Mar 14:36 verbnet.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:36 webtext
-rw-r--r--     1 beng  staff   631K  3 Mar 14:36 webtext.zip
drwxr-xr-x    20 beng  staff   680B  3 Mar 14:36 wordnet
-rw-r--r--     1 beng  staff    10M  3 Mar 14:36 wordnet.zip
drwxr-xr-x    30 beng  staff   1.0K  3 Mar 14:36 wordnet_ic
-rw-r--r--     1 beng  staff    11M  3 Mar 14:36 wordnet_ic.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:36 words
-rw-r--r--     1 beng  staff   740K  3 Mar 14:36 words.zip
drwxr-xr-x     3 beng  staff   102B  3 Mar 14:36 ycoe
-rw-r--r--     1 beng  staff   477B  3 Mar 14:36 ycoe.zip

This suggests that the file was corrupted during download (possibly because of a broken internet connection):

>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1026, in __init__
    self._RealGetContents()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1093, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Go to /Users/beng//nltk_data/corpora/, delete the panlex_lite.zip file, and download it again. Note that the download might take two hours or more when the server is overloaded or your internet connection is slow.
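Before deleting and downloading again, it may help to check whether the file on disk is complete. The sketch below compares a file's size and MD5 digest with expected values. It assumes that the size and checksum attributes in the NLTK data index are a byte count and an MD5 hex digest; that is an assumption here, and the helper names file_md5 and matches_index are invented for illustration.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so a multi-gigabyte file never sits in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_index(path, expected_size, expected_md5):
    """True if the file has the expected byte size and MD5 digest."""
    return (os.path.getsize(path) == expected_size
            and file_md5(path) == expected_md5)
```

With the index entry quoted later in this thread, the call would look like matches_index(plzip, 2202492316, "e13211688738201c0a5bd5b2f50e94ab"); the index may have changed since then, so treat those values as an example only.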

I did the following (three times):

  1. rm /Users/beng//nltk_data/corpora/panlex_lite.zip
  2. python3
  3. The following Python commands:
>>> import nltk
>>> nltk.download('panlex_lite')
[nltk_data] Downloading package panlex_lite to
[nltk_data]     /Users/beng/nltk_data...
[nltk_data]   Unzipping corpora/panlex_lite.zip.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
    for msg in self._download_package(info, download_dir, force):
  File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
    for msg in _unzip_iter(filepath, zipdir, verbose=False):
  File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
    outfile.write(contents)
OSError: [Errno 22] Invalid argument
>>> 

However, please also note the following command input/output:

>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> import zipfile
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]

Can you also do rm -rf /Users/beng//nltk_data/corpora/panlex_lite before running the python3?

i.e.:

$ rm /Users/beng//nltk_data/corpora/panlex_lite.zip
$ rm -rf /Users/beng//nltk_data/corpora/panlex_lite
$ python -m nltk.downloader panlex_lite
$ python3
>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> import zipfile
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]

I couldn't reproduce your OSError on Ubuntu 14.04 Python 3.5.1:

alvas@ubi:~/nltk_data/corpora$ ls panlex_
panlex_lite.zip     panlex_swadesh.zip  
alvas@ubi:~/nltk_data/corpora$ cd
alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> nltk.download('panlex_lite')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'nltk' is not defined
>>> import nltk
>>> nltk.download('panlex_lite')
[nltk_data] Downloading package panlex_lite to
[nltk_data]     /home/alvas/nltk_data...
[nltk_data]   Package panlex_lite is already up-to-date!
True
>>> exit()
alvas@ubi:~$ python3
Python 3.5.1 (default, Dec 18 2015, 00:00:00) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('panlex_lite')
[nltk_data] Downloading package panlex_lite to
[nltk_data]     /home/alvas/nltk_data...
[nltk_data]   Package panlex_lite is already up-to-date!
True

BTW, if you're not going to use panlex, the rest of NLTK will work just fine without it =)

Bens-MacBook-Pro:work beng$ rm -rf /Users/beng//nltk_data/corpora/panlex_lite
Bens-MacBook-Pro:work beng$ ls -lah /Users/beng//nltk_data/corpora
total 4361152
drwxr-xr-x   151 beng  staff   5.0K 20 Apr 13:12 .
drwxr-xr-x    11 beng  staff   374B  3 Mar 14:41 ..
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:32 abc
-rw-r--r--     1 beng  staff   1.4M  3 Mar 14:32 abc.zip
drwxr-xr-x     4 beng  staff   136B  3 Mar 14:32 alpino
-rw-r--r--     1 beng  staff   2.7M  3 Mar 14:32 alpino.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:32 biocreative_ppi
-rw-r--r--     1 beng  staff   218K  3 Mar 14:32 biocreative_ppi.zip
drwxr-xr-x   505 beng  staff    17K  3 Mar 14:32 brown
-rw-r--r--     1 beng  staff   3.2M  3 Mar 14:32 brown.zip
drwxr-xr-x   509 beng  staff    17K  3 Mar 14:32 brown_tei
-rw-r--r--     1 beng  staff   8.3M  3 Mar 14:32 brown_tei.zip
drwxr-xr-x  1389 beng  staff    46K  3 Mar 14:33 cess_cat
-rw-r--r--     1 beng  staff   5.1M  3 Mar 14:33 cess_cat.zip
drwxr-xr-x   612 beng  staff    20K  3 Mar 14:33 cess_esp
-rw-r--r--     1 beng  staff   2.1M  3 Mar 14:33 cess_esp.zip
drwxr-xr-x    10 beng  staff   340B  3 Mar 14:33 chat80
-rw-r--r--     1 beng  staff    19K  3 Mar 14:33 chat80.zip
drwxr-xr-x     3 beng  staff   102B  3 Mar 14:33 city_database
-rw-r--r--     1 beng  staff   1.7K  3 Mar 14:33 city_database.zip
drwxr-xr-x     4 beng  staff   136B  3 Mar 14:33 cmudict
-rw-r--r--     1 beng  staff   875K  3 Mar 14:33 cmudict.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:33 comparative_sentences
-rw-r--r--     1 beng  staff   273K  3 Mar 14:33 comparative_sentences.zip
-rw-r--r--     1 beng  staff    11M  3 Mar 14:33 comtrans.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:33 conll2000
-rw-r--r--     1 beng  staff   739K  3 Mar 14:33 conll2000.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:33 conll2002
-rw-r--r--     1 beng  staff   1.8M  3 Mar 14:33 conll2002.zip
-rw-r--r--     1 beng  staff   1.2M  3 Mar 14:33 conll2007.zip
drwxr-xr-x   453 beng  staff    15K  3 Mar 14:33 crubadan
-rw-r--r--     1 beng  staff   5.0M  3 Mar 14:33 crubadan.zip
drwxr-xr-x   201 beng  staff   6.7K  3 Mar 14:33 dependency_treebank
-rw-r--r--     1 beng  staff   447K  3 Mar 14:33 dependency_treebank.zip
drwxr-xr-x    14 beng  staff   476B  3 Mar 14:33 europarl_raw
-rw-r--r--     1 beng  staff    12M  3 Mar 14:33 europarl_raw.zip
drwxr-xr-x     4 beng  staff   136B  3 Mar 14:33 floresta
-rw-r--r--     1 beng  staff   1.8M  3 Mar 14:33 floresta.zip
drwxr-xr-x    16 beng  staff   544B  3 Mar 14:34 framenet_v15
-rw-r--r--     1 beng  staff    66M  3 Mar 14:33 framenet_v15.zip
drwxr-xr-x    11 beng  staff   374B  3 Mar 14:34 gazetteers
-rw-r--r--     1 beng  staff   8.1K  3 Mar 14:34 gazetteers.zip
drwxr-xr-x    11 beng  staff   374B  3 Mar 14:34 genesis
-rw-r--r--     1 beng  staff   462K  3 Mar 14:34 genesis.zip
drwxr-xr-x    21 beng  staff   714B  3 Mar 14:34 gutenberg
-rw-r--r--     1 beng  staff   4.1M  3 Mar 14:34 gutenberg.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:34 ieer
-rw-r--r--     1 beng  staff   162K  3 Mar 14:34 ieer.zip
drwxr-xr-x    59 beng  staff   2.0K  3 Mar 14:34 inaugural
-rw-r--r--     1 beng  staff   314K  3 Mar 14:34 inaugural.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:34 indian
-rw-r--r--     1 beng  staff   195K  3 Mar 14:34 indian.zip
-rw-r--r--     1 beng  staff    16M  3 Mar 14:34 jeita.zip
drwxr-xr-x    22 beng  staff   748B  3 Mar 14:34 kimmo
-rw-r--r--     1 beng  staff   183K  3 Mar 14:34 kimmo.zip
-rw-r--r--     1 beng  staff   8.4M  3 Mar 14:34 knbc.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:34 lin_thesaurus
-rw-r--r--     1 beng  staff    85M  3 Mar 14:34 lin_thesaurus.zip
drwxr-xr-x   112 beng  staff   3.7K  3 Mar 14:34 mac_morpho
-rw-r--r--     1 beng  staff   2.9M  3 Mar 14:34 mac_morpho.zip
-rw-r--r--     1 beng  staff   5.9M  3 Mar 14:34 machado.zip
-rw-r--r--     1 beng  staff   1.5M  3 Mar 14:34 masc_tagged.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:34 movie_reviews
-rw-r--r--     1 beng  staff   3.8M  3 Mar 14:34 movie_reviews.zip
drwxr-xr-x    56 beng  staff   1.9K  3 Mar 14:38 mte_teip5
-rw-r--r--     1 beng  staff    14M  3 Mar 14:38 mte_teip5.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:34 names
-rw-r--r--     1 beng  staff    21K  3 Mar 14:34 names.zip
-rw-r--r--     1 beng  staff   6.4M  3 Mar 14:35 nombank.1.0.zip
drwxr-xr-x    19 beng  staff   646B  3 Mar 14:35 nps_chat
-rw-r--r--     1 beng  staff   294K  3 Mar 14:35 nps_chat.zip
drwxr-xr-x    32 beng  staff   1.1K  3 Mar 14:35 omw
-rw-r--r--     1 beng  staff    11M  3 Mar 14:35 omw.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 opinion_lexicon
-rw-r--r--     1 beng  staff    24K  3 Mar 14:35 opinion_lexicon.zip
-rw-r--r--     1 beng  staff   1.7G 20 Apr 12:46 panlex_lite.zip
-rw-r--r--     1 beng  staff   2.6M  3 Mar 14:37 panlex_swadesh.zip
drwxr-xr-x    21 beng  staff   714B  3 Mar 14:35 paradigms
-rw-r--r--     1 beng  staff    24K  3 Mar 14:35 paradigms.zip
drwxr-xr-x   475 beng  staff    16K  3 Mar 14:35 pil
-rw-r--r--     1 beng  staff   1.4M  3 Mar 14:35 pil.zip
drwxr-xr-x    16 beng  staff   544B  3 Mar 14:35 pl196x
-rw-r--r--     1 beng  staff   6.7M  3 Mar 14:35 pl196x.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:35 ppattach
-rw-r--r--     1 beng  staff   763K  3 Mar 14:35 ppattach.zip
drwxr-xr-x     8 beng  staff   272B  3 Mar 14:35 problem_reports
-rw-r--r--     1 beng  staff   1.0M  3 Mar 14:35 problem_reports.zip
drwxr-xr-x     8 beng  staff   272B  3 Mar 14:35 product_reviews_1
-rw-r--r--     1 beng  staff   138K  3 Mar 14:35 product_reviews_1.zip
drwxr-xr-x    12 beng  staff   408B  3 Mar 14:35 product_reviews_2
-rw-r--r--     1 beng  staff   167K  3 Mar 14:35 product_reviews_2.zip
-rw-r--r--     1 beng  staff   5.1M  3 Mar 14:35 propbank.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 pros_cons
-rw-r--r--     1 beng  staff   729K  3 Mar 14:35 pros_cons.zip
drwxr-xr-x     3 beng  staff   102B  3 Mar 14:35 ptb
-rw-r--r--     1 beng  staff   6.1K  3 Mar 14:35 ptb.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 qc
-rw-r--r--     1 beng  staff   123K  3 Mar 14:35 qc.zip
-rw-r--r--     1 beng  staff   6.1M  3 Mar 14:35 reuters.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:35 rte
-rw-r--r--     1 beng  staff   377K  3 Mar 14:35 rte.zip
-rw-r--r--     1 beng  staff   4.2M  3 Mar 14:35 semcor.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:35 senseval
-rw-r--r--     1 beng  staff   2.1M  3 Mar 14:35 senseval.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 sentence_polarity
-rw-r--r--     1 beng  staff   479K  3 Mar 14:35 sentence_polarity.zip
drwxr-xr-x     4 beng  staff   136B  3 Mar 14:35 sentiwordnet
-rw-r--r--     1 beng  staff   4.5M  3 Mar 14:35 sentiwordnet.zip
drwxr-xr-x    13 beng  staff   442B  3 Mar 14:35 shakespeare
-rw-r--r--     1 beng  staff   464K  3 Mar 14:35 shakespeare.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 sinica_treebank
-rw-r--r--     1 beng  staff   878K  3 Mar 14:35 sinica_treebank.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:35 smultron
-rw-r--r--     1 beng  staff   162K  3 Mar 14:35 smultron.zip
drwxr-xr-x    68 beng  staff   2.3K  3 Mar 14:35 state_union
-rw-r--r--     1 beng  staff   790K  3 Mar 14:35 state_union.zip
drwxr-xr-x    17 beng  staff   578B  3 Mar 14:35 stopwords
-rw-r--r--     1 beng  staff   8.9K  3 Mar 14:35 stopwords.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:35 subjectivity
-rw-r--r--     1 beng  staff   509K  3 Mar 14:35 subjectivity.zip
drwxr-xr-x    27 beng  staff   918B  3 Mar 14:35 swadesh
-rw-r--r--     1 beng  staff    22K  3 Mar 14:35 swadesh.zip
drwxr-xr-x     8 beng  staff   272B  3 Mar 14:35 switchboard
-rw-r--r--     1 beng  staff   773K  3 Mar 14:35 switchboard.zip
drwxr-xr-x    39 beng  staff   1.3K  3 Mar 14:35 timit
-rw-r--r--     1 beng  staff    21M  3 Mar 14:35 timit.zip
drwxr-xr-x     8 beng  staff   272B  3 Mar 14:35 toolbox
-rw-r--r--     1 beng  staff   245K  3 Mar 14:35 toolbox.zip
drwxr-xr-x    12 beng  staff   408B  3 Mar 14:36 treebank
-rw-r--r--     1 beng  staff   1.6M  3 Mar 14:36 treebank.zip
drwxr-xr-x     7 beng  staff   238B  3 Mar 14:36 twitter_samples
-rw-r--r--     1 beng  staff    15M  3 Mar 14:36 twitter_samples.zip
drwxr-xr-x   337 beng  staff    11K  3 Mar 14:36 udhr
-rw-r--r--     1 beng  staff   1.1M  3 Mar 14:36 udhr.zip
drwxr-xr-x   390 beng  staff    13K  3 Mar 14:36 udhr2
-rw-r--r--     1 beng  staff   1.6M  3 Mar 14:36 udhr2.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:36 unicode_samples
-rw-r--r--     1 beng  staff   1.2K  3 Mar 14:36 unicode_samples.zip
-rw-r--r--     1 beng  staff    25M  3 Mar 14:36 universal_treebanks_v20.zip
drwxr-xr-x   242 beng  staff   8.0K  3 Mar 14:36 verbnet
-rw-r--r--     1 beng  staff   316K  3 Mar 14:36 verbnet.zip
drwxr-xr-x     9 beng  staff   306B  3 Mar 14:36 webtext
-rw-r--r--     1 beng  staff   631K  3 Mar 14:36 webtext.zip
drwxr-xr-x    20 beng  staff   680B  3 Mar 14:36 wordnet
-rw-r--r--     1 beng  staff    10M  3 Mar 14:36 wordnet.zip
drwxr-xr-x    30 beng  staff   1.0K  3 Mar 14:36 wordnet_ic
-rw-r--r--     1 beng  staff    11M  3 Mar 14:36 wordnet_ic.zip
drwxr-xr-x     5 beng  staff   170B  3 Mar 14:36 words
-rw-r--r--     1 beng  staff   740K  3 Mar 14:36 words.zip
drwxr-xr-x     3 beng  staff   102B  3 Mar 14:36 ycoe
-rw-r--r--     1 beng  staff   477B  3 Mar 14:36 ycoe.zip
Bens-MacBook-Pro:work beng$ python3
Python 3.5.1 (default, Mar  3 2016, 14:25:53) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('panlex_lite')
[nltk_data] Downloading package panlex_lite to
[nltk_data]     /Users/beng/nltk_data...
[nltk_data]   Package panlex_lite is already up-to-date!
True

Furthermore, through the downloader GUI, downloading "all" finally succeeds, with all fields marked "installed".

Great! So there's no OSError now? It's the broken panlex_lite directory (from previous downloads) lingering around that caused the OSError. Once the infolist of the zipfile is right, there shouldn't be a problem.

Enjoy playing around with NLTK! Tell your friends/classmates to do the same:

$ rm /Users/beng//nltk_data/corpora/panlex_lite.zip
$ rm -rf /Users/beng//nltk_data/corpora/panlex_lite
$ python -m nltk.downloader panlex_lite
$ python3
>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> import zipfile
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]
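For those who prefer to do the cleanup from Python, the two rm commands above can be sketched as follows. The function name remove_stale_package is hypothetical, and the corpora directory has to be passed in explicitly.

```python
import shutil
from pathlib import Path


def remove_stale_package(corpora_dir, package_id):
    """Delete <package_id>.zip and the <package_id>/ directory if they exist.

    Returns the list of paths that were removed.
    """
    corpora_dir = Path(corpora_dir)
    removed = []
    zip_path = corpora_dir / (package_id + ".zip")
    if zip_path.is_file():
        zip_path.unlink()
        removed.append(zip_path)
    unpacked_dir = corpora_dir / package_id
    if unpacked_dir.is_dir():
        # Equivalent of rm -rf on the partially unpacked directory.
        shutil.rmtree(unpacked_dir)
        removed.append(unpacked_dir)
    return removed
```

For the paths in this thread, the call would be remove_stale_package('/Users/beng/nltk_data/corpora', 'panlex_lite'), followed by the download as before.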

Thanks!

I get the exact same problem with the latest NLTK 3.2.1, both on Ubuntu 16.04 (where it crashes my whole OS) and on OS X (where I get the same errors as the OP). I'm surprised that this issue has been closed as if nothing were wrong.

When I try the workaround, it fails at this step, because the downloader tries to extract the archive automatically right after downloading it: python -m nltk.downloader panlex_lite

[nltk_data] Downloading package panlex_lite to
[nltk_data]     /Users/houmie/nltk_data...
[nltk_data]   Unzipping corpora/panlex_lite.zip.
Traceback (most recent call last):
  File "/Users/houmie/.pyenv/versions/3.5.1/lib/python3.5/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/houmie/.pyenv/versions/3.5.1/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 2268, in <module>
    halt_on_error=options.halt_on_error)
  File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
    for msg in self.incr_download(info_or_id, download_dir, force):
  File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
    for msg in self._download_package(info, download_dir, force):
  File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
    for msg in _unzip_iter(filepath, zipdir, verbose=False):
  File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
    outfile.write(contents)
OSError: [Errno 22] Invalid argument

Thanks

@houmie what is your output for:

$ rm /Users/houmie//nltk_data/corpora/panlex_lite.zip
$ rm -rf /Users/houmie//nltk_data/corpora/panlex_lite
$ python -m nltk.downloader panlex_lite
$ python3
>>> plzip = '/Users/houmie//nltk_data/corpora/panlex_lite.zip'
>>> import zipfile
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]

This is not fixed - it's happening for python 2.7, 3.4.3, and 3.5.1. The panlex_lite download hangs for quite a while, and then unzipping freezes the GUI and/or causes the OSError.
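The tracebacks in this thread all end at outfile.write(contents) inside _unzip_iter, and the index entry quoted further down lists an unzipped size of 5778483185 bytes. One hypothesis, not confirmed anywhere in this thread, is that writing a very large member in a single call is what fails on some platforms. As a way to test that idea by hand, here is a sketch that extracts an archive with bounded-size writes using only the standard library. It is not NLTK's code; the name extract_in_chunks and the 16 MiB default are arbitrary choices for illustration.

```python
import shutil
import zipfile
from pathlib import Path


def extract_in_chunks(zip_path, target_dir, chunk_size=16 * 1024 * 1024):
    """Extract all members of zip_path into target_dir using bounded-size writes."""
    root = Path(target_dir).resolve()
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            destination = (root / info.filename).resolve()
            # Refuse members that would land outside the target directory.
            if destination != root and root not in destination.parents:
                raise ValueError("unsafe member path: %r" % info.filename)
            if info.filename.endswith("/"):
                # Directory entry: create it and move on.
                destination.mkdir(parents=True, exist_ok=True)
                continue
            destination.parent.mkdir(parents=True, exist_ok=True)
            # copyfileobj reads and writes at most chunk_size bytes per step.
            with archive.open(info) as source, open(destination, "wb") as outfile:
                shutil.copyfileobj(source, outfile, chunk_size)
            written.append(destination)
    return written
```

If this succeeds where the downloader's unzip step fails, that would point at the size of the single write rather than at a corrupted download.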

I hit the same issue on my MacBook Pro (OS X El Capitan, Anaconda 1.4.0 + Python 3.5.2). I tried NLTK 3.2.1 installed with "conda install nltk" and the GitHub master branch installed with "sudo python3 setup.py install". The interesting part is that I never got the CRC list [0, 448887900, 85839474]; I always got [0, 448887900, 84607019], even after downloading panlex_lite.zip more than 5 times. Any hint or clue?
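A CRC list that differs from someone else's does not by itself show that the download is damaged; the archive on the server may simply have been rebuilt (an assumption, nobody in this thread confirms it). A sketch of two checks that can tell these cases apart: compare_crcs reports which members differ from a reference list, and first_corrupt_member uses ZipFile.testzip() to verify each member's data against the CRC stored in the archive itself. Both function names are invented for illustration.

```python
import zipfile


def compare_crcs(zip_path, reference_crcs):
    """Return (index, name, actual CRC, expected CRC) for members that differ."""
    with zipfile.ZipFile(zip_path) as archive:
        infos = archive.infolist()
    differences = []
    for index, (info, expected) in enumerate(zip(infos, reference_crcs)):
        if info.CRC != expected:
            differences.append((index, info.filename, info.CRC, expected))
    return differences


def first_corrupt_member(zip_path):
    """Return the name of the first member whose data fails its CRC, else None."""
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() reads every member, so this can take a while on a large archive.
        return archive.testzip()
```

If first_corrupt_member returns None, the archive is internally consistent, and a differing CRC list more likely reflects a different version of the file than a broken download.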

Unfortunately, they refuse to acknowledge that the problem even exists. I reported this in May 2016 and there is still no acknowledgement of the problem.

I just tried it again via the GUI download and still get this error message shown in the console:

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 655, in download
    self._interactive_download()
  File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 974, in _interactive_download
    DownloaderGUI(self).mainloop()
  File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 1709, in mainloop
    self.top.mainloop(*args, **kwargs)
  File "/Users/houmie/.pyenv/versions/3.5.1/lib/python3.5/tkinter/__init__.py", line 1131, in mainloop
    self.tk.mainloop(n)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
>>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte

This is a massive pain for me, as I need to go through the code and delete all the references to PanLex in order to get the packages working.

Hi, same here, hopefully if enough people report it then it's going to get fixed at some point ...

okay there, here's what I've done

d = nltk.downloader.Downloader()
d._packages.pop('panlex_lite')
d.download()

# error message
d._packages.pop('panlex_lite')
/usr/local/lib/python3.5/site-packages/nltk/downloader.py in info(self, id)
    876         if id in self._packages: return self._packages[id]
    877         if id in self._collections: return self._collections[id]
--> 878         raise ValueError('Package %r not found in index' % id)
    879
    880     def xmlinfo(self, id):

I guess, we could add something like if id != 'panlex_lite' to the code...

But, as for me, the easiest way looks like this:

Aaaaaaand.... Done downloading collection all! 🎉🎉🎉🎉
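The "skip panlex_lite" idea from this comment could be sketched roughly as below, without editing downloader.py. It assumes the object passed in behaves like nltk.downloader.Downloader, i.e. packages() yields entries with an id attribute and download(id) returns a true value on success. download_all_except is a hypothetical helper name, not part of NLTK.

```python
def download_all_except(downloader, skip_ids=("panlex_lite",)):
    """Download every indexed package whose id is not in skip_ids.

    Assumption: `downloader` looks like nltk.downloader.Downloader, with
    packages() yielding objects that have an `id` attribute and
    download(package_id) returning a truthy value on success.
    Returns the list of package ids that failed.
    """
    failed = []
    for package in downloader.packages():
        if package.id in skip_ids:
            continue
        try:
            ok = downloader.download(package.id)
        except OSError:
            # The traceback in this thread surfaces as an OSError during unzip.
            ok = False
        if not ok:
            failed.append(package.id)
    return failed
```

With NLTK installed, the call would be something like download_all_except(nltk.downloader.Downloader()); anything it returns is a package that still needs a second look.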

@demidovakatya

I'd like to check that I understand what you mentioned.

Does it mean the following change?

<package author="David Kamholz" checksum="e13211688738201c0a5bd5b2f50e94ab" id="panlex_lite" license="CC0 1.0 Universal" name="PanLex Lite Corpus" size="2202492316" subdir="corpora" unzip="1" unzipped_size="5778483185" url="http://dev.panlex.org/db/panlex_lite.zip" webpage="http://panlex.org/" />
<package author="Jonathan Pool (editor)" checksum="59a08f6c19d1d6d72cc03189983c8045" id="panlex_swadesh" license="CC0 1.0 Universal" name="PanLex Swadesh Corpora" size="2699578" subdir="corpora" unzip="0" unzipped_size="4103346" url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/panlex_swadesh.zip" webpage="http://panlex.org/" />

=>

<package author="David Kamholz" checksum="e13211688738201c0a5bd5b2f50e94ab" id="_lite" license="CC0 1.0 Universal" name="PanLex Lite Corpus" size="2202492316" subdir="corpora" unzip="1" unzipped_size="5778483185" url="http://dev.panlex.org/db/panlex_lite.zip" webpage="http://panlex.org/" />
<package author="Jonathan Pool (editor)" checksum="59a08f6c19d1d6d72cc03189983c8045" id="_swadesh" license="CC0 1.0 Universal" name="PanLex Swadesh Corpora" size="2699578" subdir="corpora" unzip="0" unzipped_size="4103346" url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/panlex_swadesh.zip" webpage="http://panlex.org/" />

@demidovakatya,
Thank you. I ran into the same problem.

Downloading panlex_lite should work fine now

Again not working.

I don't have bandwidth to test this. Our nltk_data page points at the April 1 version, which was not touched when the May 1 version was added recently.

@kamholz: would you mind doing the following to check if it still works please? python -m nltk.downloader panlex_lite

Sorry this keeps happening. It's hard to debug, because I often can't reproduce the reported errors. In this case, when I run python -m nltk.downloader panlex_lite, it doesn't report any error and unzips. However, the MD5 sum at https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml is incorrect. I don't know how that happened, since the file has not changed. The entry should read as follows:

    <package author="David Kamholz" checksum="3156099b9acb623725d63c727fd8591d" id="panlex_lite" license="CC0 1.0 Universal" name="PanLex Lite Corpus" size="2357864277" subdir="corpora" unzip="1" unzipped_size="5993562112" url="https://db.panlex.org/panlex_lite-20170401.zip" webpage="http://panlex.org/" />

I have also updated the URL above (but that shouldn't have made a difference for this issue, since the old one redirects), and the sizes.
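For anyone who wants to check a downloaded copy against the checksum in the entry above, here is a minimal sketch. It assumes the index checksum is a plain MD5 hex digest of the zip file, as described in this comment; md5_of_file is a hypothetical helper name.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of the file at `path`, read in chunks.

    Reading in chunks keeps memory use flat even for a multi-GB archive.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running it on the panlex_lite.zip under your nltk_data/corpora directory and comparing the result with 3156099b9acb623725d63c727fd8591d should show whether the local file matches the corrected entry.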

Thanks for this, @kamholz. I've pushed a corrected index file using these checksums.
@clockwiser would you please try again and let us know how you get on?

I tried: python -m nltk.downloader -u https://gist.githubusercontent.com/demidovakatya/61dab385d74065ae825c80496a197980/raw/c6ff7fbf44265c7f8c9e961e3e1158cd812d6af1/index.xml all and other URLs, but they all return an HTTP 403 Forbidden error. Any suggestions, or a new URL that will work?

@sokhnavor this is caused by #1787

@alvations thanks! I see:
PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA
I used the Windows Command Prompt and it does not work: wget is not recognized as an internal or external command. I'm pretty new to the command line and to the Windows flavor of it. Is there any workaround to get this working from the Command Prompt? I would really appreciate it.
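On Windows without wget, the download and unzip steps can be done from Python itself. A rough sketch using only the standard library; the function name fetch_and_unpack is a placeholder, and the archive URL in the usage note is the one from the commands above.

```python
import os
import shutil
import tempfile
import urllib.request
import zipfile


def fetch_and_unpack(url, dest_dir):
    """Download the zip archive at `url` and extract it into `dest_dir`.

    Returns the sorted top-level names found in the archive.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Stream to a temporary file rather than holding the archive in memory.
    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
        with urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        tmp_path = tmp.name
    try:
        with zipfile.ZipFile(tmp_path) as archive:
            archive.extractall(dest_dir)
            top_level = sorted({name.split("/")[0] for name in archive.namelist()})
    finally:
        # Remove the temporary download whether or not extraction succeeded.
        os.remove(tmp_path)
    return top_level
```

Called with https://github.com/nltk/nltk_data/archive/gh-pages.zip and a folder of your choice, this should leave an nltk_data-gh-pages folder there, which still needs to be moved or renamed as in the mv step.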
