Platform: Python 3.5 on Mac OS X 10.11.2
Steps to reproduce:
Symptoms:
# Partial console write:
[nltk_data] | Downloading package panlex_lite to
[nltk_data] | /Users/beng/nltk_data...
[nltk_data] | Unzipping corpora/panlex_lite.zip.
Traceback (most recent call last):
File "
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 543, in incr_download
for msg in self.incr_download(info.children, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 529, in incr_download
for msg in self._download_list(info_or_id, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 572, in _download_list
for msg in self.incr_download(item, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
for msg in self._download_package(info, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
for msg in _unzip_iter(filepath, zipdir, verbose=False):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
outfile.write(contents)
OSError: [Errno 22] Invalid argument
@grayben – would you please install the current version of NLTK and report if you still have this issue?
@stevenbird sorry for my delay in replying - you know how uni assignments can be!
I was experiencing the issue on v3.2. I just upgraded to v3.2.1 and am having the same issue.
@grayben How did you install NLTK? Do you have an error when downloading a single corpus, e.g. nltk.download('brown')? Do you have an error when using Python 2.7?
@alvations
Additional information: a number of my classmates have reported what appears to be the same problem, though I can't comment on their configurations or exactly what they did to encounter the issue.
@grayben could you run the following lines of code and see whether you get the same [0, 448887900, 85839474] output?
>>> import zipfile
>>> plzip = '/Users/beng/nltk_data/corpora/panlex_lite.zip'
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]
On the command line outside python, what is the output for the following?:
$ ls -lah /Users/beng//nltk_data/corpora/
Your code -> my output:
>>> import zipfile
>>> plzip = ' /Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1009, in __init__
self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: ' /Users/beng//nltk_data/corpora/panlex_lite.zip'
I then changed ' /Users/beng//nltk_data/corpora/panlex_lite.zip' to '/Users/beng//nltk_data/corpora/panlex_lite.zip' (no space before root):
>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1026, in __init__
self._RealGetContents()
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1093, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Bens-MacBook-Pro:10K-Extractor beng$ ls -lah /Users/beng//nltk_data/corpora/
total 966608
drwxr-xr-x 152 beng staff 5.0K 19 Apr 16:26 .
drwxr-xr-x 11 beng staff 374B 3 Mar 14:41 ..
drwxr-xr-x 5 beng staff 170B 3 Mar 14:32 abc
-rw-r--r-- 1 beng staff 1.4M 3 Mar 14:32 abc.zip
drwxr-xr-x 4 beng staff 136B 3 Mar 14:32 alpino
-rw-r--r-- 1 beng staff 2.7M 3 Mar 14:32 alpino.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:32 biocreative_ppi
-rw-r--r-- 1 beng staff 218K 3 Mar 14:32 biocreative_ppi.zip
drwxr-xr-x 505 beng staff 17K 3 Mar 14:32 brown
-rw-r--r-- 1 beng staff 3.2M 3 Mar 14:32 brown.zip
drwxr-xr-x 509 beng staff 17K 3 Mar 14:32 brown_tei
-rw-r--r-- 1 beng staff 8.3M 3 Mar 14:32 brown_tei.zip
drwxr-xr-x 1389 beng staff 46K 3 Mar 14:33 cess_cat
-rw-r--r-- 1 beng staff 5.1M 3 Mar 14:33 cess_cat.zip
drwxr-xr-x 612 beng staff 20K 3 Mar 14:33 cess_esp
-rw-r--r-- 1 beng staff 2.1M 3 Mar 14:33 cess_esp.zip
drwxr-xr-x 10 beng staff 340B 3 Mar 14:33 chat80
-rw-r--r-- 1 beng staff 19K 3 Mar 14:33 chat80.zip
drwxr-xr-x 3 beng staff 102B 3 Mar 14:33 city_database
-rw-r--r-- 1 beng staff 1.7K 3 Mar 14:33 city_database.zip
drwxr-xr-x 4 beng staff 136B 3 Mar 14:33 cmudict
-rw-r--r-- 1 beng staff 875K 3 Mar 14:33 cmudict.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:33 comparative_sentences
-rw-r--r-- 1 beng staff 273K 3 Mar 14:33 comparative_sentences.zip
-rw-r--r-- 1 beng staff 11M 3 Mar 14:33 comtrans.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:33 conll2000
-rw-r--r-- 1 beng staff 739K 3 Mar 14:33 conll2000.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:33 conll2002
-rw-r--r-- 1 beng staff 1.8M 3 Mar 14:33 conll2002.zip
-rw-r--r-- 1 beng staff 1.2M 3 Mar 14:33 conll2007.zip
drwxr-xr-x 453 beng staff 15K 3 Mar 14:33 crubadan
-rw-r--r-- 1 beng staff 5.0M 3 Mar 14:33 crubadan.zip
drwxr-xr-x 201 beng staff 6.7K 3 Mar 14:33 dependency_treebank
-rw-r--r-- 1 beng staff 447K 3 Mar 14:33 dependency_treebank.zip
drwxr-xr-x 14 beng staff 476B 3 Mar 14:33 europarl_raw
-rw-r--r-- 1 beng staff 12M 3 Mar 14:33 europarl_raw.zip
drwxr-xr-x 4 beng staff 136B 3 Mar 14:33 floresta
-rw-r--r-- 1 beng staff 1.8M 3 Mar 14:33 floresta.zip
drwxr-xr-x 16 beng staff 544B 3 Mar 14:34 framenet_v15
-rw-r--r-- 1 beng staff 66M 3 Mar 14:33 framenet_v15.zip
drwxr-xr-x 11 beng staff 374B 3 Mar 14:34 gazetteers
-rw-r--r-- 1 beng staff 8.1K 3 Mar 14:34 gazetteers.zip
drwxr-xr-x 11 beng staff 374B 3 Mar 14:34 genesis
-rw-r--r-- 1 beng staff 462K 3 Mar 14:34 genesis.zip
drwxr-xr-x 21 beng staff 714B 3 Mar 14:34 gutenberg
-rw-r--r-- 1 beng staff 4.1M 3 Mar 14:34 gutenberg.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:34 ieer
-rw-r--r-- 1 beng staff 162K 3 Mar 14:34 ieer.zip
drwxr-xr-x 59 beng staff 2.0K 3 Mar 14:34 inaugural
-rw-r--r-- 1 beng staff 314K 3 Mar 14:34 inaugural.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:34 indian
-rw-r--r-- 1 beng staff 195K 3 Mar 14:34 indian.zip
-rw-r--r-- 1 beng staff 16M 3 Mar 14:34 jeita.zip
drwxr-xr-x 22 beng staff 748B 3 Mar 14:34 kimmo
-rw-r--r-- 1 beng staff 183K 3 Mar 14:34 kimmo.zip
-rw-r--r-- 1 beng staff 8.4M 3 Mar 14:34 knbc.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:34 lin_thesaurus
-rw-r--r-- 1 beng staff 85M 3 Mar 14:34 lin_thesaurus.zip
drwxr-xr-x 112 beng staff 3.7K 3 Mar 14:34 mac_morpho
-rw-r--r-- 1 beng staff 2.9M 3 Mar 14:34 mac_morpho.zip
-rw-r--r-- 1 beng staff 5.9M 3 Mar 14:34 machado.zip
-rw-r--r-- 1 beng staff 1.5M 3 Mar 14:34 masc_tagged.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:34 movie_reviews
-rw-r--r-- 1 beng staff 3.8M 3 Mar 14:34 movie_reviews.zip
drwxr-xr-x 56 beng staff 1.9K 3 Mar 14:38 mte_teip5
-rw-r--r-- 1 beng staff 14M 3 Mar 14:38 mte_teip5.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:34 names
-rw-r--r-- 1 beng staff 21K 3 Mar 14:34 names.zip
-rw-r--r-- 1 beng staff 6.4M 3 Mar 14:35 nombank.1.0.zip
drwxr-xr-x 19 beng staff 646B 3 Mar 14:35 nps_chat
-rw-r--r-- 1 beng staff 294K 3 Mar 14:35 nps_chat.zip
drwxr-xr-x 32 beng staff 1.1K 3 Mar 14:35 omw
-rw-r--r-- 1 beng staff 11M 3 Mar 14:35 omw.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 opinion_lexicon
-rw-r--r-- 1 beng staff 24K 3 Mar 14:35 opinion_lexicon.zip
drwxr-xr-x 4 beng staff 136B 21 Mar 17:54 panlex_lite
-rw-r--r-- 1 beng staff 58M 19 Apr 16:28 panlex_lite.zip
-rw-r--r-- 1 beng staff 2.6M 3 Mar 14:37 panlex_swadesh.zip
drwxr-xr-x 21 beng staff 714B 3 Mar 14:35 paradigms
-rw-r--r-- 1 beng staff 24K 3 Mar 14:35 paradigms.zip
drwxr-xr-x 475 beng staff 16K 3 Mar 14:35 pil
-rw-r--r-- 1 beng staff 1.4M 3 Mar 14:35 pil.zip
drwxr-xr-x 16 beng staff 544B 3 Mar 14:35 pl196x
-rw-r--r-- 1 beng staff 6.7M 3 Mar 14:35 pl196x.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:35 ppattach
-rw-r--r-- 1 beng staff 763K 3 Mar 14:35 ppattach.zip
drwxr-xr-x 8 beng staff 272B 3 Mar 14:35 problem_reports
-rw-r--r-- 1 beng staff 1.0M 3 Mar 14:35 problem_reports.zip
drwxr-xr-x 8 beng staff 272B 3 Mar 14:35 product_reviews_1
-rw-r--r-- 1 beng staff 138K 3 Mar 14:35 product_reviews_1.zip
drwxr-xr-x 12 beng staff 408B 3 Mar 14:35 product_reviews_2
-rw-r--r-- 1 beng staff 167K 3 Mar 14:35 product_reviews_2.zip
-rw-r--r-- 1 beng staff 5.1M 3 Mar 14:35 propbank.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 pros_cons
-rw-r--r-- 1 beng staff 729K 3 Mar 14:35 pros_cons.zip
drwxr-xr-x 3 beng staff 102B 3 Mar 14:35 ptb
-rw-r--r-- 1 beng staff 6.1K 3 Mar 14:35 ptb.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 qc
-rw-r--r-- 1 beng staff 123K 3 Mar 14:35 qc.zip
-rw-r--r-- 1 beng staff 6.1M 3 Mar 14:35 reuters.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:35 rte
-rw-r--r-- 1 beng staff 377K 3 Mar 14:35 rte.zip
-rw-r--r-- 1 beng staff 4.2M 3 Mar 14:35 semcor.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:35 senseval
-rw-r--r-- 1 beng staff 2.1M 3 Mar 14:35 senseval.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 sentence_polarity
-rw-r--r-- 1 beng staff 479K 3 Mar 14:35 sentence_polarity.zip
drwxr-xr-x 4 beng staff 136B 3 Mar 14:35 sentiwordnet
-rw-r--r-- 1 beng staff 4.5M 3 Mar 14:35 sentiwordnet.zip
drwxr-xr-x 13 beng staff 442B 3 Mar 14:35 shakespeare
-rw-r--r-- 1 beng staff 464K 3 Mar 14:35 shakespeare.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 sinica_treebank
-rw-r--r-- 1 beng staff 878K 3 Mar 14:35 sinica_treebank.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:35 smultron
-rw-r--r-- 1 beng staff 162K 3 Mar 14:35 smultron.zip
drwxr-xr-x 68 beng staff 2.3K 3 Mar 14:35 state_union
-rw-r--r-- 1 beng staff 790K 3 Mar 14:35 state_union.zip
drwxr-xr-x 17 beng staff 578B 3 Mar 14:35 stopwords
-rw-r--r-- 1 beng staff 8.9K 3 Mar 14:35 stopwords.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 subjectivity
-rw-r--r-- 1 beng staff 509K 3 Mar 14:35 subjectivity.zip
drwxr-xr-x 27 beng staff 918B 3 Mar 14:35 swadesh
-rw-r--r-- 1 beng staff 22K 3 Mar 14:35 swadesh.zip
drwxr-xr-x 8 beng staff 272B 3 Mar 14:35 switchboard
-rw-r--r-- 1 beng staff 773K 3 Mar 14:35 switchboard.zip
drwxr-xr-x 39 beng staff 1.3K 3 Mar 14:35 timit
-rw-r--r-- 1 beng staff 21M 3 Mar 14:35 timit.zip
drwxr-xr-x 8 beng staff 272B 3 Mar 14:35 toolbox
-rw-r--r-- 1 beng staff 245K 3 Mar 14:35 toolbox.zip
drwxr-xr-x 12 beng staff 408B 3 Mar 14:36 treebank
-rw-r--r-- 1 beng staff 1.6M 3 Mar 14:36 treebank.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:36 twitter_samples
-rw-r--r-- 1 beng staff 15M 3 Mar 14:36 twitter_samples.zip
drwxr-xr-x 337 beng staff 11K 3 Mar 14:36 udhr
-rw-r--r-- 1 beng staff 1.1M 3 Mar 14:36 udhr.zip
drwxr-xr-x 390 beng staff 13K 3 Mar 14:36 udhr2
-rw-r--r-- 1 beng staff 1.6M 3 Mar 14:36 udhr2.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:36 unicode_samples
-rw-r--r-- 1 beng staff 1.2K 3 Mar 14:36 unicode_samples.zip
-rw-r--r-- 1 beng staff 25M 3 Mar 14:36 universal_treebanks_v20.zip
drwxr-xr-x 242 beng staff 8.0K 3 Mar 14:36 verbnet
-rw-r--r-- 1 beng staff 316K 3 Mar 14:36 verbnet.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:36 webtext
-rw-r--r-- 1 beng staff 631K 3 Mar 14:36 webtext.zip
drwxr-xr-x 20 beng staff 680B 3 Mar 14:36 wordnet
-rw-r--r-- 1 beng staff 10M 3 Mar 14:36 wordnet.zip
drwxr-xr-x 30 beng staff 1.0K 3 Mar 14:36 wordnet_ic
-rw-r--r-- 1 beng staff 11M 3 Mar 14:36 wordnet_ic.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:36 words
-rw-r--r-- 1 beng staff 740K 3 Mar 14:36 words.zip
drwxr-xr-x 3 beng staff 102B 3 Mar 14:36 ycoe
-rw-r--r-- 1 beng staff 477B 3 Mar 14:36 ycoe.zip
This suggests that when downloading, the file gets corrupted (possibly due to broken internet connection):
>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1026, in __init__
self._RealGetContents()
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/zipfile.py", line 1093, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Go to /Users/beng//nltk_data/corpora/, delete the panlex_lite.zip file, and then download it again. Note that the download can take two hours or more when the server is overloaded or your internet connection is slow.
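Before deleting and re-downloading, it can help to tell a truncated download apart from an archive that is structurally fine. The following is a minimal sketch using only the standard library; the helper name describe_zip is made up for illustration and is not part of NLTK. Note that testzip() reads every member, so it will be slow on a multi-gigabyte archive.

```python
import zipfile
from pathlib import Path


def describe_zip(path):
    """Return a short diagnosis of a downloaded archive (hypothetical helper)."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if not zipfile.is_zipfile(str(p)):
        # Typical for a truncated download: the central directory at the
        # end of the file was never written.
        return "not a zip file ({} bytes on disk)".format(p.stat().st_size)
    with zipfile.ZipFile(str(p)) as zf:
        # testzip() reads every member and returns the first bad name, or None.
        bad = zf.testzip()
        crcs = [info.CRC for info in zf.infolist()]
    if bad is not None:
        return "corrupt member: {}".format(bad)
    return "ok, CRCs: {}".format(crcs)
```

Calling describe_zip('/Users/beng/nltk_data/corpora/panlex_lite.zip') on a partial download should report "not a zip file" together with the size on disk, which can then be compared against the size listed in the package index.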
I did the following (three times):
rm /Users/beng//nltk_data/corpora/panlex_lite.zip
python3
>>> import nltk
>>> nltk.download('panlex_lite')
[nltk_data] Downloading package panlex_lite to
[nltk_data] /Users/beng/nltk_data...
[nltk_data] Unzipping corpora/panlex_lite.zip.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
for msg in self._download_package(info, download_dir, force):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
for msg in _unzip_iter(filepath, zipdir, verbose=False):
File "/usr/local/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
outfile.write(contents)
OSError: [Errno 22] Invalid argument
>>>
However, please also note the following command input/output:
>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> import zipfile
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]
Can you also run rm -rf /Users/beng//nltk_data/corpora/panlex_lite before starting python3?
i.e.:
$ rm /Users/beng//nltk_data/corpora/panlex_lite.zip
$ rm -rf /Users/beng//nltk_data/corpora/panlex_lite
$ python -m nltk.downloader panlex_lite
$ python3
>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> import zipfile
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]
I couldn't reproduce your OSError on Ubuntu 14.04 with Python 3.5.1:
alvas@ubi:~/nltk_data/corpora$ ls panlex_
panlex_lite.zip panlex_swadesh.zip
alvas@ubi:~/nltk_data/corpora$ cd
alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> nltk.download('panlex_lite')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'nltk' is not defined
>>> import nltk
>>> nltk.download('panlex_lite')
[nltk_data] Downloading package panlex_lite to
[nltk_data] /home/alvas/nltk_data...
[nltk_data] Package panlex_lite is already up-to-date!
True
>>> exit()
alvas@ubi:~$ python3
Python 3.5.1 (default, Dec 18 2015, 00:00:00)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('panlex_lite')
[nltk_data] Downloading package panlex_lite to
[nltk_data] /home/alvas/nltk_data...
[nltk_data] Package panlex_lite is already up-to-date!
True
BTW, if you're not going to use panlex, the rest of NLTK will work just fine without it =)
Bens-MacBook-Pro:work beng$ rm -rf /Users/beng//nltk_data/corpora/panlex_lite
Bens-MacBook-Pro:work beng$ ls -lah /Users/beng//nltk_data/corpora
total 4361152
drwxr-xr-x 151 beng staff 5.0K 20 Apr 13:12 .
drwxr-xr-x 11 beng staff 374B 3 Mar 14:41 ..
drwxr-xr-x 5 beng staff 170B 3 Mar 14:32 abc
-rw-r--r-- 1 beng staff 1.4M 3 Mar 14:32 abc.zip
drwxr-xr-x 4 beng staff 136B 3 Mar 14:32 alpino
-rw-r--r-- 1 beng staff 2.7M 3 Mar 14:32 alpino.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:32 biocreative_ppi
-rw-r--r-- 1 beng staff 218K 3 Mar 14:32 biocreative_ppi.zip
drwxr-xr-x 505 beng staff 17K 3 Mar 14:32 brown
-rw-r--r-- 1 beng staff 3.2M 3 Mar 14:32 brown.zip
drwxr-xr-x 509 beng staff 17K 3 Mar 14:32 brown_tei
-rw-r--r-- 1 beng staff 8.3M 3 Mar 14:32 brown_tei.zip
drwxr-xr-x 1389 beng staff 46K 3 Mar 14:33 cess_cat
-rw-r--r-- 1 beng staff 5.1M 3 Mar 14:33 cess_cat.zip
drwxr-xr-x 612 beng staff 20K 3 Mar 14:33 cess_esp
-rw-r--r-- 1 beng staff 2.1M 3 Mar 14:33 cess_esp.zip
drwxr-xr-x 10 beng staff 340B 3 Mar 14:33 chat80
-rw-r--r-- 1 beng staff 19K 3 Mar 14:33 chat80.zip
drwxr-xr-x 3 beng staff 102B 3 Mar 14:33 city_database
-rw-r--r-- 1 beng staff 1.7K 3 Mar 14:33 city_database.zip
drwxr-xr-x 4 beng staff 136B 3 Mar 14:33 cmudict
-rw-r--r-- 1 beng staff 875K 3 Mar 14:33 cmudict.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:33 comparative_sentences
-rw-r--r-- 1 beng staff 273K 3 Mar 14:33 comparative_sentences.zip
-rw-r--r-- 1 beng staff 11M 3 Mar 14:33 comtrans.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:33 conll2000
-rw-r--r-- 1 beng staff 739K 3 Mar 14:33 conll2000.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:33 conll2002
-rw-r--r-- 1 beng staff 1.8M 3 Mar 14:33 conll2002.zip
-rw-r--r-- 1 beng staff 1.2M 3 Mar 14:33 conll2007.zip
drwxr-xr-x 453 beng staff 15K 3 Mar 14:33 crubadan
-rw-r--r-- 1 beng staff 5.0M 3 Mar 14:33 crubadan.zip
drwxr-xr-x 201 beng staff 6.7K 3 Mar 14:33 dependency_treebank
-rw-r--r-- 1 beng staff 447K 3 Mar 14:33 dependency_treebank.zip
drwxr-xr-x 14 beng staff 476B 3 Mar 14:33 europarl_raw
-rw-r--r-- 1 beng staff 12M 3 Mar 14:33 europarl_raw.zip
drwxr-xr-x 4 beng staff 136B 3 Mar 14:33 floresta
-rw-r--r-- 1 beng staff 1.8M 3 Mar 14:33 floresta.zip
drwxr-xr-x 16 beng staff 544B 3 Mar 14:34 framenet_v15
-rw-r--r-- 1 beng staff 66M 3 Mar 14:33 framenet_v15.zip
drwxr-xr-x 11 beng staff 374B 3 Mar 14:34 gazetteers
-rw-r--r-- 1 beng staff 8.1K 3 Mar 14:34 gazetteers.zip
drwxr-xr-x 11 beng staff 374B 3 Mar 14:34 genesis
-rw-r--r-- 1 beng staff 462K 3 Mar 14:34 genesis.zip
drwxr-xr-x 21 beng staff 714B 3 Mar 14:34 gutenberg
-rw-r--r-- 1 beng staff 4.1M 3 Mar 14:34 gutenberg.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:34 ieer
-rw-r--r-- 1 beng staff 162K 3 Mar 14:34 ieer.zip
drwxr-xr-x 59 beng staff 2.0K 3 Mar 14:34 inaugural
-rw-r--r-- 1 beng staff 314K 3 Mar 14:34 inaugural.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:34 indian
-rw-r--r-- 1 beng staff 195K 3 Mar 14:34 indian.zip
-rw-r--r-- 1 beng staff 16M 3 Mar 14:34 jeita.zip
drwxr-xr-x 22 beng staff 748B 3 Mar 14:34 kimmo
-rw-r--r-- 1 beng staff 183K 3 Mar 14:34 kimmo.zip
-rw-r--r-- 1 beng staff 8.4M 3 Mar 14:34 knbc.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:34 lin_thesaurus
-rw-r--r-- 1 beng staff 85M 3 Mar 14:34 lin_thesaurus.zip
drwxr-xr-x 112 beng staff 3.7K 3 Mar 14:34 mac_morpho
-rw-r--r-- 1 beng staff 2.9M 3 Mar 14:34 mac_morpho.zip
-rw-r--r-- 1 beng staff 5.9M 3 Mar 14:34 machado.zip
-rw-r--r-- 1 beng staff 1.5M 3 Mar 14:34 masc_tagged.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:34 movie_reviews
-rw-r--r-- 1 beng staff 3.8M 3 Mar 14:34 movie_reviews.zip
drwxr-xr-x 56 beng staff 1.9K 3 Mar 14:38 mte_teip5
-rw-r--r-- 1 beng staff 14M 3 Mar 14:38 mte_teip5.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:34 names
-rw-r--r-- 1 beng staff 21K 3 Mar 14:34 names.zip
-rw-r--r-- 1 beng staff 6.4M 3 Mar 14:35 nombank.1.0.zip
drwxr-xr-x 19 beng staff 646B 3 Mar 14:35 nps_chat
-rw-r--r-- 1 beng staff 294K 3 Mar 14:35 nps_chat.zip
drwxr-xr-x 32 beng staff 1.1K 3 Mar 14:35 omw
-rw-r--r-- 1 beng staff 11M 3 Mar 14:35 omw.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 opinion_lexicon
-rw-r--r-- 1 beng staff 24K 3 Mar 14:35 opinion_lexicon.zip
-rw-r--r-- 1 beng staff 1.7G 20 Apr 12:46 panlex_lite.zip
-rw-r--r-- 1 beng staff 2.6M 3 Mar 14:37 panlex_swadesh.zip
drwxr-xr-x 21 beng staff 714B 3 Mar 14:35 paradigms
-rw-r--r-- 1 beng staff 24K 3 Mar 14:35 paradigms.zip
drwxr-xr-x 475 beng staff 16K 3 Mar 14:35 pil
-rw-r--r-- 1 beng staff 1.4M 3 Mar 14:35 pil.zip
drwxr-xr-x 16 beng staff 544B 3 Mar 14:35 pl196x
-rw-r--r-- 1 beng staff 6.7M 3 Mar 14:35 pl196x.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:35 ppattach
-rw-r--r-- 1 beng staff 763K 3 Mar 14:35 ppattach.zip
drwxr-xr-x 8 beng staff 272B 3 Mar 14:35 problem_reports
-rw-r--r-- 1 beng staff 1.0M 3 Mar 14:35 problem_reports.zip
drwxr-xr-x 8 beng staff 272B 3 Mar 14:35 product_reviews_1
-rw-r--r-- 1 beng staff 138K 3 Mar 14:35 product_reviews_1.zip
drwxr-xr-x 12 beng staff 408B 3 Mar 14:35 product_reviews_2
-rw-r--r-- 1 beng staff 167K 3 Mar 14:35 product_reviews_2.zip
-rw-r--r-- 1 beng staff 5.1M 3 Mar 14:35 propbank.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 pros_cons
-rw-r--r-- 1 beng staff 729K 3 Mar 14:35 pros_cons.zip
drwxr-xr-x 3 beng staff 102B 3 Mar 14:35 ptb
-rw-r--r-- 1 beng staff 6.1K 3 Mar 14:35 ptb.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 qc
-rw-r--r-- 1 beng staff 123K 3 Mar 14:35 qc.zip
-rw-r--r-- 1 beng staff 6.1M 3 Mar 14:35 reuters.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:35 rte
-rw-r--r-- 1 beng staff 377K 3 Mar 14:35 rte.zip
-rw-r--r-- 1 beng staff 4.2M 3 Mar 14:35 semcor.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:35 senseval
-rw-r--r-- 1 beng staff 2.1M 3 Mar 14:35 senseval.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 sentence_polarity
-rw-r--r-- 1 beng staff 479K 3 Mar 14:35 sentence_polarity.zip
drwxr-xr-x 4 beng staff 136B 3 Mar 14:35 sentiwordnet
-rw-r--r-- 1 beng staff 4.5M 3 Mar 14:35 sentiwordnet.zip
drwxr-xr-x 13 beng staff 442B 3 Mar 14:35 shakespeare
-rw-r--r-- 1 beng staff 464K 3 Mar 14:35 shakespeare.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 sinica_treebank
-rw-r--r-- 1 beng staff 878K 3 Mar 14:35 sinica_treebank.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:35 smultron
-rw-r--r-- 1 beng staff 162K 3 Mar 14:35 smultron.zip
drwxr-xr-x 68 beng staff 2.3K 3 Mar 14:35 state_union
-rw-r--r-- 1 beng staff 790K 3 Mar 14:35 state_union.zip
drwxr-xr-x 17 beng staff 578B 3 Mar 14:35 stopwords
-rw-r--r-- 1 beng staff 8.9K 3 Mar 14:35 stopwords.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:35 subjectivity
-rw-r--r-- 1 beng staff 509K 3 Mar 14:35 subjectivity.zip
drwxr-xr-x 27 beng staff 918B 3 Mar 14:35 swadesh
-rw-r--r-- 1 beng staff 22K 3 Mar 14:35 swadesh.zip
drwxr-xr-x 8 beng staff 272B 3 Mar 14:35 switchboard
-rw-r--r-- 1 beng staff 773K 3 Mar 14:35 switchboard.zip
drwxr-xr-x 39 beng staff 1.3K 3 Mar 14:35 timit
-rw-r--r-- 1 beng staff 21M 3 Mar 14:35 timit.zip
drwxr-xr-x 8 beng staff 272B 3 Mar 14:35 toolbox
-rw-r--r-- 1 beng staff 245K 3 Mar 14:35 toolbox.zip
drwxr-xr-x 12 beng staff 408B 3 Mar 14:36 treebank
-rw-r--r-- 1 beng staff 1.6M 3 Mar 14:36 treebank.zip
drwxr-xr-x 7 beng staff 238B 3 Mar 14:36 twitter_samples
-rw-r--r-- 1 beng staff 15M 3 Mar 14:36 twitter_samples.zip
drwxr-xr-x 337 beng staff 11K 3 Mar 14:36 udhr
-rw-r--r-- 1 beng staff 1.1M 3 Mar 14:36 udhr.zip
drwxr-xr-x 390 beng staff 13K 3 Mar 14:36 udhr2
-rw-r--r-- 1 beng staff 1.6M 3 Mar 14:36 udhr2.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:36 unicode_samples
-rw-r--r-- 1 beng staff 1.2K 3 Mar 14:36 unicode_samples.zip
-rw-r--r-- 1 beng staff 25M 3 Mar 14:36 universal_treebanks_v20.zip
drwxr-xr-x 242 beng staff 8.0K 3 Mar 14:36 verbnet
-rw-r--r-- 1 beng staff 316K 3 Mar 14:36 verbnet.zip
drwxr-xr-x 9 beng staff 306B 3 Mar 14:36 webtext
-rw-r--r-- 1 beng staff 631K 3 Mar 14:36 webtext.zip
drwxr-xr-x 20 beng staff 680B 3 Mar 14:36 wordnet
-rw-r--r-- 1 beng staff 10M 3 Mar 14:36 wordnet.zip
drwxr-xr-x 30 beng staff 1.0K 3 Mar 14:36 wordnet_ic
-rw-r--r-- 1 beng staff 11M 3 Mar 14:36 wordnet_ic.zip
drwxr-xr-x 5 beng staff 170B 3 Mar 14:36 words
-rw-r--r-- 1 beng staff 740K 3 Mar 14:36 words.zip
drwxr-xr-x 3 beng staff 102B 3 Mar 14:36 ycoe
-rw-r--r-- 1 beng staff 477B 3 Mar 14:36 ycoe.zip
Bens-MacBook-Pro:work beng$ python3
Python 3.5.1 (default, Mar 3 2016, 14:25:53)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('panlex_lite')
[nltk_data] Downloading package panlex_lite to
[nltk_data] /Users/beng/nltk_data...
[nltk_data] Package panlex_lite is already up-to-date!
True
Furthermore, through the downloader GUI, downloading "all" finally succeeds, with all fields marked "installed".
Great! So there's no OSError now? It was the broken panlex_lite directory (left over from previous downloads) that caused the OSError. Once the infolist of the zipfile is right, there shouldn't be a problem.
Enjoy playing around with NLTK! Tell your friends/classmates to do the same too:
$ rm /Users/beng//nltk_data/corpora/panlex_lite.zip
$ rm -rf /Users/beng//nltk_data/corpora/panlex_lite
$ python -m nltk.downloader panlex_lite
$ python3
>>> plzip = '/Users/beng//nltk_data/corpora/panlex_lite.zip'
>>> import zipfile
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]
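For anyone who prefers to do the cleanup from Python rather than the shell, here is a rough equivalent of the two rm commands above. It is only a sketch: remove_stale_package is a hypothetical helper name, and the corpora path is whatever your own nltk_data directory is.

```python
import shutil
from pathlib import Path


def remove_stale_package(corpora_dir, package="panlex_lite"):
    """Delete <package>.zip and the extracted <package>/ directory, if present.

    Hypothetical helper, not part of NLTK. Returns the names it removed.
    """
    corpora = Path(corpora_dir).expanduser()
    removed = []
    archive = corpora / (package + ".zip")
    extracted = corpora / package
    if archive.is_file():
        archive.unlink()  # same effect as: rm <package>.zip
        removed.append(archive.name)
    if extracted.is_dir():
        shutil.rmtree(str(extracted))  # same effect as: rm -rf <package>
        removed.append(extracted.name)
    return removed
```

After remove_stale_package("~/nltk_data/corpora") returns, nltk.download('panlex_lite') starts from a clean state.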
Thanks!
I get the exact same problem with the latest NLTK 3.2.1, both on Ubuntu 16.04 (where it crashes my whole OS) and on OS X, where I get the same errors as the OP. I'm surprised that this issue has been closed as if there were nothing wrong.
When trying the workaround, it fails after this step, as it tries to extract the archive automatically right after downloading it: python -m nltk.downloader panlex_lite
[nltk_data] Downloading package panlex_lite to
[nltk_data] /Users/houmie/nltk_data...
[nltk_data] Unzipping corpora/panlex_lite.zip.
Traceback (most recent call last):
File "/Users/houmie/.pyenv/versions/3.5.1/lib/python3.5/runpy.py", line 170, in _run_module_as_main
"__main__", mod_spec)
File "/Users/houmie/.pyenv/versions/3.5.1/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 2268, in <module>
halt_on_error=options.halt_on_error)
File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 664, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 549, in incr_download
for msg in self._download_package(info, download_dir, force):
File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 638, in _download_package
for msg in _unzip_iter(filepath, zipdir, verbose=False):
File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 2039, in _unzip_iter
outfile.write(contents)
OSError: [Errno 22] Invalid argument
Thanks
@houmie what is your output for:
$ rm /Users/houmie//nltk_data/corpora/panlex_lite.zip
$ rm -rf /Users/houmie//nltk_data/corpora/panlex_lite
$ python -m nltk.downloader panlex_lite
$ python3
>>> plzip = '/Users/houmie//nltk_data/corpora/panlex_lite.zip'
>>> import zipfile
>>> [zifo.CRC for zifo in zipfile.ZipFile(plzip).infolist()]
[0, 448887900, 85839474]
This is not fixed - it's happening for python 2.7, 3.4.3, and 3.5.1. The panlex_lite download hangs for quite a while, and then unzipping freezes the GUI and/or causes the OSError.
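Every OSError traceback in this thread ends in a single outfile.write(contents) call. One possible explanation, which nobody in this thread has confirmed, is that some macOS/Python builds reject one very large write with Errno 22. Under that assumption, a manual workaround would be to unzip the already-downloaded archive in bounded chunks. The sketch below uses only the standard library; extract_streaming and CHUNK_SIZE are names invented here, not NLTK APIs.

```python
import shutil
import zipfile
from pathlib import Path

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB per write; an arbitrary choice


def extract_streaming(zip_path, dest_dir, chunk_size=CHUNK_SIZE):
    """Extract every member of zip_path into dest_dir in fixed-size chunks."""
    dest = Path(dest_dir).resolve()
    written = []
    with zipfile.ZipFile(str(zip_path)) as zf:
        for info in zf.infolist():
            target = (dest / info.filename).resolve()
            # Refuse members that would land outside dest_dir.
            if dest != target and dest not in target.parents:
                raise ValueError("unsafe member name: " + info.filename)
            if info.filename.endswith("/"):
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            # copyfileobj never holds more than chunk_size bytes at once,
            # so no single write() call sees the whole member.
            with zf.open(info) as src, open(str(target), "wb") as out:
                shutil.copyfileobj(src, out, chunk_size)
            written.append(info.filename)
    return written
```

For example, extract_streaming('/Users/houmie/nltk_data/corpora/panlex_lite.zip', '/Users/houmie/nltk_data/corpora') would unpack the archive next to the zip file. If the error persists even with chunked writes, the assumption above is wrong and the cause lies elsewhere.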
I hit the same issue on my MacBook Pro (OS X El Capitan, Anaconda 1.4.0 + Python 3.5.2). I tried NLTK 3.2.1 installed with "conda install nltk" and the GitHub master branch installed with "sudo python3 setup.py install". The interesting part is that I never got the CRC list [0, 448887900, 85839474]; I always got [0, 448887900, 84607019], even after downloading panlex_lite.zip more than 5 times. Any hint or clue?
Unfortunately they refuse to acknowledge that the problem even exists. I reported this in May 2016 and there is still no acknowledgement of the problem.
I just tried it again via the GUI download and still get this error message shown in the console:
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 655, in download
self._interactive_download()
File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 974, in _interactive_download
DownloaderGUI(self).mainloop()
File "/Users/houmie/.pyenv/versions/venv35/lib/python3.5/site-packages/nltk/downloader.py", line 1709, in mainloop
self.top.mainloop(*args, **kwargs)
File "/Users/houmie/.pyenv/versions/3.5.1/lib/python3.5/tkinter/__init__.py", line 1131, in mainloop
self.tk.mainloop(n)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
>>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 0: invalid continuation byte
This is a massive pain for me, as I need to go through the code and delete all the references to PanLex in order to get the packages working.
Hi, same here, hopefully if enough people report it then it's going to get fixed at some point ...
okay there, here's what I've done
d = nltk.downloader.Downloader()
d._packages.pop('panlex_lite')
d.download()
# error message
d._packages.pop('panlex_lite')
/usr/local/lib/python3.5/site-packages/nltk/downloader.py in info(self, id)
876 if id in self._packages: return self._packages[id]
877 if id in self._collections: return self._collections[id]
--> 878 raise ValueError('Package %r not found in index' % id)
879
880 def xmlinfo(self, id):
I guess we could add something like if id != 'panlex_lite' to the code...
But, as for me, the easiest way looks like this:
panlex
from it
python -m nltk.downloader -d /usr/local/share/nltk_data -u https://gist.githubusercontent.com/demidovakatya/61dab385d74065ae825c80496a197980/raw/c6ff7fbf44265c7f8c9e961e3e1158cd812d6af1/index.xml all
Aaaaaaand.... Done downloading collection all! 🎉🎉🎉🎉
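The command above points the downloader at a custom index.xml through the -u option. If you would rather build such an index yourself, one hypothetical variation is to drop the problematic packages from a local copy of the official index. The sketch below assumes the index consists of package elements carrying an id attribute and collection entries of the form item ref="..."; the item/ref part is an assumption about the file layout, so check it against the real index.xml first. drop_packages is a made-up name.

```python
import xml.etree.ElementTree as ET


def drop_packages(index_xml, unwanted_ids):
    """Return index_xml with the given package ids, and references to them, removed.

    Assumes <package id="..."> entries and <item ref="..."> references;
    both element names should be verified against the real index.
    """
    unwanted = set(unwanted_ids)
    root = ET.fromstring(index_xml)
    # Snapshot the tree first so removals do not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            is_package = child.tag == "package" and child.get("id") in unwanted
            is_reference = child.tag == "item" and child.get("ref") in unwanted
            if is_package or is_reference:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")
```

The filtered text can be saved to a file, served from any URL you control, and passed to python -m nltk.downloader -u in place of the gist URL.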
@demidovakatya
I'd like to understand that you mention that
that means
<package author="David Kamholz" checksum="e13211688738201c0a5bd5b2f50e94ab" id="panlex_lite" license="CC0 1.0 Universal" name="PanLex Lite Corpus" size="2202492316" subdir="corpora" unzip="1" unzipped_size="5778483185" url="http://dev.panlex.org/db/panlex_lite.zip" webpage="http://panlex.org/" />
<package author="Jonathan Pool (editor)" checksum="59a08f6c19d1d6d72cc03189983c8045" id="panlex_swadesh" license="CC0 1.0 Universal" name="PanLex Swadesh Corpora" size="2699578" subdir="corpora" unzip="0" unzipped_size="4103346" url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/panlex_swadesh.zip" webpage="http://panlex.org/" />
=>
<package author="David Kamholz" checksum="e13211688738201c0a5bd5b2f50e94ab" id="_lite" license="CC0 1.0 Universal" name="PanLex Lite Corpus" size="2202492316" subdir="corpora" unzip="1" unzipped_size="5778483185" url="http://dev.panlex.org/db/panlex_lite.zip" webpage="http://panlex.org/" />
<package author="Jonathan Pool (editor)" checksum="59a08f6c19d1d6d72cc03189983c8045" id="_swadesh" license="CC0 1.0 Universal" name="PanLex Swadesh Corpora" size="2699578" subdir="corpora" unzip="0" unzipped_size="4103346" url="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/panlex_swadesh.zip" webpage="http://panlex.org/" />
@demidovakatya,
Thank you. I met the same problem.
Downloading panlex_lite should work fine now
Again not working.
I don't have bandwidth to test this. Our nltk_data page points at the April 1 version, which was not touched when the May 1 version was added recently.
@kamholz: would you mind doing the following to check if it still works please? python -m nltk.downloader panlex_lite
Sorry this keeps happening. It's hard to debug, because I often can't reproduce the reported errors. In this case, when I run python -m nltk.downloader panlex_lite, it doesn't report any error and unzips. However, the MD5 sum at https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml is incorrect. I don't know how that happened, since the file has not changed. The entry should read as follows:
<package author="David Kamholz" checksum="3156099b9acb623725d63c727fd8591d" id="panlex_lite" license="CC0 1.0 Universal" name="PanLex Lite Corpus" size="2357864277" subdir="corpora" unzip="1" unzipped_size="5993562112" url="https://db.panlex.org/panlex_lite-20170401.zip" webpage="http://panlex.org/" />
I have also updated the URL above (but that shouldn't have made a difference for this issue, since the old one redirects), and the sizes.
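To check a local download against the checksum attribute in the index entry, the MD5 can be computed in chunks so that the multi-gigabyte file never has to fit in memory. This is a small sketch with the standard library only; md5_of_file and matches_index are illustrative names, not NLTK functions.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_index(path, expected_checksum):
    """Compare a file's MD5 with the checksum attribute copied from index.xml."""
    return md5_of_file(path) == expected_checksum.strip().lower()
```

For instance, matches_index(plzip, "3156099b9acb623725d63c727fd8591d") returning False would mean that either the download or the index entry is out of date.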
Thanks for this @kamholz . I've pushed a corrected index file using these checksums.
@clockwiser would you please try again and let us know how you get on?
I tried: python -m nltk.downloader -u https://gist.githubusercontent.com/demidovakatya/61dab385d74065ae825c80496a197980/raw/c6ff7fbf44265c7f8c9e961e3e1158cd812d6af1/index.xml all, as well as other URLs, but every one returns an HTTP 403 Forbidden error. Any suggestions, or a new URL that will work?
@sokhnavor this is caused by #1787
@alvations thanks! I see:
PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA
I used the Windows command prompt and it does not work: wget is not recognized as an internal or external command. I'm pretty new to the command line and the Windows flavor of it. Is there any workaround to get this working from the command prompt? I would really appreciate it.
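Since Python is already installed wherever NLTK runs, the wget and unzip steps can be replaced with a few lines of standard-library Python that behave the same on Windows. This is a sketch, not an official NLTK procedure; fetch_and_unzip is a made-up name, and the only URL used is the one from the commands above.

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path

# Archive URL taken from the wget command quoted above.
NLTK_DATA_ARCHIVE = "https://github.com/nltk/nltk_data/archive/gh-pages.zip"


def fetch_and_unzip(url, dest_dir):
    """Download a zip archive and extract it into dest_dir (hypothetical helper)."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "download.zip"
        # Stream the response to disk instead of holding it in memory.
        with urllib.request.urlopen(url) as response, open(str(archive), "wb") as out:
            shutil.copyfileobj(response, out)
        with zipfile.ZipFile(str(archive)) as zf:
            zf.extractall(str(dest))
            return zf.namelist()
```

Running fetch_and_unzip(NLTK_DATA_ARCHIVE, Path.home() / "nltk_data") from a Python prompt covers the wget and unzip lines; the extracted nltk_data-gh-pages folder still has to be moved into place, as in the mv step.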