NLTK: how to download the panlex_lite corpus package in NLTK in Python

Created on 17 Jan 2016  ·  30 Comments  ·  Source: nltk/nltk

I am able to download all the packages except panlex_lite. How do I download it?

All 30 comments

Try within Python:

>>> import nltk
>>> nltk.download('panlex_lite')

Or on command line:

$ python -m nltk.downloader panlex_lite

Note: It might take some time to download the data.

Note that you need to install the development version of NLTK in order to do this.

Use this URL [http://dev.panlex.org/db/panlex_lite.zip] to download it manually.

Wait for NLTK v3.2, and please see the extensive discussion at https://github.com/nltk/nltk/issues/1283

Hi, once panlex_lite is downloaded manually, where should I put it within nltk_data?
Thanks

Put it in corpora. My complete path is /usr/local/share/nltk_data/corpora
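
For anyone scripting this step, here is a minimal sketch that copies a manually downloaded zip into the corpora folder of an nltk_data directory. The function name place_corpus_zip is hypothetical, and the paths in the usage comment are only examples; printing nltk.data.path shows the directories NLTK actually searches on your machine.

```python
import shutil
from pathlib import Path


def place_corpus_zip(zip_path, nltk_data_dir):
    """Copy a manually downloaded corpus zip into <nltk_data_dir>/corpora.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    corpora_dir = Path(nltk_data_dir) / "corpora"
    corpora_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = corpora_dir / Path(zip_path).name
    shutil.copy2(zip_path, target)  # copy2 also preserves the modification time
    return target


# Example usage (both paths are assumptions; adjust them to your setup):
# place_corpus_zip("panlex_lite.zip", "/usr/local/share/nltk_data")
```

Writing under /usr/local/share usually needs elevated permissions; a per-user directory such as ~/nltk_data avoids that.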

Hi,
Does anyone have any idea why it's downloading so slowly? At my end it's showing 20 hours. The rest of the packages have been downloaded.

@deepp I uploaded this zip file to Baidu Cloud. The link and password follow.
link: https://pan.baidu.com/s/1kVavU7d password: 7b5n

@XiaoZYang Thanks for the response. I downloaded the file manually from the link in your previous response. Thanks a ton.

@deepp My pleasure. Glad to help.

You can download panlex_lite.zip from https://dev.panlex.org/db/ and put it in "/nltk_data/corpora/".

While downloading panlex with nltk downloader, my whole system just froze - even the caps lock indicator light on my keyboard wasn't working anymore. I've restarted my computer, tried again and the same thing happened.
Is there a logfile anywhere to provide you with more info on this?
FYI: I'm running idle3/nltk3/python 3.5.2 on KDE Neon on an AMD64 machine.

I'll just download the zip-file manually.

What should I do after downloading the panlex_lite zip so that nltk.download('all') skips panlex_lite and downloads the rest of the packages? I unzipped the archive, but when I try to download the remaining packages it still shows "downloading panlex_lite...". Help, please.

@eupherntech same issue.

I am also facing the same issue.

BTW, downloaded panlex_lite data manually.

@eupherntech @stevealbertwong You could use nltk.download('all', halt_on_error=False), so that after a package fails to download you are asked whether you want to retry it. Press n and the rest of the packages should be downloaded.

Same issue here; even a manual download takes up to 8 hours. Please do something about it!

Based on the file mentioned above, it looks like it's a 2.2 GB file. So you might just need to hang tight and wait!

One thing you can do in the meantime to get some more information is to look at the filesize and last modified time of the panlex_lite.zip file in nltk_data/corpora/ like so:

$ ls -lh nltk_data/corpora/ | grep panlex_lite
-rw-r--r--     1 username  1607558449   2.1G Mar  4 10:51 panlex_lite.zip
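
The same check can be done from Python, which is handy on systems without ls and grep. This is a rough sketch using only the standard library; describe_download and human_size are hypothetical helper names, and the path in the usage comment is an example.

```python
import os
import time


def describe_download(path):
    """Return (size_in_bytes, seconds_since_last_write) for a file."""
    st = os.stat(path)
    return st.st_size, time.time() - st.st_mtime


def human_size(num_bytes):
    """Format a byte count roughly the way `ls -lh` does."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        # Stop at the first unit that fits, or at G as the largest unit.
        if size < 1024 or unit == "G":
            return f"{size:.1f}{unit}"
        size /= 1024


# Example usage (the path is an assumption; adjust it to your nltk_data):
# size, age = describe_download("nltk_data/corpora/panlex_lite.zip")
# print(human_size(size), f"last written {age:.0f}s ago")
```

If the size keeps growing and the last write was recent, the download is still progressing, just slowly.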

I'm having the same issue. I have panlex_lite successfully downloaded (from http://dev.panlex.org/db/panlex_lite.zip) and located in the correct directory, but when nltk.download() is called it tries to download it again. Is there some other file that needs to be updated to show that the corpus is in place?

Please Note: I would try @cimarie 's suggestion, but the problem is that I'm trying to use tox to test a branch before submitting a pull request, and tox calls nltk.download internally, so I don't think I have the ability to include those options.

I've updated the checksums, so please try again

@stevenbird Which checksums?

Anyway, it does not appear to have worked. nltk.download('all') still tries to download panlex_lite, even though I have put the file from the above link in my ~/nltk_data/corpora folder.

Also of note, the downloader tries to download panlex_swadesh every time (although this is a much shorter download than panlex_lite). I noticed panlex_swadesh.zip is in the corpora folder, and attempting to unzip it manually gives

Arthurs-MacBook-Pro:corpora aetilley$ unzip panlex_swadesh.zip
Archive: panlex_swadesh.zip
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of panlex_swadesh.zip or
panlex_swadesh.zip.zip, and cannot find panlex_swadesh.zip.ZIP, period.
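
An error like the one above typically means the file is incomplete or is not a zip archive at all, for example after an interrupted download. A small sketch for checking an archive from Python with the standard zipfile module; check_zip is a hypothetical helper name.

```python
import zipfile


def check_zip(path):
    """Return 'ok', 'not a zip', or the name of the first corrupt member."""
    # is_zipfile looks for the end-of-central-directory record, which a
    # truncated download normally lacks.
    if not zipfile.is_zipfile(path):
        return "not a zip"
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # reads every member and verifies its CRC
    return "ok" if bad is None else bad


# Example usage (the filename is taken from the output above):
# print(check_zip("panlex_swadesh.zip"))
```

If the result is anything other than 'ok', deleting the file and downloading it again is probably the simplest fix.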

@aetilley – the checksums are published on this page – may need to "view source".

They are from this file: https://dev.panlex.org/db/panlex_lite-20170401.zip

Unfortunately I don't have the bandwidth to download it.
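
To compare a local file against the published values, something like the following could work. It is a sketch that assumes the index lists an MD5 checksum and a size in bytes for each package, which is worth confirming by viewing the page source; md5_and_size is a hypothetical helper name.

```python
import hashlib
import os


def md5_and_size(path, chunk_size=1 << 20):
    """Return (md5_hex_digest, size_in_bytes) for a file.

    The file is read in chunks so a multi-gigabyte zip does not have to
    fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


# Example usage (the path is an assumption; compare the result by eye with
# the checksum and size shown for panlex_lite in the index):
# print(md5_and_size(os.path.expanduser("~/nltk_data/corpora/panlex_lite.zip")))
```

If either value differs from the index, the downloader would plausibly treat the local copy as out of date and fetch it again.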

There are two things you might try. Maybe you have already done the first, in which case the second might be worth a shot.

  1. sudo python -m nltk.downloader panlex_lite
  2. cd PATH_TO_NLTK_DATA; wget https://dev.panlex.org/db/panlex_lite-20170401.zip; unzip panlex_lite-20170401.zip

@stevenbird

I'm afraid that after running both of these (both successfully), nltk.download('all') still can't see panlex_lite.

Again, the main problem here is that it makes it difficult to use tox.

So am I the only one having this problem?

Is nltk.download('all') the main cause of these problems? If so, then I think nltk/nltk_data#69 would be something to consider.

Otherwise, the workaround is something like:

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed'  # Trick the index into treating panlex_lite as already installed.
>>> dler.download('all')

@alvations

More specifically, the problem is that nltk.download('all') correctly skips over all other corpora that I already have, but for some reason tries to get panlex_lite each time.

Also, tox calls nltk.download('all'), so it's difficult to test locally before making a pull request.

Hopefully, nltk/nltk_data#75 would resolve some of the issues. And after that's merged, users should be able to do nltk.download('all-nltk') instead of nltk.download('all') if they don't want to wait to download the large panlex_lite file.

@alvations

And what will tox call?

Again, I'm happy to download a large file once, but the downloader doesn't seem to see that I already have it, so it tries to download it every time.

And again, if I'm the only person having this problem, then maybe it's not a problem, but I'm baffled.

@aetilley: is this still happening? I think it should be fixed now that we've dropped panlex_lite from the NLTK corpus collection.

@stevenbird, @alvations

Yes, tox appears to be working for me now. Sorry, I didn't catch that you had fixed that.
