NLTK: Failed to download NLTK data: HTTP Error 405 / 403

Created on 26 Jul 2017  ·  47 Comments  ·  Source: nltk/nltk

>>> nltk.download("all")
[nltk_data] Error loading all: HTTP Error 405: Not allowed.

>>> sys.version_info
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)

Also, I tried to visit https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/cmudict.zip and got the same HTTP 405 error.

The same problem is reported on Stack Overflow: https://stackoverflow.com/questions/45318066/getting-405-while-trying-to-download-nltk-dta

Any comments would be appreciated.

Labels: admin, bug, corpus, inactive

Most helpful comment

@plaihonen you should be able to use this alternative index by doing something like python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt

All 47 comments

It seems like GitHub is down or blocking access to the raw content on the repo.

Meanwhile the temporary solution is something like this:

PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA

Downloading the gh-pages.zip and replacing the nltk_data directory is the working solution for now.

Until we find another channel to distribute nltk_data, please use the above solution.
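
If NLTK then complains that it cannot find the data, a minimal sketch for pointing it at the manually unpacked copy is below; depending on how the mv above played out, the packages end up under /home/username/nltk_data/packages or /home/username/nltk_data/nltk_data-gh-pages/packages, so adjust the path accordingly:

import nltk

# Tell NLTK where the manually downloaded data lives; the packages/ folder of
# the unzipped gh-pages archive mirrors the usual nltk_data layout.
nltk.data.path.append("/home/username/nltk_data/nltk_data-gh-pages/packages")

# Sanity check: this should now resolve locally, without hitting the network.
print(nltk.data.find("corpora/stopwords"))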


~Strangely, it only seems to affect the nltk user account. It works fine on the fork: https://raw.githubusercontent.com/alvations/nltk_data/gh-pages/index.xml~

~Doing this would work too:~

@alvations Thank you very much!

Is there any alternative for the command line downloads such as this?
python -m nltk.downloader -d ./nltk_data punkt

@plaihonen you should be able to use this alternative index by doing something like python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt

@rvause Works perfectly. Thank you!

+1. This was a several-hour surprise this morning. We went with completely bypassing the NLTK download for now.

GitHub is currently blocking access because "a user is consuming a very large amount of bandwidth requesting files". They have also suggested that we should look at a different way of distributing data packages, e.g. S3.

Even with the alternate index, is anyone finding that some packages still don't work?

Specifically, for me, the stopwords package gives me a 405; the others (brown, wordnet, punkt, etc.) do not.

Yes, I'm not able to download the NLTK stopwords either. I get the 405 error when I run python -m nltk.downloader -u http://nltk.github.com/nltk_data/

Hey, I am trying to run python -m nltk.downloader stopwords, but I am getting a 405 error. Can anyone point me in the right direction?

@dfridman1 @prakruthi-karuna read the issue above. The workaround is:

python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj all

We have a few projects using this in our CI system. Rather than having to update all of them with the -u parameter, is there another way we can specify that, maybe via an environment variable or config file?

@alvations it seems your solution no longer works, as the forked version is also now forbidden. Is anyone currently in contact with GitHub support about this?

>>> import nltk
>>> dler = nltk.downloader.Downloader('https://pastebin.com/raw/D3TBY4Mj')
>>> dler.download('punkt')
[nltk_data] Downloading package punkt to /home/zeryx/nltk_data...
[nltk_data] Error downloading u'punkt' from
[nltk_data]     <https://raw.githubusercontent.com/alvations/nltk_data
[nltk_data]     /gh-pages/packages/tokenizers/punkt.zip>:   HTTP Error
[nltk_data]     403: Forbidden.
False

I've just opened a ticket with them via the contact page.

Looks like GitHub is aware of the issue and is working on it. Here's what they said to me:

Sorry for the trouble. We've had to block requests to raw.githubusercontent.com URLs for the nltk/nltk_data repo and its forks because excessive usage was causing issues with the GitHub service. We're working to get the issue resolved, but unfortunately we cannot allow those requests at this time.

Yeah, I've just received this too:

Hi Liling,
I work on the Support team at GitHub, and I wanted to notify you that we've had to temporarily block access to files being served from raw.githubusercontent.com URLs for the alvations/nltk_data repo. Currently, a user is consuming a very large amount of bandwidth requesting files from that repo, and our only option at the moment is to block all requests. We're actively working on ways to mitigate the problem, and we'll follow up with you when we have an update. Please let us know if you have any questions.
Cheers,
Shawna

@ewan-klein @stevenbird I think we need a new way to distribute data, but that'll require some rework of nltk/downloader.py.

Some suggestions:

Seemingly, we have no choice but to change the data distribution channel:

Hi Liling,
Wanted to follow up on this with some additional information. We've been discussing the issue internally, and it's highly likely that we will not be restoring raw access to repos in the nltk/nltk_data fork network for the foreseeable future. The issue is that there are a number of machines that are calling nltk.download() at a very high frequency. We cannot restore raw access until that activity stops. Feel free to share this message with the nltk community. We're hoping that whoever is doing this will be alerted to the problem, and stop whatever process is doing this.
Cheers,
Jamie

One would think they could just block those IPs specifically. But maybe there is more to it than that.

I do have a docker image that downloads nltk_data, but I wasn't rebuilding it frequently. I hope I wasn't one of those high traffic users...

Is there an installation process that does not rely on github?

Maybe someone configured their scripts on AWS wrongly. @everyone please help check your instances while we find an alternative for distributing the data.

Hi Liling,
We can't share specific numbers, however the requests are coming from a large number of AWS instances. We suspect it could be a script or build process gone awry. We don't know much beyond that.
Cheers,
Jamie

Well that's a relief, I don't use AWS.

:relieved:

Code-wise, maybe we also have to change how frequently the same package gets re-downloaded by the NLTK downloader.py. Otherwise, no matter which distribution channel we migrate to, the same service disruption will happen.
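
For what it's worth, a minimal sketch of such a guard for build/CI scripts, assuming the resource paths below are the ones a given project actually needs (they are only illustrative):

import nltk

# Only download a package when it cannot already be found locally, so repeated
# builds don't re-fetch the same data on every run.
def ensure_nltk_resource(resource_path, package_id):
    # resource_path: path used by nltk.data.find(), e.g. "tokenizers/punkt"
    # package_id: id passed to nltk.download(), e.g. "punkt"
    try:
        nltk.data.find(resource_path)   # already installed, no network call
    except LookupError:
        nltk.download(package_id)       # fetch only when missing

ensure_nltk_resource("tokenizers/punkt", "punkt")
ensure_nltk_resource("corpora/stopwords", "stopwords")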

Maybe something torrent-based would work?

Not sure what the license is like, but you could make them public on s3: https://aws.amazon.com/datasets/

@alvations it seems only the gh-pages zip download works for now, and the packages need to be moved under the /home/username/nltk_data/ folder.

export PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages $PATH_TO_NLTK_DATA
# additionally, move the packages up into the nltk_data directory itself
mv $PATH_TO_NLTK_DATA/nltk_data-gh-pages/packages/* $PATH_TO_NLTK_DATA/

Do we have a temporary work-around yet?

@darshanlol @alvations mentioned a solution. If you are trying to build a Docker image, the following worked for me:

ENV PATH_TO_NLTK_DATA $HOME/nltk_data/
RUN apt-get -qq update
RUN apt-get -qq -y install wget
RUN wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
RUN apt-get -y install unzip
RUN unzip gh-pages.zip -d $PATH_TO_NLTK_DATA
# move the packages up into the nltk_data directory itself
RUN mv $PATH_TO_NLTK_DATA/nltk_data-gh-pages/packages/* $PATH_TO_NLTK_DATA/

I tried to change the default URL in nltk/downloader.py, but the issue still exists.

The suggested workaround is no longer working:

python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj all

Currently this is the only working solution:

PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA

As @alvations said, this is the only working solution.

PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA

But even after downloading everything I was still facing issues, because my NLTK downloader was not able to detect the downloaded packages; to fix that, you may have to manually change the download directory value via the command line.

This page has the command that I used to configure the NLTK data packages.

Click on the above link for the answer.

Here are a couple of proposals to resolve this problem, after reading around and looking at alternatives.

Make corpora pip-installable

  • First, we would change things so that all nltk_data packages are pip-installable (so every new environment requires a pip install, and we no longer rely on a physical directory).
  • We would also need to keep some sort of index for the downloader to fetch and to track versions.
  • Then we would need some sort of overhaul of the code: downloader.py and all the related corpus reader interfaces.

  • Possibly, pip/PyPI-side limits could stop the rogue users/machines making high-frequency requests. (A rough, hypothetical sketch of such a data package follows below.)
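
To make the pip idea concrete, here is a minimal, purely hypothetical sketch of how a single corpus (punkt, in this example) could be shipped as a pip package; the package name, layout, and registration idea are assumptions, not an existing NLTK mechanism:

# setup.py for a hypothetical "nltk-data-punkt" distribution (illustrative only)
from setuptools import setup, find_packages

setup(
    name="nltk-data-punkt",                # hypothetical package name
    version="2017.7.0",                    # data snapshots could be versioned
    packages=find_packages(),              # contains an nltk_data_punkt/ package
    package_data={"nltk_data_punkt": ["tokenizers/punkt/*"]},
    include_package_data=True,
)

# nltk_data_punkt/__init__.py could then expose its bundled data directory so
# that downloader.py / the corpus readers can add it to nltk.data.path:
#
#     import os, nltk
#     DATA_DIR = os.path.dirname(__file__)
#     nltk.data.path.append(DATA_DIR)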

Hosting the data on S3 / Zenodo or some private host

After setting up the individual files on the web host, this would simply require repointing the links in index.xml to the appropriate URLs.

But if the traffic remains high due to some installation/automation script gone wrong, we would just end up shifting the burden from one service provider to another.


Any other suggestions?
Any brave soul who wants to take this on?

@harigovind511, yeah, you have to either put the downloaded nltk_data folder in one of the standard locations where NLTK knows to look for it, or append to nltk.data.path to tell it where to look. The automatic downloader just looks for a standard location.

Rate limiting, or otherwise dealing with the rogue machines, is probably necessary to keep this from rearing its ugly head again. My vote would be for pip, unless there's some problem (or taboo) with large packages on PyPI?

Using pip would also take care of the manual nltk.download() step and in-code package management.

The files seem to be back up? It still seems wise to continue seeking alternative distribution mechanisms. In my own organization, we plan to move to hosting the data internally and to check in quarterly.

I'd like to understand what $PATH_TO_NLTK_DATA does. Is it configuring an alternate local download URL for where NLTK gets its data?

I'd like to set up a local cache of NLTK data, so I was wondering whether setting this tells NLTK to work offline?

Since the root of the problem is bandwidth abuse, it seems a poor idea to recommend manual fetching of the entire nltk_data tree as a work-around. How about you show us how resource ids map to URLs, @alvations, so I can wget just the punkt bundle, for example?

The long-term solution, I believe, is to make it less trivial for beginning users to fetch the entire data bundle (638 MB compressed, when I checked). Instead of arranging (and paying for) more bandwidth to waste on pointless downloads, stop providing "all" as a download option; the documentation should instead show the inattentive scripter how to download the specific resource(s) they need. And in the meantime, get out of the habit of writing nltk.download("all") (or equivalent) as sample or recommended usage, on Stack Overflow (I'm looking at you, @alvations) and in the downloader docstrings. (For exploring the nltk, nltk.download("book"), not "all", is just as useful and much smaller.)

At present it is difficult to figure out which resource needs to be downloaded; if I install the nltk and try out nltk.pos_tag(["hello", "friend"]), there's no way to map the error message to a resource ID that I can pass to nltk.download(<resource id>). Downloading everything is the obvious work-around in such cases. If nltk.data.load() or nltk.data.find() can be patched to look up the resource id in such cases, I think you'll see your usage on nltk_data go down significantly over the long term.

@zxiiro $PATH_TO_NLTK_DATA has no meaning to NLTK; it's just a variable in the sample script. The environment variable $NLTK_DATA does have special meaning. See http://www.nltk.org/data.html, where all the options are explained.
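
For a local/offline cache, a minimal sketch would be to point NLTK_DATA at a directory that already contains the unpacked packages before importing nltk (the /srv/nltk_data path is just an example for an existing local copy):

import os

# Point NLTK at a pre-populated local cache before importing it;
# nltk.data reads NLTK_DATA when building its search path.
os.environ["NLTK_DATA"] = "/srv/nltk_data"   # example path, adjust to your mirror

import nltk
from nltk.corpus import stopwords            # resolved from the local cache, no download

print(stopwords.words("english")[:5])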

@alexisdimi agreed on the nltk.download('all'). Sorry that was such an old answer from my early days. I should advise against it. I've changed the SO answer to nltk.download('popular') instead: https://stackoverflow.com/questions/22211525/how-do-i-download-nltk-data

One of the problems with wget-ing a package directly is that it still relies on the raw content on GitHub. During the downtime, the https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip link was also leading to the 403/405 error.

Thus the workaround was to download the whole git tree; in retrospect, that might not have been a good idea.

Looks like the lockout has been lifted, that's great! Now I hope there are some tickets that will work toward preventing similar problems in the future (maybe along the lines I suggested, maybe not).

(Should _this_ issue be marked "Closed", by the way, now that downloads work again?)

@alexisdimi putting up warnings that suggest users download the appropriate models is a good idea.

For those running NLTK in a CI environment: I'd like to propose GH-1795, which allows us to specify an alternative URL for downloading. The idea here is that one can set up a local copy of nltk_data on a web server (or even python -m http.server) and then have a global variable that can override the download URL.

This is so that we can override the URL from a CI system like Jenkins without modifying each project's local command calls to include -u.
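
In the meantime, a minimal sketch of that setup using the existing server_index_url / -u mechanism (the localhost URL and the assumption that the mirror's index.xml has been rewritten to point at the mirror are mine, not NLTK defaults):

import nltk.downloader

# Assumes a local mirror of nltk_data is being served, for example with
# `python -m http.server 8000` run from inside the mirror directory, and that
# the mirror's index.xml references the mirror's own package URLs rather than
# raw.githubusercontent.com.
dler = nltk.downloader.Downloader("http://localhost:8000/index.xml")
dler.download("punkt")        # fetched from the local mirror instead of GitHub
dler.download("stopwords")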

Question to GitHub regarding data distribution via repository releases and pip installation:

Thank you Jamie for the support!

We're looking for alternatives to host the nltk_data, and one of them is to host the data as repository releases, the way spaCy does: https://github.com/explosion/spacy-models/releases

Could we check with you whether the same block would be applied if similar high-frequency requests were made to repository releases? Or are repository releases treated differently from the raw content on GitHub?

Regards,
Liling

Some updates from GitHub's side:

Hi Liling,

Using Releases just moves the requests to a different part of our infrastructure. If that volume of bandwidth were to start up again, we would still have to block those requests, even if they were to Releases.

We've tried to think of some ways that the data packages could remain on GitHub, but there's honestly not a good solution. We're just not set up to be a high volume CDN.

Cheers,
Jamie

@owaaa / @zxiiro +1 on hosting internally for CI. We're doing this now, and the advantage for EC2/S3 users is that you get to put the data (or the subset of it you need) close to where you want to build the machines. If you're across availability zones, you can just replicate buckets where you need and be more robust to what's going on outside AWS.

@alvations I quite like the _data/model as package_ idea in spaCy, but one of the consequences is that if you use virtualenv, your environment directories can balloon in size, since your packages live there. Of course, this buys you completely isolated and auditable data/model versions, which is valuable for a project with frequent model updates like spaCy, but it's not a free lunch 😕
