NLTK: Failed to download NLTK data: HTTP Error 405 / 403

Created on 26 Jul 2017  ·  47 Comments  ·  Source: nltk/nltk

>>> nltk.download("all")
[nltk_data] Error loading all: HTTP Error 405: Not allowed.

>>> sys.version_info
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)

Also, I tried to visit https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/cmudict.zip and got the same HTTP 405 error.

The same problem is reported on Stack Overflow: https://stackoverflow.com/questions/45318066/getting-405-while-trying-to-download-nltk-dta

Any comments would be appreciated.

Labels: admin, bug, corpus, inactive

Most helpful comment

@plaihonen you should be able to use this alternative index by doing something like python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt

All 47 comments

It seems like GitHub is down or blocking access to the raw content on the repo.

Meanwhile the temporary solution is something like this:

PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA

Downloading the gh-pages.zip and replacing the nltk_data directory is the working solution for now.

Until we find another channel to distribute nltk_data, please use the above solution.
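
If NLTK then complains that it cannot find the data, a minimal sketch for pointing it at the manually unpacked copy is below; depending on how the mv above played out, the packages end up under /home/username/nltk_data/packages or /home/username/nltk_data/nltk_data-gh-pages/packages, so adjust the path accordingly:

import nltk

# Tell NLTK where the manually downloaded data lives; the packages/ folder of
# the unzipped gh-pages archive mirrors the usual nltk_data layout.
nltk.data.path.append("/home/username/nltk_data/nltk_data-gh-pages/packages")

# Sanity check: this should now resolve locally, without hitting the network.
print(nltk.data.find("corpora/stopwords"))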


~Strangely, it only seems to affect the nltk user account. It works fine on the fork: https://raw.githubusercontent.com/alvations/nltk_data/gh-pages/index.xml~

~Doing this would work too:~

@alvations Thank you very much!

Is there any alternative for the command line downloads such as this?
python -m nltk.downloader -d ./nltk_data punkt

@plaihonen you should be able to use this alternative index by doing something like python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt

@rvause Works perfectly. Thank you!

+1. This was a several-hour surprise this morning. We went with completely bypassing the NLTK download for now.

GitHub is currently blocking access because "a user is consuming a very large amount of bandwidth requesting files". They have also suggested that we should look at a different way of distributing data packages, e.g. S3.

Even with the alternate index, is anyone finding that some packages still don't work?

Specifically, for me, the stopwords package gives me a 405; the others (brown, wordnet, punkt, etc.) do not.

Yes, I'm not able to download the NLTK stopwords either. I get the 405 error when I run python -m nltk.downloader -u http://nltk.github.com/nltk_data/

Hey, I am trying to run python -m nltk.downloader stopwords, but I am getting a 405 error. Can anyone point me in the right direction?

@dfridman1 @prakruthi-karuna read the issue above. The workaround is:

python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj all

We have a few projects using this in our CI system. Rather than having to update all of them with the -u parameter, is there another way we can specify that, maybe via an environment variable or config file?

@alvations it seems your solution no longer works, as the forked version is also now forbidden. Is anyone currently in contact with GitHub support about this?

>>> import nltk
>>> dler = nltk.downloader.Downloader('https://pastebin.com/raw/D3TBY4Mj')
>>> dler.download('punkt')
[nltk_data] Downloading package punkt to /home/zeryx/nltk_data...
[nltk_data] Error downloading u'punkt' from
[nltk_data]     <https://raw.githubusercontent.com/alvations/nltk_data
[nltk_data]     /gh-pages/packages/tokenizers/punkt.zip>:   HTTP Error
[nltk_data]     403: Forbidden.
False

I've just opened a ticket with them via the contact page.

Looks like GitHub is aware of the issue and is working on it. Here's what they said to me:

Sorry for the trouble. We've had to block requests to raw.githubusercontent.com URLs for the nltk/nltk_data repo and its forks because excessive usage was causing issues with the GitHub service. We're working to get the issue resolved, but unfortunately we cannot allow those requests at this time.

Yeah, I've just received this too:

Hi Liling,
I work on the Support team at GitHub, and I wanted to notify you that we've had to temporarily block access to files being served from raw.githubusercontent.com URLs for the alvations/nltk_data repo. Currently, a user is consuming a very large amount of bandwidth requesting files from that repo, and our only option at the moment is to block all requests. We're actively working on ways to mitigate the problem, and we'll follow up with you when we have an update. Please let us know if you have any questions.
Cheers,
Shawna

@ewan-klein @stevenbird I think we need a new way to distribute data, but that'll require some rework of nltk/downloader.py.

Some suggestions:

Seemingly, we have no choice but to change the data distribution channel:

Hi Liling,
Wanted to follow up on this with some additional information. We've been discussing the issue internally, and it's highly likely that we will not be restoring raw access to repos in the nltk/nltk_data fork network for the foreseeable future. The issue is that there are a number of machines that are calling nltk.download() at a very high frequency. We cannot restore raw access until that activity stops. Feel free to share this message with the nltk community. We're hoping that whoever is doing this will be alerted to the problem, and stop whatever process is doing this.
Cheers,
Jamie

One would think they could just block those IPs specifically. But maybe there is more to it than that.

I do have a docker image that downloads nltk_data, but I wasn't rebuilding it frequently. I hope I wasn't one of those high traffic users...

Is there an installation process that does not rely on github?

Maybe someone configured their scripts on AWS wrongly. @everyone please help check your instances while we find an alternative for distributing the data.

Hi Liling,
We can't share specific numbers, however the requests are coming from a large number of AWS instances. We suspect it could be a script or build process gone awry. We don't know much beyond that.
Cheers,
Jamie

Well that's a relief, I don't use AWS.

:relieved:

Code-wise, maybe we also have to change how frequently the same package gets re-downloaded by the NLTK downloader.py. Otherwise, no matter which distribution channel we migrate to, the same service disruption will happen.
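
For what it's worth, a minimal sketch of such a guard for build/CI scripts, assuming the resource paths below are the ones a given project actually needs (they are only illustrative):

import nltk

# Only download a package when it cannot already be found locally, so repeated
# builds don't re-fetch the same data on every run.
def ensure_nltk_resource(resource_path, package_id):
    # resource_path: path used by nltk.data.find(), e.g. "tokenizers/punkt"
    # package_id: id passed to nltk.download(), e.g. "punkt"
    try:
        nltk.data.find(resource_path)   # already installed, no network call
    except LookupError:
        nltk.download(package_id)       # fetch only when missing

ensure_nltk_resource("tokenizers/punkt", "punkt")
ensure_nltk_resource("corpora/stopwords", "stopwords")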

Maybe something torrent-based would work?

Not sure what the license is like, but you could make them public on s3: https://aws.amazon.com/datasets/

@alvations it seems only the gh-pages zip download works for now, and the packages need to be moved under the /home/username/nltk_data/ folder.

export PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages $PATH_TO_NLTK_DATA
# additionally, move the packages up into the nltk_data directory itself
mv $PATH_TO_NLTK_DATA/nltk_data-gh-pages/packages/* $PATH_TO_NLTK_DATA/

Do we have a temporary work-around yet?

@darshanlol @alvations mentioned a solution. If you are trying to build a Docker image, the following worked for me:

ENV PATH_TO_NLTK_DATA $HOME/nltk_data/
RUN apt-get -qq update
RUN apt-get -qq -y install wget
RUN wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
RUN apt-get -y install unzip
RUN unzip gh-pages.zip -d $PATH_TO_NLTK_DATA
# move the packages up into the nltk_data directory itself
RUN mv $PATH_TO_NLTK_DATA/nltk_data-gh-pages/packages/* $PATH_TO_NLTK_DATA/

I tried to change the default URL in nltk/downloader.py, but the issue still exists.

The suggested workaround is no longer working:

python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj all

Currently this is the only working solution:

PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA

As @alvations said, this is the only working solution.

PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA

But even after downloading everything I was still facing issues, because my NLTK downloader was not able to detect the downloaded packages; to fix that, you may have to manually change the download directory value via the command line.

This page has the command that I used to configure the NLTK data packages.

Click on the above link for the answer.

Here are a couple of proposals to resolve this problem, after reading around and looking at alternatives.

Make corpora pip-installable

  • First, we would change things so that all nltk_data packages are pip-installable (so every new environment requires a pip install, and we no longer rely on a physical directory).
  • We would also need to keep some sort of index for the downloader to fetch and to track versions.
  • Then we would need some sort of overhaul of the code: downloader.py and all the related corpus reader interfaces.

  • Possibly, pip/PyPI-side limits could stop the rogue users/machines making high-frequency requests. (A rough, hypothetical sketch of such a data package follows below.)
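
To make the pip idea concrete, here is a minimal, purely hypothetical sketch of how a single corpus (punkt, in this example) could be shipped as a pip package; the package name, layout, and registration idea are assumptions, not an existing NLTK mechanism:

# setup.py for a hypothetical "nltk-data-punkt" distribution (illustrative only)
from setuptools import setup, find_packages

setup(
    name="nltk-data-punkt",                # hypothetical package name
    version="2017.7.0",                    # data snapshots could be versioned
    packages=find_packages(),              # contains an nltk_data_punkt/ package
    package_data={"nltk_data_punkt": ["tokenizers/punkt/*"]},
    include_package_data=True,
)

# nltk_data_punkt/__init__.py could then expose its bundled data directory so
# that downloader.py / the corpus readers can add it to nltk.data.path:
#
#     import os, nltk
#     DATA_DIR = os.path.dirname(__file__)
#     nltk.data.path.append(DATA_DIR)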

Hosting the data on S3 / Zenodo or some private host

After setting up the individual files on the web host, this would simply require repointing the links in index.xml to the appropriate URLs.

But if the traffic remains high due to some installation/automation script gone wrong, we would just end up shifting the burden from one service provider to another.


Any other suggestions?
Any brave soul who wants to take this on?

@harigovind511, yeah, you have to either put the downloaded nltk_data folder in one of the standard locations where NLTK knows to look for it, or append to nltk.data.path to tell it where to look. The automatic downloader just looks for a standard location.

Rate limiting, or otherwise dealing with the rogue machines, is probably necessary to keep this from rearing its ugly head again. My vote would be for pip, unless there's some problem (or taboo) with large packages on PyPI?

Using pip would also take care of the manual nltk.download() step and in-code package management.

The files seem to be back up? It still seems wise to continue seeking alternative distribution mechanisms. In my own organization, we plan to move to hosting the data internally and to check in quarterly.

I'd like to understand what $PATH_TO_NLTK_DATA does. Is it configuring an alternate local download URL for where NLTK gets its data?

I'd like to set up a local cache of NLTK data, so I was wondering whether setting this tells NLTK to work offline?

Since the root of the problem is bandwidth abuse, it seems a poor idea to recommend manual fetching of the entire nltk_data tree as a work-around. How about you show us how resource ids map to URLs, @alvations, so I can wget just the punkt bundle, for example?

The long-term solution, I believe, is to make it less trivial for beginning users to fetch the entire data bundle (638 MB compressed, when I checked). Instead of arranging (and paying for) more bandwidth to waste on pointless downloads, stop providing "all" as a download option; the documentation should instead show the inattentive scripter how to download the specific resource(s) they need. And in the meantime, get out of the habit of writing nltk.download("all") (or equivalent) as sample or recommended usage, on Stack Overflow (I'm looking at you, @alvations) and in the downloader docstrings. (For exploring the nltk, nltk.download("book"), not "all", is just as useful and much smaller.)

At present it is difficult to figure out which resource needs to be downloaded; if I install the nltk and try out nltk.pos_tag(["hello", "friend"]), there's no way to map the error message to a resource ID that I can pass to nltk.download(<resource id>). Downloading everything is the obvious work-around in such cases. If nltk.data.load() or nltk.data.find() can be patched to look up the resource id in such cases, I think you'll see your usage on nltk_data go down significantly over the long term.

@zxiiro $PATH_TO_NLTK_DATA has no meaning to NLTK; it's just a variable in the sample script. The environment variable $NLTK_DATA does have special meaning. See http://www.nltk.org/data.html, where all the options are explained.
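
For a local/offline cache, a minimal sketch would be to point NLTK_DATA at a directory that already contains the unpacked packages before importing nltk (the /srv/nltk_data path is just an example for an existing local copy):

import os

# Point NLTK at a pre-populated local cache before importing it;
# nltk.data reads NLTK_DATA when building its search path.
os.environ["NLTK_DATA"] = "/srv/nltk_data"   # example path, adjust to your mirror

import nltk
from nltk.corpus import stopwords            # resolved from the local cache, no download

print(stopwords.words("english")[:5])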

@alexisdimi agreed on the nltk.download('all'). Sorry that was such an old answer from my early days. I should advise against it. I've changed the SO answer to nltk.download('popular') instead: https://stackoverflow.com/questions/22211525/how-do-i-download-nltk-data

One of the problems with wget-ing a package directly is that it still relies on the raw content on GitHub. During the downtime, the https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip link was also leading to the 403/405 error.

Thus the workaround was to download the whole git tree; in retrospect, that might not have been a good idea.

Looks like the lockout has been lifted, that's great! Now I hope there are some tickets that will work toward preventing similar problems in the future (maybe along the lines I suggested, maybe not).

(Should _this_ issue be marked "Closed", by the way, now that downloads work again?)

@alexisdimi putting up warnings that suggest users download the appropriate models is a good idea.

For those running NLTK in a CI environment: I'd like to propose GH-1795, which allows us to specify an alternative URL for downloading. The idea here is that one can set up a local copy of nltk_data on a web server (or even python -m http.server) and then have a global variable that can override the download URL.

This is so that we can override the URL from a CI system like Jenkins without modifying each project's local command calls to include -u.
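
In the meantime, a minimal sketch of that setup using the existing server_index_url / -u mechanism (the localhost URL and the assumption that the mirror's index.xml has been rewritten to point at the mirror are mine, not NLTK defaults):

import nltk.downloader

# Assumes a local mirror of nltk_data is being served, for example with
# `python -m http.server 8000` run from inside the mirror directory, and that
# the mirror's index.xml references the mirror's own package URLs rather than
# raw.githubusercontent.com.
dler = nltk.downloader.Downloader("http://localhost:8000/index.xml")
dler.download("punkt")        # fetched from the local mirror instead of GitHub
dler.download("stopwords")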

Question to GitHub regarding data distribution via repository releases and pip installation:

Thank you Jamie for the support!

We're looking for alternatives to host the nltk_data, and one of them is to host the data as repository releases, the way spaCy does: https://github.com/explosion/spacy-models/releases

Could we check with you whether the same block would be applied if similar high-frequency requests were made to repository releases? Or are repository releases treated differently from the raw content on GitHub?

Regards,
Liling

Some updates from GitHub's side:

Hi Liling,

Using Releases just moves the requests to a different part of our infrastructure. If that volume of bandwidth were to start up again, we would still have to block those requests, even if they were to Releases.

We've tried to think of some ways that the data packages could remain on GitHub, but there's honestly not a good solution. We're just not set up to be a high volume CDN.

Cheers,
Jamie

@owaaa / @zxiiro +1 on hosting internally for CI. We're doing this now, and the advantage for EC2/S3 users is that you get to put the data (or the subset of it you need) close to where you want to build the machines. If you're across availability zones, you can just replicate buckets where you need and be more robust to what's going on outside AWS.

@alvations I quite like the _data/model as package_ idea in spaCy, but one of the consequences is that if you use virtualenv, your environment directories can balloon in size, since your packages live there. Of course, this buys you completely isolated and auditable data/model versions, which is valuable for a project with frequent model updates like spaCy, but it's not a free lunch 😕
