PyGithub: Downloading large files

Created on 20 Nov 2017  ·  5 Comments  ·  Source: PyGithub/PyGithub

Trying to download a large file with the .get_contents() method raises this error:

{'errors': [{'code': 'too_large', 'field': 'data',
     'resource': 'Blob'}],
     'message': 'This API returns blobs up to 1 MB in size. The requested blob is too large to fetch via the API, but you can use the Git Data API to request blobs up to 100 MB in size.',
     'documentation_url': 'https://developer.github.com/v3/repos/contents/#get-contents'}

Is there a way of detecting this and passing over to another handler that can download the file?

For example, if something like this fails:

contents = repository.get_dir_contents(urllib.parse.quote(server_path), ref=sha)

for content in contents:
    if content.type != 'dir':
        file_content = repository.get_contents(urllib.parse.quote(content.path), ref=sha)

optionally fall back to:

file_content = repository.get_git_blob(content.sha)
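
One way to detect this case is to inspect the error payload itself. The following is only a minimal sketch: it assumes the payload is available as a parsed dict in the shape quoted above (PyGithub exposes the response body on the exception's data attribute), and is_too_large_error is a hypothetical helper name, not part of PyGithub.

```python
def is_too_large_error(data):
    """Return True if an API error payload carries the 'too_large' code."""
    if not isinstance(data, dict):
        return False
    errors = data.get("errors") or []
    return any(
        isinstance(err, dict) and err.get("code") == "too_large"
        for err in errors
    )


# Sample payload shaped like the error quoted above (message shortened).
payload = {
    "errors": [{"code": "too_large", "field": "data", "resource": "Blob"}],
    "message": "This API returns blobs up to 1 MB in size.",
}
print(is_too_large_error(payload))
```

Inside an `except GithubException as exc:` block, calling is_too_large_error(exc.data) could then decide whether falling back to get_git_blob is worth trying.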

Most helpful comment

I had the same problem and ended up doing something along these lines.

If we dump all files from a directory and some are larger than 1 MB, list the directory first:

    file_contents = repo.get_contents(dir_name, ref=branch)

Each file_content then has a sha, so the following can be used to grab the blob of each file:

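from github import GithubException

for file_content in file_contents:
    try:
        if file_content.encoding != 'base64':
            # some error: the encoding is not the expected one
            raise ValueError(f'unexpected encoding: {file_content.encoding}')
        # ok, use file_content here
    except GithubException:
        # if file_content does not have an encoding, it is a large file
        blob = repo.get_git_blob(file_content.sha)
        # do something with blob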
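
An alternative to the try/except above is to look at the size that the directory listing reports for each entry. This is only a sketch under stated assumptions: pick_api is a hypothetical helper, the byte thresholds are an assumed reading of the "1 MB" and "100 MB" limits in the error message, and the paths and sizes are made-up sample data.

```python
ONE_MB = 1024 * 1024        # assumed reading of "1 MB" in the error message
HUNDRED_MB = 100 * ONE_MB   # assumed reading of "100 MB"


def pick_api(size):
    """Choose which API to try for a file of the given size in bytes."""
    if size <= ONE_MB:
        return "contents"
    if size <= HUNDRED_MB:
        return "git_blob"
    return "too_large"


# Made-up sample entries: (path, size in bytes).
for path, size in [("README.md", 2048), ("data/dump.csv", 5 * ONE_MB)]:
    print(path, pick_api(size))
```

With PyGithub, the size would come from file_content.size on each listed entry.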

If path_name refers to a single file that is larger than 1 MB, it needs a try/except block like the following:

        try:
            res = repo.get_contents(path_name, ref=branch)
            # ok, we have the content
        except GithubException:
            return get_blob_content(repo, branch, path_name)

where get_blob_content is something like

def get_blob_content(repo, branch, path_name):
    # first get the branch reference
    ref = repo.get_git_ref(f'heads/{branch}')
    # then get the tree
    tree = repo.get_git_tree(ref.object.sha, recursive='/' in path_name).tree
    # look for path in tree
    sha = [x.sha for x in tree if x.path == path_name]
    if not sha:
        # well, not found..
        return None
    # we have sha
    return repo.get_git_blob(sha[0])

Real code with error-checking is longer, but the idea is here.
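
As one example of that extra checking, the lookup step can also verify the entry type, since get_git_blob only works for blobs. The sketch below uses a stand-in TreeEntry tuple instead of real PyGithub tree elements so that it runs without network access; TreeEntry and find_blob_sha are hypothetical names, and the entries are made-up sample data.

```python
from collections import namedtuple

# Stand-in for a tree element with the three fields the lookup needs.
TreeEntry = namedtuple("TreeEntry", ["path", "type", "sha"])


def find_blob_sha(entries, path_name):
    """Return the sha of the blob at path_name, or None if there is none."""
    for entry in entries:
        if entry.path == path_name and entry.type == "blob":
            return entry.sha
    return None


# Made-up sample tree, shaped like a recursive listing.
sample_tree = [
    TreeEntry("docs", "tree", "1111111"),
    TreeEntry("docs/big.bin", "blob", "2222222"),
]
print(find_blob_sha(sample_tree, "docs/big.bin"))
print(find_blob_sha(sample_tree, "docs"))
```

A directory path yields None here instead of a sha that get_git_blob would reject.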

All 5 comments

I've run into this problem before too. In my case, since I always had the SHA of the blob, I just used get_git_blob instead.

However, get_git_blob doesn't work for any object type besides blob (hence the name). You need to know the type of the object before attempting to call it.

To do the fallback, you need to know two pieces of information:

  1. The type of the object.
  2. The SHA of the object.

If get_contents fails, it doesn't tell you either of these things. There isn't really any good way of doing the fallback as far as I can tell.

Closed as wontfix. If anyone has a good idea on how to solve this, I'm happy to reopen. As far as I can tell, it doesn't look like it's possible to do in a clean way.

When getting the blob, the following code is useful for decoding its content:

    import base64

    blob = repo.get_git_blob(sha[0])
    raw = base64.b64decode(blob.content)
    return raw.decode("utf8")
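
A slightly more general version of this decoding step is sketched below. It assumes a blob reports its encoding as either "base64" or "utf-8" and that base64 content may contain line breaks; decode_blob is a hypothetical helper, and the sample content is generated locally and not fetched from GitHub.

```python
import base64


def decode_blob(content, encoding):
    """Return the raw bytes of a blob from its content and encoding fields."""
    if encoding == "base64":
        # b64decode ignores the line breaks found in wrapped base64 text.
        return base64.b64decode(content)
    if encoding == "utf-8":
        return content.encode("utf-8")
    raise ValueError(f"unexpected blob encoding: {encoding!r}")


# Locally generated sample standing in for blob.content.
sample = base64.encodebytes(b"large file body\n" * 3).decode("ascii")
raw = decode_blob(sample, "base64")
print(raw.decode("utf-8", errors="replace"))
```

Returning bytes leaves the choice of text decoding to the caller, which matters for binary files that are not valid UTF-8.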

Also, updating a file runs into the same problem.

raise self.__createException(status, responseHeaders, output)

github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/reference/repos#get-repository-content"}

I am getting this error when trying to download a Git repository's files for the master branch.
