PyGithub: Downloading large files

Created on 20 Nov 2017 · 5 comments · Source: PyGithub/PyGithub

Using the .get_contents() method to try to download a large file raises the error:

{'errors': [{'code': 'too_large', 'field': 'data',
     'resource': 'Blob'}],
     'message': 'This API returns blobs up to 1 MB in size. The requested blob is too large to fetch via the API, but you can use the Git Data API to request blobs up to 100 MB in size.',
     'documentation_url': 'https://developer.github.com/v3/repos/contents/#get-contents'}

Is there a way to detect this and fall back to another handler that can download the file?
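
A minimal sketch of how the error might be detected, assuming the exception raised by PyGithub exposes the decoded response body as a dict (in PyGithub this is the `data` attribute of `GithubException`). The helper name `is_too_large_error` is hypothetical, and it only inspects a payload shaped like the one shown above.

```python
def is_too_large_error(data):
    """Return True if an API error payload reports a blob over the size limit."""
    if not isinstance(data, dict):
        return False
    # The payload lists individual errors; look for the 'too_large' code.
    errors = data.get("errors") or []
    return any(
        isinstance(error, dict) and error.get("code") == "too_large"
        for error in errors
    )


# Payload copied from the error message above (message text shortened).
example_payload = {
    "errors": [{"code": "too_large", "field": "data", "resource": "Blob"}],
    "message": "This API returns blobs up to 1 MB in size.",
}
print(is_too_large_error(example_payload))
```

Inside an `except GithubException as exc:` block, a call such as `is_too_large_error(exc.data)` could then decide whether to switch to get_git_blob or re-raise.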

For example, if something like this fails:

contents = repository.get_dir_contents(urllib.parse.quote(server_path), ref=sha)

for content in contents:
    if content.type != 'dir':
        file_content = repository.get_contents(urllib.parse.quote(content.path), ref=sha)

optionally fall back to:

file_content = repository.get_git_blob(content.sha)

question

psychemedia

Most helpful comment

I have the same problem and ended up doing something like the following.

If we are dumping all the files in a directory and some are larger than 1 MB,

 file_contents = repo.get_contents(dir_name, ref=branch)

then sha exists for each file_content, and the following can be used to fetch the blob for each file

for file_content in file_contents:
    try:
        if file_content.encoding != 'base64':
            # some error ... (handle the unexpected encoding here)
            pass
        # ok...
    except GithubException:
        # if file_content DOES NOT HAVE encoding, it is a large file
        blob = repo.get_git_blob(file_content.sha)
        # do something with blob

If path_name refers to a single file larger than 1 MB, it has to be a try/except block like the following:

        try:
            res = repo.get_contents(path_name, ref=branch)
            # ok, we have the content
        except GithubException:
            return get_blob_content(repo, branch, path_name)

where get_blob_content is something like

def get_blob_content(repo, branch, path_name):
    # first get the branch reference
    ref = repo.get_git_ref(f'heads/{branch}')
    # then get the tree
    tree = repo.get_git_tree(ref.object.sha, recursive='/' in path_name).tree
    # look for path in tree
    sha = [x.sha for x in tree if x.path == path_name]
    if not sha:
        # well, not found..
        return None
    # we have sha
    return repo.get_git_blob(sha[0])

The real code with error checking is longer, but the idea is here.

BoPeng on 11 May 2020

All 5 comments

I've run into this problem before too. In my case, since I always had the blob's SHA, I used get_git_blob instead.

However, get_git_blob does not work for any object type other than a blob (hence the name). You need to know the object type before trying to call it.

To do the fallback, you need to know two pieces of information:

The object type.
The SHA of the object.

If get_contents fails, it tells you neither of these. As far as I know, there isn't really a good way to do the fallback.
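
One possible workaround, sketched under assumptions: list the parent directory with get_contents, whose entries carry both `type` and `sha`, and pick out the entry for the wanted path. `find_type_and_sha` is a hypothetical helper, not part of PyGithub. It assumes `path` has no leading slash and that get_contents returns a list for a directory (older PyGithub versions used get_dir_contents for that), and it costs one extra API request.

```python
import posixpath


def find_type_and_sha(repo, path, ref):
    """Look up (type, sha) for `path` by listing its parent directory.

    `repo` is expected to behave like a PyGithub Repository object.
    Returns None when the parent listing has no entry for `path`.
    """
    # Repository paths always use forward slashes; '' means the repo root.
    parent = posixpath.dirname(path)
    for entry in repo.get_contents(parent, ref=ref):
        if entry.path == path:
            return entry.type, entry.sha
    return None
```

If the returned type is 'file', the SHA can be passed to get_git_blob; any other type needs different handling.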

jasonwhite on 27 Nov 2017

Closed as wontfix. If anyone has a good idea for how to solve this, I'm happy to reopen. As far as I can tell, it doesn't seem possible to do this cleanly.

jasonwhite on 8 Dec 2017

Once you have the blob, the following code will be useful.

    blob = repo.get_git_blob(sha[0])
    b64 = base64.b64decode(blob.content)
    return b64.decode("utf8")
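
A slightly more defensive variant, as a sketch only: decoding to UTF-8 fails for binary files, so it may be safer to return bytes and let the caller decide. `blob_to_bytes` is a hypothetical helper. It assumes the blob object has `content` and `encoding` attributes, and that the encoding is either "base64" or "utf-8", the two values the Git Data API reports for blobs.

```python
import base64


def blob_to_bytes(blob):
    """Return the raw bytes of a blob-like object.

    `blob` needs a `content` string and an `encoding` string attribute.
    """
    if blob.encoding == "base64":
        # b64decode ignores the newlines embedded in the encoded content.
        return base64.b64decode(blob.content)
    # Otherwise the content is assumed to be plain text.
    return blob.content.encode("utf-8")
```

For a text file, `blob_to_bytes(blob).decode("utf-8")` gives the same result as the snippet above.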

Also, updating a file will run into this problem too.

eeechoo on 22 Jul 2020

I get this error when trying to download files from a repository on the master branch:

raise self.__createException(status, responseHeaders, output)

github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/reference/repos#get-repository-content"}