PyGithub: Downloading large files

Created on 20 Nov 2017 · 5 comments · Source: PyGithub/PyGithub

Using the .get_contents() method to try to download a large file raises the error:

{'errors': [{'code': 'too_large', 'field': 'data',
     'resource': 'Blob'}],
     'message': 'This API returns blobs up to 1 MB in size. The requested blob is too large to fetch via the API, but you can use the Git Data API to request blobs up to 100 MB in size.',
     'documentation_url': 'https://developer.github.com/v3/repos/contents/#get-contents'}

Is there a way to detect this and hand off to another handler that can download the file?

For example, if something like this fails:

contents = repository.get_dir_contents(urllib.parse.quote(server_path), ref=sha)

for content in contents:
    if content.type != 'dir':
        file_content = repository.get_contents(urllib.parse.quote(content.path), ref=sha)

optionally fall back to:

file_content = repository.get_git_blob(content.sha)

psychemedia

All 5 comments

I've run into this problem before too. In my case, since I always had the blob's SHA, I just used get_git_blob.

However, get_git_blob doesn't work for any object type other than a blob (hence the name). You need to know the object type before trying to call it.

To do the fallback, you need two pieces of information:

The type of the object.
The SHA of the object.

If get_contents fails, it tells you neither of these. As far as I can tell, there isn't really a good way to do the fallback.
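
As an illustration only (it is not something proposed in this comment), both pieces of information can be read from a listing of the parent directory, since a later reply in this thread notes that each listed entry carries a sha even for files over 1 MB. The sketch below assumes the parent directory can be listed with get_contents; find_type_and_sha is a hypothetical helper name, and repo is any object with a PyGithub-style get_contents(path, ref=...) method.

```python
import posixpath


def find_type_and_sha(repo, path, ref):
    """Return (type, sha) for `path` by listing its parent directory, or None.

    Hypothetical helper: `repo` only needs a get_contents(path, ref=...) method
    that returns a list of entries with .path, .type and .sha attributes.
    """
    parent = posixpath.dirname(path)  # "" means the repository root
    entries = repo.get_contents(parent, ref=ref)
    if not isinstance(entries, list):
        # get_contents returns a single object when `parent` is a file
        entries = [entries]
    for entry in entries:
        if entry.path == path:
            return entry.type, entry.sha
    return None
```

If the result is ("file", sha), that sha could then be passed to get_git_blob; any other type would need different handling.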

jasonwhite on 27 Nov 2017

Closed as wontfix. If anyone has a good idea for how to solve this, I'm happy to reopen. As far as I can tell, it doesn't seem possible to do this cleanly.

jasonwhite on 8 Dec 2017

I have the same problem and ended up doing something like the following.

If we dump all the files in a directory and some are larger than 1 MB,

 file_contents = repo.get_contents(dir_name, ref=branch)

then sha exists for each file_content, and the following can be used to fetch the blob of each file:

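from github import GithubException

for file_content in file_contents:
    try:
        if file_content.encoding != 'base64':
            # some error: unexpected encoding (placeholder handling)
            raise ValueError(f'unexpected encoding: {file_content.encoding}')
        # ok: small file, the decoded bytes are available directly
        data = file_content.decoded_content
    except GithubException:
        # if file_content DOES NOT HAVE encoding, it is a large file
        blob = repo.get_git_blob(file_content.sha)
        # do something with blob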

If path_name refers to a single file larger than 1 MB, you need a try/except block like the following:

        try:
            res = repo.get_contents(path_name, ref=branch)
            # ok, we have the content
        except GithubException:
            return get_blob_content(repo, branch, path_name)

where get_blob_content is something like

def get_blob_content(repo, branch, path_name):
    # first get the branch reference
    ref = repo.get_git_ref(f'heads/{branch}')
    # then get the tree
    tree = repo.get_git_tree(ref.object.sha, recursive='/' in path_name).tree
    # look for path in tree
    sha = [x.sha for x in tree if x.path == path_name]
    if not sha:
        # well, not found..
        return None
    # we have sha
    return repo.get_git_blob(sha[0])

The real code with error checking is longer, but the idea is here.

BoPeng on 11 May 2020
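
One piece of that error checking might be telling the too_large error apart from other failures (such as a 404), so that the blob fallback only runs when it can help. The sketch below is an assumption about how that could look, not the code referred to above; is_too_large is a hypothetical helper that inspects an error payload shaped like the one quoted in the question.

```python
def is_too_large(data):
    """Return True if a GitHub error payload reports a 'too_large' blob.

    Hypothetical helper: `data` is expected to be the decoded JSON error body,
    e.g. {'errors': [{'code': 'too_large', ...}], 'message': '...'}.
    """
    if not isinstance(data, dict):
        # Some error bodies are plain strings; those are never 'too_large'
        return False
    errors = data.get("errors") or []
    return any(
        isinstance(err, dict) and err.get("code") == "too_large"
        for err in errors
    )
```

In PyGithub the payload is available as the data attribute of a caught GithubException, so an except branch could call is_too_large(exc.data) and re-raise when it returns False.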

👍3 🎉1

Once you have the blob, the following code is useful.

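    # requires `import base64` at the top of the module
    blob = repo.get_git_blob(sha[0])
    raw = base64.b64decode(blob.content)  # blob.content is base64-encoded
    return raw.decode("utf8")  # assumes the file is UTF-8 text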
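
For binary files the final decode to text would fail, so a variant that stops at raw bytes may be safer. This is a minimal sketch under the assumption that the blob object exposes content and encoding attributes, as PyGithub's GitBlob does; blob_to_bytes is a hypothetical helper name.

```python
import base64


def blob_to_bytes(blob):
    """Return the raw bytes of a blob-like object with .content and .encoding.

    Hypothetical helper: no text decoding is attempted, so it also works for
    binary files.
    """
    if blob.encoding == "base64":
        # b64decode ignores the newlines GitHub inserts into the payload
        return base64.b64decode(blob.content)
    # Assumption: any other encoding means the content is already plain text
    return blob.content.encode("utf-8")
```

The returned bytes can be written to disk with open(path, "wb"), or decoded with .decode("utf8") when the file is known to be text.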

Also, updating a file will run into this problem too.

eeechoo on 22 Jul 2020

👍2

raise self.__createException(status, responseHeaders, output)

github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/reference/repos#get-repository-content"}

I get this error when trying to download repository files fetched for the master branch.