PyGithub: Downloading large files

Created on 20 Nov 2017 · 5 comments · Source: PyGithub/PyGithub

Trying to download a large file with the .get_contents() method raises an error:

{'errors': [{'code': 'too_large', 'field': 'data',
     'resource': 'Blob'}],
     'message': 'This API returns blobs up to 1 MB in size. The requested blob is too large to fetch via the API, but you can use the Git Data API to request blobs up to 100 MB in size.',
     'documentation_url': 'https://developer.github.com/v3/repos/contents/#get-contents'}

It would be useful to detect this case and still download the file.

For example, if something like this fails:

contents = repository.get_dir_contents(urllib.parse.quote(server_path), ref=sha)

for content in contents:
    if content.type != 'dir':
        file_content = repository.get_contents(urllib.parse.quote(content.path), ref=sha)

optionally fall back to:

file_content = repository.get_git_blob(content.sha)
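A related way to get the same effect without waiting for the request to fail is to look at the size reported for each directory entry. The sketch below is only an illustration: `fetch_entry` and `CONTENTS_API_LIMIT` are hypothetical names, the threshold is a conservative reading of the "1 MB" in the error message, and it assumes each entry exposes `size`, `path` and `sha`. The path is passed through unquoted, so quoting it as in the snippet above is left to the caller.

```python
# Assumed threshold: a conservative reading of the "1 MB" limit quoted
# in the error message. Anything larger is fetched through the blob API.
CONTENTS_API_LIMIT = 1_000_000


def fetch_entry(repository, content, ref):
    """Fetch one directory entry, choosing the API by the reported size.

    `repository` is expected to behave like a PyGithub Repository and
    `content` like one entry of a directory listing (hypothetical helper).
    """
    if content.size <= CONTENTS_API_LIMIT:
        # Small enough for the contents API.
        return repository.get_contents(content.path, ref=ref)
    # Too large for the contents API: ask for the blob by its SHA instead.
    return repository.get_git_blob(content.sha)
```

Either branch returns an object with a `content` attribute, so the caller can decode the result in one place.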

question


psychemedia

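Most helpful comment

I ran into the same problem and ended up doing something along the following lines.

To dump all the files in a directory when some of them are larger than 1 MB, first list the directory: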

 file_contents = repo.get_contents(dir_name, ref=branch)

Each file_content then has a sha, and the blob for each file can be fetched with:

from github import GithubException

for file_content in file_contents:
    try:
        if file_content.encoding != 'base64':
            pass  # some error ...
        # ok...
    except GithubException:
        # if file_content DOES NOT HAVE encoding, it is a large file
        blob = repo.get_git_blob(file_content.sha)
        # do something with blob

If path_name refers to a single file larger than 1 MB, a try/except block like the following is needed:

try:
    res = repo.get_contents(path_name, ref=branch)
    # ok, we have the content
except GithubException:
    # too large for the contents API: fall back to the blob
    res = get_blob_content(repo, branch, path_name)

where get_blob_content looks something like this:

def get_blob_content(repo, branch, path_name):
    # first get the branch reference
    ref = repo.get_git_ref(f'heads/{branch}')
    # then get the tree
    tree = repo.get_git_tree(ref.object.sha, recursive='/' in path_name).tree
    # look for path in tree
    sha = [x.sha for x in tree if x.path == path_name]
    if not sha:
        # well, not found..
        return None
    # we have sha
    return repo.get_git_blob(sha[0])

The real code, with error checking, is longer, but this is the idea.

BoPeng on 11 May 2020
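As a variation on the lookup above, the path can also be resolved one component at a time, so that no recursive tree has to be fetched for a deeply nested file. This is a minimal sketch rather than the commenter's code: `find_blob_sha` is a hypothetical name, and it assumes that the elements of a non-recursive tree carry `path` (the bare entry name), `type` and `sha`.

```python
def find_blob_sha(repo, branch, path_name):
    """Return the blob SHA for path_name on branch, or None if not found.

    Hypothetical helper: walks the Git trees level by level instead of
    requesting one recursive tree.
    """
    # Start from the commit the branch points at, as in the code above.
    sha = repo.get_git_ref(f"heads/{branch}").object.sha
    parts = path_name.strip("/").split("/")
    for depth, part in enumerate(parts):
        # Intermediate components must be trees, the last one a blob.
        wanted = "blob" if depth == len(parts) - 1 else "tree"
        matches = [
            element.sha
            for element in repo.get_git_tree(sha).tree
            if element.path == part and element.type == wanted
        ]
        if not matches:
            return None
        sha = matches[0]
    return sha
```

The trade-off is one request per path component instead of a single, possibly very large, recursive response. The returned SHA can then be passed to `repo.get_git_blob`.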

👍3 🎉1

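All 5 comments

I have run into this problem before as well. In my case I always had the blob's SHA, so I used get_git_blob instead.

However, get_git_blob does not work for object types other than blobs (hence the name). You need to know the type of the object before calling it.

To perform the fallback, you need to know two pieces of information:

1. The type of the object.
2. The SHA of the object.

When get_contents fails, it tells you neither of these. As far as I can tell, there isn't really a good way to do the fallback.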
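One possible way to recover both pieces of information, at the cost of an extra request, is to list the parent directory and pick out the matching entry. The sketch below assumes that directory listings returned by `get_contents` expose `path`, `type` and `sha` for each entry, as the other snippets in this thread suggest. `describe_path` is a hypothetical name, and very large directories may not be listed completely.

```python
import posixpath


def describe_path(repo, path_name, ref):
    """Return (type, sha) for path_name, or None if it is not listed.

    Hypothetical helper: looks the entry up in its parent directory
    instead of requesting the (possibly too large) file itself.
    """
    # "" is the repository root for top-level files.
    parent = posixpath.dirname(path_name)
    for entry in repo.get_contents(parent, ref=ref):
        if entry.path == path_name:
            return entry.type, entry.sha
    return None
```

With the type and SHA in hand, a caller can decide whether `get_git_blob` is the right fallback.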

jasonwhite on 27 Nov 2017

Closing as wontfix. If anyone has a good idea for how to solve this, I'm happy to reopen. As far as I can tell, it doesn't look possible to do this in a clean way.

jasonwhite on 8 Dec 2017

When fetching the blob, the following code is helpful:

    blob = repo.get_git_blob(sha[0])
    b64 = base64.b64decode(blob.content)
    return b64.decode("utf8")
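The decoding step above can be wrapped in a small helper that checks the blob's reported encoding before decoding. This is a sketch under assumptions: `blob_to_text` is a hypothetical name, the blob is expected to expose `content` and `encoding`, and any encoding other than base64 is treated as already being plain text.

```python
import base64


def blob_to_text(blob, text_encoding="utf-8"):
    """Return the blob's content as a string (hypothetical helper)."""
    if blob.encoding == "base64":
        # b64decode ignores the newlines that wrap long base64 payloads.
        raw = base64.b64decode(blob.content)
        return raw.decode(text_encoding)
    # Assumption: any other encoding means the content is already text.
    return blob.content
```

For binary files the `decode` step should be skipped and the bytes from `base64.b64decode` used directly.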

Updating a file runs into the same problem.

eeechoo on 22 Jul 2020

👍2

raise self.__createException(status, responseHeaders, output)

github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/reference/repos#get-repository-content"}

I get this error when trying to download files from a repository that I fetched for the master branch.

bhushanladdad on 28 Apr 2021
