Pdf2docx: erreur de compression -2

Créé le 20 oct. 2020 · 9Commentaires · Source: dothinking/pdf2docx

Une erreur compression error -2 . Ce serait super si quelqu'un pouvait donner des indications

Attaché le PDF avec le problème:
5_FR.pdf

Message d'erreur:

Processing Pages: 1/28...mupdf: compression error -2
Traceback (most recent call last):
  File "/Users/erikchan/Downloads/convert.py", line 10, in <module>
    parse(pdf_files[i], docx_files[i])
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/main.py", line 31, in parse
    cv.make_docx(indexes, multi_processing)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 118, in make_docx
    self._make_docx(page_indexes)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 192, in _make_docx
    self.initialize(page).parse().make_page(self.doc_docx)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 172, in initialize
    images, paths = self._paths_extractor.extract_paths(page)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/shape/Path.py", line 61, in extract_paths
    image = largest.to_image(page) if largest.contains_curve else None
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/shape/Path.py", line 140, in to_image
    return ImagesExtractor.clip_page(page, bbox, zoom)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/image/Image.py", line 60, in clip_page
    return cls.to_raw_dict(image, bbox)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/image/Image.py", line 50, in to_raw_dict
    'image': image.getPNGData()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fitz/fitz.py", line 5899, in getPNGData
    barray = self._getImageData(1)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fitz/fitz.py", line 5868, in _getImageData
    return _fitz.Pixmap__getImageData(self, format)
RuntimeError: compression error -2

bug enhancement

Source

echan00

Commentaire le plus utile

Il semble que cette image flottante avec python-docx soit une demande courante, document ici pour le partage.

# -*- coding: utf-8 -*-

'''
Implement floating image based on python-docx.

- Text wrapping style: BEHIND TEXT <wp:anchor behindDoc="1">
- Picture position: top-left corner of PAGE `<wp:positionH relativeFrom="page">`.

Create a docx sample (Layout | Positions | More Layout Options) and explore the 
source xml (Open as a zip | word | document.xml) to implement other text wrapping
styles and position modes per `CT_Anchor._anchor_xml()`.
'''

from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

# refer to docx.oxml.shape.CT_Inline
class CT_Anchor(BaseOxmlElement):
    """
    ``<w:anchor>`` element, container for a floating image.
    """
    extent = OneAndOnlyOne('wp:extent')
    docPr = OneAndOnlyOne('wp:docPr')
    graphic = OneAndOnlyOne('a:graphic')

    <strong i="7">@classmethod</strong>
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
        """
        Return a new ``<wp:anchor>`` element populated with the values passed
        as parameters.
        """
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
        anchor.extent.cx = cx
        anchor.extent.cy = cy
        anchor.docPr.id = shape_id
        anchor.docPr.name = 'Picture %d' % shape_id
        anchor.graphic.graphicData.uri = (
            'http://schemas.openxmlformats.org/drawingml/2006/picture'
        )
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    <strong i="8">@classmethod</strong>
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
        """
        Return a new `wp:anchor` element containing the `pic:pic` element
        specified by the argument values.
        """
        pic_id = 0  # Word doesn't seem to use this, but does not omit it
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    <strong i="9">@classmethod</strong>
    def _anchor_xml(cls, pos_x, pos_y):
        return (
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
            '           %s>\n'
            '  <wp:simplePos x="0" y="0"/>\n'
            '  <wp:positionH relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionH>\n'
            '  <wp:positionV relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionV>\n'                    
            '  <wp:extent cx="914400" cy="914400"/>\n'
            '  <wp:wrapNone/>\n'
            '  <wp:docPr id="666" name="unnamed"/>\n'
            '  <wp:cNvGraphicFramePr>\n'
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'
            '  </wp:cNvGraphicFramePr>\n'
            '  <a:graphic>\n'
            '    <a:graphicData uri="URI not set"/>\n'
            '  </a:graphic>\n'
            '</wp:anchor>' % ( nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y) )
        )


# refer to docx.parts.story.BaseStoryPart.new_pic_inline
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
    """Return a newly-created `w:anchor` element.

    The element contains the image specified by *image_descriptor* and is scaled
    based on the values of *width* and *height*.
    """
    rId, image = part.get_or_add_image(image_descriptor)
    cx, cy = image.scaled_dimensions(width, height)
    shape_id, filename = part.next_id, image.filename    
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)


# refer to docx.text.run.add_picture
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
    """Add float picture at fixed position `pos_x` and `pos_y` to the top-left point of page.
    """
    run = p.add_run()
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
    run._r.add_drawing(anchor)

# refer to docx.oxml.shape.__init__.py
register_element_cls('wp:anchor', CT_Anchor)


if __name__ == '__main__':

    from docx import Document
    from docx.shared import Inches, Pt

    document = Document()

    # add a floating image
    p = document.add_paragraph()
    add_float_picture(p, 'test.png', width=Inches(5.0), pos_x=Pt(20), pos_y=Pt(30))

    # add text
    p.add_run('Hello World'*50)


    document.save('output.docx')

dothinking le 24 oct. 2020

🎉2 👍2 🚀1

Tous les 9 commentaires

Merci d'avoir fourni ce cas.

Beaucoup de graphiques vectoriels, c'est- path dire compression error -2 .

En plus, deux autres problèmes pour convertir ce pdf :

La couleur du chemin est incorrecte. Je suppose que la cause première est qu'actuellement, seuls Device Color Space (Gray/RGB/CMYK) sont pris en compte, alors que cet exemple de pdf peut suivre un espace colorimétrique spécial comme Indexed CS , DeviceN CS .
les images superposées sont supprimées. python-docx est appliqué pour écrire le docx converti, mais python-docx ne prend pas en charge les éléments flottants maintenant. Ainsi, les images flottantes sont supprimées en guise de compromis.

Donc, malheureusement, pdf2docx n'est pas en mesure de convertir votre pdf pour le moment. Au moins, les efforts suivants devraient être faits :

chemin de détourage lors de l'extraction des chemins d'un pdf
implémenter plus d'espace colorimétrique
introduire des images flottantes

dothinking le 21 oct. 2020

👍1

Merci @dothinking pour l'explication claire. Je suis surpris que cette bibliothèque ne soit pas plus populaire qu'elle ne l'est. La version actuelle est déjà très bonne et je sais que beaucoup de gens peuvent en bénéficier.

S'il vous plaît laissez-moi savoir comment je peux aider à résoudre l'un des problèmes que vous avez énumérés (j'aurai besoin de conseils.) Que ce soit pour résoudre les bogues, tester ou autrement.

echan00 le 21 oct. 2020

👍1

Merci beaucoup @echan00.

Quelques avancées sur cette question :

L' image flottante [x]
[ ] chemin du clip et espace colorimétrique -> bonne nouvelle qu'une autre bibliothèque en amont PyMuPDF publié une nouvelle fonctionnalité sur l'extraction du chemin. Je vais me renseigner et j'espère pouvoir résoudre ce problème.

Après cela, tout test ou suggestion est apprécié.

Commentaire sur 2020-12-31 : le dernier PyMuPDF 1.18.5 a résolu ce problème en partie, mais pas parfaitement, en particulier le chemin de détourage.

dothinking le 24 oct. 2020

👍1

Étant donné que l'image en ligne est prise en charge dans python-docx , les étapes pour explorer l'image flottante :

créer deux fichiers docx, un avec une image en ligne et un autre une image flottante (dans ce cas, le mode behind text )
vérifier la différence de source xml entre ces deux fichiers
implémenter une image flottante basée sur la structure et le code observés pour l'image en ligne

résultats de la structure xml :

l'image en ligne est un nœud <wp:inline> sous <w:drawing>
l'image flottante est un nœud <wp:anchor> sous <w:drawing>
en plus de tous les sous-nœuds de l'image en ligne, l'image flottante contient également <wp:positionH> et <wp:positionV> pour définir la position fixe

Donc, l'idée est de créer <wp:anchor> nœud

tous les nœuds sont identiques avec l'image en ligne
<wp:positionH> et <wp:positionV>

dothinking le 24 oct. 2020

Il semble que cette image flottante avec python-docx soit une demande courante, document ici pour le partage.

# -*- coding: utf-8 -*-

'''
Implement floating image based on python-docx.

- Text wrapping style: BEHIND TEXT <wp:anchor behindDoc="1">
- Picture position: top-left corner of PAGE `<wp:positionH relativeFrom="page">`.

Create a docx sample (Layout | Positions | More Layout Options) and explore the 
source xml (Open as a zip | word | document.xml) to implement other text wrapping
styles and position modes per `CT_Anchor._anchor_xml()`.
'''

from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

# refer to docx.oxml.shape.CT_Inline
class CT_Anchor(BaseOxmlElement):
    """
    ``<w:anchor>`` element, container for a floating image.
    """
    extent = OneAndOnlyOne('wp:extent')
    docPr = OneAndOnlyOne('wp:docPr')
    graphic = OneAndOnlyOne('a:graphic')

    <strong i="7">@classmethod</strong>
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
        """
        Return a new ``<wp:anchor>`` element populated with the values passed
        as parameters.
        """
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
        anchor.extent.cx = cx
        anchor.extent.cy = cy
        anchor.docPr.id = shape_id
        anchor.docPr.name = 'Picture %d' % shape_id
        anchor.graphic.graphicData.uri = (
            'http://schemas.openxmlformats.org/drawingml/2006/picture'
        )
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    <strong i="8">@classmethod</strong>
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
        """
        Return a new `wp:anchor` element containing the `pic:pic` element
        specified by the argument values.
        """
        pic_id = 0  # Word doesn't seem to use this, but does not omit it
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    <strong i="9">@classmethod</strong>
    def _anchor_xml(cls, pos_x, pos_y):
        return (
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
            '           %s>\n'
            '  <wp:simplePos x="0" y="0"/>\n'
            '  <wp:positionH relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionH>\n'
            '  <wp:positionV relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionV>\n'                    
            '  <wp:extent cx="914400" cy="914400"/>\n'
            '  <wp:wrapNone/>\n'
            '  <wp:docPr id="666" name="unnamed"/>\n'
            '  <wp:cNvGraphicFramePr>\n'
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'
            '  </wp:cNvGraphicFramePr>\n'
            '  <a:graphic>\n'
            '    <a:graphicData uri="URI not set"/>\n'
            '  </a:graphic>\n'
            '</wp:anchor>' % ( nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y) )
        )


# refer to docx.parts.story.BaseStoryPart.new_pic_inline
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
    """Return a newly-created `w:anchor` element.

    The element contains the image specified by *image_descriptor* and is scaled
    based on the values of *width* and *height*.
    """
    rId, image = part.get_or_add_image(image_descriptor)
    cx, cy = image.scaled_dimensions(width, height)
    shape_id, filename = part.next_id, image.filename    
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)


# refer to docx.text.run.add_picture
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
    """Add float picture at fixed position `pos_x` and `pos_y` to the top-left point of page.
    """
    run = p.add_run()
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
    run._r.add_drawing(anchor)

# refer to docx.oxml.shape.__init__.py
register_element_cls('wp:anchor', CT_Anchor)


if __name__ == '__main__':

    from docx import Document
    from docx.shared import Inches, Pt

    document = Document()

    # add a floating image
    p = document.add_paragraph()
    add_float_picture(p, 'test.png', width=Inches(5.0), pos_x=Pt(20), pos_y=Pt(30))

    # add text
    p.add_run('Hello World'*50)


    document.save('output.docx')

dothinking le 24 oct. 2020

🎉2 👍2 🚀1

Nice @dothinking , il semble que vous sachiez quels sont exactement les problèmes. J'ai une variété de PDF que je peux aider à tester une fois que vous êtes prêt

echan00 le 26 oct. 2020

@dothinking merci beaucoup pour votre exemple de code ! Résout parfaitement mon problème !!!!

tonysepia le 24 nov. 2020

Je n'ai pas eu le temps de ce projet depuis si longtemps. La nouvelle version v0.5.0 est maintenant disponible pour résoudre en partie ce problème :

l'image flottante est maintenant prise en charge.
l'extraction de chemin est prise en charge par la bibliothèque en amont PyMuPDF , mais pas aussi bien pour les formes compliquées, par exemple un chemin de détourage.

Avec cette dernière version, l'exemple de pdf peut être converti avec succès, mais nécessite encore beaucoup de travail pour améliorer la qualité du fichier docx converti, en raison du style compliqué/magnifique.

dothinking le 31 déc. 2020

Wow, c'est une excellente mise à niveau. Merci beaucoup pour votre travail acharné @dothinking

echan00 le 1 janv. 2021

Cette page vous a été utile?

0 / 5 - 0 notes