compression error -2
I ran into an error. It would be great if anyone could offer some pointers.
I have attached the PDF in question.
5_KO.pdf
Error message:
Processing Pages: 1/28...mupdf: compression error -2
Traceback (most recent call last):
  File "/Users/erikchan/Downloads/convert.py", line 10, in <module>
    parse(pdf_files[i], docx_files[i])
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/main.py", line 31, in parse
    cv.make_docx(indexes, multi_processing)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 118, in make_docx
    self._make_docx(page_indexes)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 192, in _make_docx
    self.initialize(page).parse().make_page(self.doc_docx)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/converter.py", line 172, in initialize
    images, paths = self._paths_extractor.extract_paths(page)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/shape/Path.py", line 61, in extract_paths
    image = largest.to_image(page) if largest.contains_curve else None
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/shape/Path.py", line 140, in to_image
    return ImagesExtractor.clip_page(page, bbox, zoom)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/image/Image.py", line 60, in clip_page
    return cls.to_raw_dict(image, bbox)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pdf2docx/image/Image.py", line 50, in to_raw_dict
    'image': image.getPNGData()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fitz/fitz.py", line 5899, in getPNGData
    barray = self._getImageData(1)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/fitz/fitz.py", line 5868, in _getImageData
    return _fitz.Pixmap__getImageData(self, format)
RuntimeError: compression error -2
Thanks for providing this case.
This PDF contains a lot of vector graphics, i.e. paths such as lines, curves and their combinations. However, clipping paths are currently ignored by this library because of technical problems in extracting them from the PDF. Some paths are therefore left unclipped and lie outside the page, which leads to the compression error -2.
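To make the failure mode concrete, here is a minimal sketch of limiting a path's bounding box to the page before anything is rendered from it. The helper name `clamp_to_page` and the plain-tuple rectangles are assumptions for illustration only. This is not pdf2docx code, and whether the library guards against out-of-page regions this way is not stated in this thread.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def clamp_to_page(bbox: BBox, page_rect: BBox) -> Optional[BBox]:
    """Intersect `bbox` with `page_rect`.

    Returns None when the box lies completely outside the page, so the
    caller can skip rendering instead of requesting an empty image.
    (Hypothetical helper, not part of pdf2docx or PyMuPDF.)
    """
    x0 = max(bbox[0], page_rect[0])
    y0 = max(bbox[1], page_rect[1])
    x1 = min(bbox[2], page_rect[2])
    y1 = min(bbox[3], page_rect[3])
    if x1 <= x0 or y1 <= y0:
        return None  # nothing of the path is visible on the page
    return (x0, y0, x1, y1)


page = (0.0, 0.0, 595.0, 842.0)  # an A4 page, used only as sample data

# A path that sticks out of the left and bottom edges is cut back to the page.
print(clamp_to_page((-50.0, 700.0, 300.0, 900.0), page))
# A path entirely beside the page yields None.
print(clamp_to_page((600.0, 10.0, 700.0, 20.0), page))
```

Returning None rather than an empty rectangle keeps the decision to skip the image explicit at the call site.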
Besides, there are two more problems in converting this PDF.
Path colors are wrong. The root cause is that only the Device color spaces (Gray/RGB/CMYK) are considered at the moment, whereas this sample PDF uses special color spaces such as Indexed CS and DeviceN CS.
Overlapping images are removed. python-docx is used to create the converted docx, but python-docx does not currently support floating elements, so floating images are simply dropped.
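The color problem, the first of the two issues above, can be illustrated with a small sketch. It assumes the color space name is available as a string and uses the common naive CMYK formula. The function name `device_color_to_rgb` is hypothetical and this is not how pdf2docx or PyMuPDF is implemented.

```python
from typing import Optional, Sequence, Tuple


def device_color_to_rgb(cs_name: str,
                        components: Sequence[float]) -> Optional[Tuple[int, int, int]]:
    """Convert a color given in a Device color space to 8-bit RGB.

    Components are expected in the range 0..1. Any other color space
    (Indexed, DeviceN, ...) returns None, because its components are not
    color values by themselves: an Indexed color is a palette index, and
    DeviceN needs a tint transform. (Hypothetical helper for illustration.)
    """
    if cs_name == 'DeviceGray':
        r = g = b = components[0]
    elif cs_name == 'DeviceRGB':
        r, g, b = components
    elif cs_name == 'DeviceCMYK':
        c, m, y, k = components
        # naive conversion without any color management
        r, g, b = (1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k)
    else:
        return None
    return (round(255 * r), round(255 * g), round(255 * b))


print(device_color_to_rgb('DeviceCMYK', [0, 1, 1, 0]))  # pure red
print(device_color_to_rgb('Indexed', [3]))              # not handled
```

Treating an Indexed value as if it were DeviceGray would not fail, it would just produce a wrong gray level, which matches the kind of symptom described above.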
So unfortunately, pdf2docx cannot convert this PDF at present. At a minimum, the issues described here would have to be addressed first.
Thanks to @dothinking for the clear explanation. I am surprised this library is not better known than it is. The current version is already very good, and I know many people can benefit from it.
Please let me know how I can help with the problems you listed.
@echan00 Thanks.
Some progress on this issue:
PyMuPDF has published a new feature for extracting paths. I will look into it and hope it solves this problem. After that, any testing is always welcome.
Update on 2020-12-31: the latest PyMuPDF 1.18.5 solves this problem partially but not perfectly, especially for clipping paths.
python-docx supports inline images, so the steps to show a floating image (for example in "behind text" mode) start from a comparison of the XML structures:
An inline image is a <wp:inline> node under <w:drawing>.
A floating image is a <wp:anchor> node under <w:drawing>.
The position of a floating image is set by <wp:positionH> and <wp:positionV>.
So the idea is to create a <wp:anchor> node and then add the child nodes <wp:positionH> and <wp:positionV>.
Floating pictures seem to be a common request for python-docx. For sharing, see the document here.
# -*- coding: utf-8 -*-
'''
Implement floating image based on python-docx.
- Text wrapping style: BEHIND TEXT <wp:anchor behindDoc="1">
- Picture position: top-left corner of PAGE `<wp:positionH relativeFrom="page">`.
Create a docx sample (Layout | Positions | More Layout Options) and explore the
source xml (Open as a zip | word | document.xml) to implement other text wrapping
styles and position modes per `CT_Anchor._anchor_xml()`.
'''
from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

# refer to docx.oxml.shape.CT_Inline
class CT_Anchor(BaseOxmlElement):
    """
    ``<wp:anchor>`` element, container for a floating image.
    """
    extent = OneAndOnlyOne('wp:extent')
    docPr = OneAndOnlyOne('wp:docPr')
    graphic = OneAndOnlyOne('a:graphic')

    @classmethod
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
        """
        Return a new ``<wp:anchor>`` element populated with the values passed
        as parameters.
        """
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
        anchor.extent.cx = cx
        anchor.extent.cy = cy
        anchor.docPr.id = shape_id
        anchor.docPr.name = 'Picture %d' % shape_id
        anchor.graphic.graphicData.uri = (
            'http://schemas.openxmlformats.org/drawingml/2006/picture'
        )
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    @classmethod
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
        """
        Return a new `wp:anchor` element containing the `pic:pic` element
        specified by the argument values.
        """
        pic_id = 0  # Word doesn't seem to use this, but does not omit it
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    @classmethod
    def _anchor_xml(cls, pos_x, pos_y):
        # extent, docPr and the graphicData uri are placeholders that new() overwrites
        return (
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
            '           %s>\n'
            '  <wp:simplePos x="0" y="0"/>\n'
            '  <wp:positionH relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionH>\n'
            '  <wp:positionV relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionV>\n'
            '  <wp:extent cx="914400" cy="914400"/>\n'
            '  <wp:wrapNone/>\n'
            '  <wp:docPr id="666" name="unnamed"/>\n'
            '  <wp:cNvGraphicFramePr>\n'
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'
            '  </wp:cNvGraphicFramePr>\n'
            '  <a:graphic>\n'
            '    <a:graphicData uri="URI not set"/>\n'
            '  </a:graphic>\n'
            '</wp:anchor>' % (nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y))
        )

# refer to docx.parts.story.BaseStoryPart.new_pic_inline
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
    """Return a newly-created `wp:anchor` element.

    The element contains the image specified by *image_descriptor* and is scaled
    based on the values of *width* and *height*.
    """
    rId, image = part.get_or_add_image(image_descriptor)
    cx, cy = image.scaled_dimensions(width, height)
    shape_id, filename = part.next_id, image.filename
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)


# refer to docx.text.run.add_picture
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
    """Add a floating picture at the fixed position `pos_x`, `pos_y`, measured
    from the top-left corner of the page.
    """
    run = p.add_run()
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
    run._r.add_drawing(anchor)


# refer to docx.oxml.shape.__init__.py
register_element_cls('wp:anchor', CT_Anchor)

if __name__ == '__main__':
    from docx import Document
    from docx.shared import Inches, Pt

    document = Document()

    # add a floating image
    p = document.add_paragraph()
    add_float_picture(p, 'test.png', width=Inches(5.0), pos_x=Pt(20), pos_y=Pt(30))

    # add text
    p.add_run('Hello World'*50)

    document.save('output.docx')
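
The docstring above suggests opening a docx as a zip and reading word/document.xml. A minimal sketch of doing that from Python follows, using only the standard library. The helper name `count_drawings` is hypothetical, and the plain substring match assumes the usual `wp` namespace prefix that appears in the XML snippets above.

```python
import io
import zipfile


def count_drawings(docx_file):
    """Count inline and floating drawings in word/document.xml.

    `docx_file` is a path or a binary file-like object. The check is a
    plain substring count and assumes the `wp:` prefix is used.
    (Hypothetical helper for illustration.)
    """
    with zipfile.ZipFile(docx_file) as zf:
        xml = zf.read('word/document.xml').decode('utf-8')
    return {
        'inline': xml.count('<wp:inline'),
        'anchor': xml.count('<wp:anchor'),
    }


# Self-contained demo: a stand-in archive holding one floating drawing.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr(
        'word/document.xml',
        '<w:document><w:drawing><wp:anchor behindDoc="1"/></w:drawing></w:document>',
    )
buf.seek(0)
print(count_drawings(buf))
```

Calling `count_drawings('output.docx')` on the file written by the script above should report one anchor and no inline drawings if the floating image was added as intended.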
Great @dothinking, it looks like you know exactly what the problem is. I have a variety of PDFs and can help with testing whenever you are ready.
@dothinking Thanks for your code sample.
I have not been able to spend time on this project for far too long. A new version, v0.5.0, is now available and partially solves this problem.
Path extraction is supported by PyMuPDF, but it does not cope well with complex shapes such as clipping paths. With this latest version the sample PDF can be converted successfully, but because of its complex and fancy styles, a lot of work is still needed to improve the quality of the converted docx file.
Wow, this is a great upgrade. Sincere thanks to @dothinking for the hard work.