Pdf2docx: Problem with the docx file after conversion

Created on 13 May 2021 · 7 Comments · Source: dothinking/pdf2docx

Hello to the community. I'm new to programming, so thanks in advance. I run the program in PyCharm; the conversion starts and seems to work without problems (Parsing Page... -> Creating Page... etc.). But when I go to the directory where my file was saved to check whether the conversion worked, I see what is shown in the attached picture: the docx content appears as pictures, in pieces, rather than as text. I was wondering if you have any idea why this is happening and how to fix it.
[Attached screenshot: problem]

good first issue question

JoHnTsIm

Most helpful comment

OK, I will keep that in mind. Thanks for your quick reply. I don't know if you have already built a user interface, but I have made a basic, user-friendly interface for your program, and here is the code for it.

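from tkinter import Tk, Label, Entry, Button, END, filedialog

from pdf2docx import Converter

root = Tk()
root.title('PDF_2_Docx Converter')
root.geometry('500x500')
root.config(bg='grey')


def pdf_file_location():
    # Ask for the source PDF; leave the entry unchanged if the dialog is cancelled.
    filename = filedialog.askopenfilename(
        filetypes=[('PDF files', '*.pdf'), ('All files', '*.*')])
    if filename:
        file_path_pdf_entry.delete(0, END)
        file_path_pdf_entry.insert(0, filename)


def docx_folder_location():
    # Ask for the output folder; the result is always named New_DOCX.docx.
    folder_selected = filedialog.askdirectory()
    if folder_selected:
        file_path_docx_entry.delete(0, END)
        file_path_docx_entry.insert(0, folder_selected + '/' + 'New_DOCX.docx')


def convert_button_function():
    # Convert all pages of the selected PDF and close the converter even on failure.
    cv = Converter(file_path_pdf_entry.get())
    try:
        cv.convert(file_path_docx_entry.get(), start=0, end=None)
    finally:
        cv.close()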


"""Labels"""
label1 = Label(text='PDF to Docx', font='Impact 40', bg='white', fg='#1E90FF')
label1.grid(column=2, row=1, sticky='n', pady=50, padx=120)


"""Entries"""

# PDF file entry
file_path_pdf_entry = Entry(border=5)
file_path_pdf_entry.grid(ipadx=90, ipady=4, padx=20, sticky='nw', column=2, pady=1, row=2)

# Docx file entry
file_path_docx_entry = Entry(border=5)
file_path_docx_entry.grid(column=2, ipady=4, ipadx=90, padx=20, sticky='nw', pady=70, row=3)

"""Buttons"""

# Convert Button
converter_button = Button(text='Convert', bg='#1E90FF', fg='white', font='impact 20', border=5,
                          command=convert_button_function)
converter_button.grid(padx=175, sticky='s', ipady=5, ipadx=10, column=2, row=4)

select_pdf_file = Button(text='Select PDF file', fg='black', bg='white', border=3,
                         command=pdf_file_location)
select_pdf_file.grid(column=2, sticky='ne', row=2, pady=6, padx=60)

select_new_file_folder = Button(text='Select new file folder', fg='black', bg='white', border=3,
                                command=docx_folder_location)
select_new_file_folder.grid(column=2, sticky='ne', row=3, pady=74, padx=26)


root.mainloop()

JoHnTsIm on 13 May 2021

All 7 comments

Hi, welcome. From the screenshot, I guess the "text" you saw is not real text. Can you copy and paste it? It would be great if you could upload the PDF (one failing page is enough) so I can test it.

dothinking on 13 May 2021

one_page.pdf

The PDF I want to convert.

one_page.docx

The converted docx.

I hope this helps.

JoHnTsIm on 13 May 2021

Sorry, one limitation of pdf2docx is that it can process text-based PDFs only. Your PDF page consists of multiple image pieces, which are not OCR-ed but copied to the docx directly. The screenshot below shows the images in the PDF.

dothinking on 13 May 2021
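
The limitation described above can be checked before converting. The following is only a minimal sketch, not part of pdf2docx: it assumes PyMuPDF (imported as `fitz`) is installed in a version that provides the snake_case methods `page.get_text()` and `page.get_images()`. The helper names `page_has_text_layer` and `image_only_pages` are hypothetical. A page that yields no extractable text but does contain images is probably a scan, so its content would end up as pictures in the docx.

```python
from typing import List


def page_has_text_layer(text: str, min_chars: int = 1) -> bool:
    """Return True if the extracted text has at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def image_only_pages(pdf_path: str) -> List[int]:
    """Return 0-based indexes of pages that contain images but no extractable text."""
    # Assumption: PyMuPDF is installed; imported lazily so the helper above
    # can be used without it.
    import fitz

    doc = fitz.open(pdf_path)
    try:
        return [
            index for index, page in enumerate(doc)
            if not page_has_text_layer(page.get_text()) and page.get_images()
        ]
    finally:
        doc.close()
```

If `image_only_pages('one_page.pdf')` returns a non-empty list, those pages would need OCR with a separate tool before their text could be edited in Word.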

Much appreciated. It's a good idea -> I'll put a GUI into the backlog.

Would you like to improve it a bit further, e.g. by converting multiple PDF files under a user-defined folder in batch mode? After that, please submit a PR so I can merge your work into this library to benefit more people.

dothinking on 14 May 2021

By batch mode, do you mean saving it as a batch file and running it? I can also build it as a Windows exe. Which do you prefer?

JoHnTsIm on 15 May 2021

With your user interface, one can convert one file at a time. But one might need to convert lots of PDF files; in that case, it's more convenient to put all the PDF files in a folder, select that folder, and convert them all in one go.

dothinking on 15 May 2021
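
The folder-based batch mode described above could look roughly like the sketch below. It is an illustration under assumptions, not code from the library or from the thread: `convert_one` and `convert_folder` are hypothetical names, each docx is written next to its PDF, and only files ending in lowercase `.pdf` directly inside the folder are picked up. The conversion itself uses the same `Converter` calls as the GUI code earlier in the thread.

```python
from pathlib import Path
from typing import Callable, List, Optional, Tuple


def convert_one(pdf_path: str, docx_path: str) -> None:
    """Convert a single PDF with pdf2docx."""
    # Imported lazily; requires pdf2docx to be installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        cv.convert(docx_path, start=0, end=None)
    finally:
        cv.close()


def convert_folder(
    folder: str,
    convert: Callable[[str, str], None] = convert_one,
) -> List[Tuple[str, Optional[str]]]:
    """Convert every *.pdf directly inside folder.

    Returns one (pdf file name, error message or None) pair per file.
    """
    results: List[Tuple[str, Optional[str]]] = []
    for pdf in sorted(Path(folder).glob('*.pdf')):
        docx = pdf.with_suffix('.docx')  # same folder, same base name
        try:
            convert(str(pdf), str(docx))
            results.append((pdf.name, None))
        except Exception as exc:  # one bad file should not stop the batch
            results.append((pdf.name, str(exc)))
    return results
```

In the GUI, a "Select folder" button could pass the result of `filedialog.askdirectory()` to `convert_folder` and show the returned list, so the user can see which files failed.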
