Pdf2docx: Problem with the docx file after the convert

Created on 13 May 2021  ·  7 Comments  ·  Source: dothinking/pdf2docx

Hello to the community. I'm new to programming, so thanks in advance. I run the program in PyCharm; the conversion starts and seems to work without problems (Parsing Page... -> Creating Page... etc.). But when I go to the directory where the file was saved to check whether the conversion worked, I see what is shown in the attached picture: the content of the docx file appears as pictures, in pieces, not as text. I was wondering whether you have any idea why this is happening and how to fix it.
problem

good first issue question

All 7 comments

Hi, welcome. From the screenshot, I guess the "text" you saw is not real text. Can you copy and paste the text? It'd be great if you could upload the PDF for me to test (the one page that failed is enough).

one_page.pdf

The PDF I want to convert.


one_page.docx

The converted docx.

I hope this helps.

Sorry, one limitation of pdf2docx is that it can only process text-based PDFs. Your PDF page consists of multiple image pieces, which are not OCR-ed but copied to the docx directly. The screenshot below shows the images in the PDF.

image
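As a rough way to check this before converting, the sketch below flags pages that contain images but almost no extractable text. It is only a heuristic under stated assumptions: the helper names `looks_scanned` and `scanned_pages` and the `min_chars` threshold are made up for illustration, and it assumes a recent PyMuPDF (`fitz`) where `page.get_text()` and `page.get_images()` are available. It is not part of pdf2docx.

```python
from typing import List


def looks_scanned(text: str, image_count: int, min_chars: int = 20) -> bool:
    """Heuristic: images present but (almost) no selectable text."""
    return image_count > 0 and len(text.strip()) < min_chars


def scanned_pages(pdf_path: str) -> List[int]:
    """Return 0-based indices of pages that look image-only."""
    # PyMuPDF is imported lazily so the heuristic above works without it.
    import fitz

    with fitz.open(pdf_path) as doc:
        return [
            index
            for index, page in enumerate(doc)
            if looks_scanned(page.get_text(), len(page.get_images()))
        ]
```

If `scanned_pages("one_page.pdf")` returns a non-empty list, those pages would most likely need OCR with a separate tool before the converted docx can contain real text.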

OK, I will keep that in mind. Thanks for your quick reply. I don't know whether you have already built a user interface, but I have made a basic, user-friendly interface for your program. Here is the code:

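from tkinter import Tk, Label, Entry, Button, END, filedialog

from pdf2docx import Converter

root = Tk()
root.title('PDF_2_Docx Converter')
root.geometry('500x500')
root.config(bg='grey')


def pdf_file_location():
    # Ask for the source PDF and show its path in the PDF entry.
    filename = filedialog.askopenfilename(filetypes=[('PDF files', '*.pdf')])
    if filename:  # an empty result means the dialog was cancelled
        file_path_pdf_entry.delete(0, END)
        file_path_pdf_entry.insert(0, filename)


def docx_folder_location():
    # Ask for the output folder; the result is always named New_DOCX.docx.
    folder = filedialog.askdirectory()
    if folder:
        file_path_docx_entry.delete(0, END)
        file_path_docx_entry.insert(0, folder + '/New_DOCX.docx')


def convert_button_function():
    # Convert every page of the selected PDF; close the file even on failure.
    cv = Converter(file_path_pdf_entry.get())
    try:
        cv.convert(file_path_docx_entry.get(), start=0, end=None)
    finally:
        cv.close()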


"""Labels"""
label1 = Label(text='PDF to Docx', font='Impact 40', bg='white', fg='#1E90FF')
label1.grid(column=2, row=1, sticky='n', pady=50, padx=120)


"""Entries"""

# PDF file entry
file_path_pdf_entry = Entry(border=5)
file_path_pdf_entry.grid(ipadx=90, ipady=4, padx=20, sticky='nw', column=2, pady=1, row=2)

# Docx file entry
file_path_docx_entry = Entry(border=5)
file_path_docx_entry.grid(column=2, ipady=4, ipadx=90, padx=20, sticky='nw', pady=70, row=3)

"""Buttons"""

# Convert Button
converter_button = Button(text='Convert', bg='#1E90FF', fg='white', font='impact 20', border=5,
                          command=convert_button_function)
converter_button.grid(padx=175, sticky='s', ipady=5, ipadx=10, column=2, row=4)

select_pdf_file = Button(text='Select PDF file', fg='black', bg='white', border=3,
                         command=pdf_file_location)
select_pdf_file.grid(column=2, sticky='ne', row=2, pady=6, padx=60)

select_new_file_folder = Button(text='Select new file folder', fg='black', bg='white', border=3,
                                command=docx_folder_location)
select_new_file_folder.grid(column=2, sticky='ne', row=3, pady=74, padx=26)


root.mainloop()

Much appreciated. It's a good idea -> I'll put a GUI into the backlog.

Would you like to improve it a bit further, e.g. convert multiple PDF files under a user-defined folder in batch mode? After that, please submit a PR so I can merge your work into this library to benefit more people.

By batch mode, do you mean saving it as a batch file and running it? I can make it a Windows exe. Which do you prefer?

With your user interface, one can convert one file at a time. But one might need to convert lots of PDF files; in that case, it's more convenient to put all the PDF files in a folder, select that folder, and convert them all in one go.
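A minimal sketch of that folder-based batch mode might look like the following. The helper names `plan_batch` and `convert_folder` are hypothetical, and writing each docx next to its PDF is an assumption. The `Converter(...)`, `convert(..., start=0, end=None)` and `close()` calls are the same ones used in the GUI code earlier in this thread.

```python
from pathlib import Path
from typing import List, Tuple


def plan_batch(folder: str) -> List[Tuple[Path, Path]]:
    """Pair every PDF directly inside `folder` with a .docx path of the same name."""
    pdfs = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(pdf, pdf.with_suffix(".docx")) for pdf in pdfs]


def convert_folder(folder: str) -> List[Path]:
    """Convert each planned pair and return the PDFs that failed."""
    # pdf2docx is imported lazily so that planning works without it.
    from pdf2docx import Converter

    failed = []
    for pdf, docx in plan_batch(folder):
        try:
            cv = Converter(str(pdf))
            try:
                cv.convert(str(docx), start=0, end=None)
            finally:
                cv.close()
        except Exception:
            # Keep going so one bad file does not stop the whole batch.
            failed.append(pdf)
    return failed
```

In the GUI, a third button could pass the result of `filedialog.askdirectory()` to `convert_folder` and then report any paths it returns.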
