Multi Word document conversion into PDF with Linux and LibreOffice and Unoconv

The article demonstrates how to efficiently convert a large number of documents into PDF format using LibreOffice, Unoconv, and a simple Python script (provided within the guide).

Efficient PDF Conversion with unoconv

When faced with the task of converting thousands of documents into PDF format for online portal access, tools like unoconv can streamline the process. This article outlines a step-by-step approach to mass conversion using LibreOffice, unoconv, and basic Python scripting on the Linux platform.

Prerequisites

Before proceeding with the conversion process, ensure that LibreOffice, unoconv, and Python are installed.

Installing Py-unoconv-batch-recursive

To set up py-unoconv-batch-recursive, clone the repository to a convenient location on your system, like /<somewhere-easy>/py-unoconv-batch-recursive. Execute the following command in a terminal window:

git clone https://github.com/enovision/py-unoconv-batch-recursive.git

Run the Python script and point the --in parameter at the root folder containing the documents to convert (e.g., /media/somewhere/CD-Data). The example below assumes the repository was cloned into /tmp:

python /tmp/py-unoconv-batch-recursive/recursive-pdf-converter.py --in="/media/somewhere/CD-Data"

If the --in parameter is omitted, the script uses the path where the Python script is located as the root directory for conversion.

By default, the program processes documents in formats like docx, doc, rtf, otf, and txt. To specify alternate file extensions, use the --ext parameter:

--ext="doc docx yyy zzz"
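The two parameters described above could be parsed along the following lines. This is a minimal sketch and not the code of the script in the repository: the function name parse_args, the attribute names root and extensions, and the normalization of extensions are assumptions made for illustration.

```python
import argparse
import os
import sys

# Default extensions as listed in the article.
DEFAULT_EXTENSIONS = "docx doc rtf otf txt"


def parse_args(argv=None):
    """Parse --in and --ext roughly the way the article describes them."""
    parser = argparse.ArgumentParser(description="Batch-convert documents to PDF")
    # "in" is a reserved word in Python, so the value is stored as args.root.
    parser.add_argument("--in", dest="root", default=None,
                        help="root folder to convert (default: the script's own folder)")
    parser.add_argument("--ext", default=DEFAULT_EXTENSIONS,
                        help="space-separated list of file extensions")
    args = parser.parse_args(argv)
    if args.root is None:
        # Fall back to the folder that holds the running script.
        args.root = os.path.dirname(os.path.abspath(sys.argv[0]))
    # Normalise 'DOCX' and '.docx' to 'docx' (an assumption of this sketch).
    args.extensions = [ext.lower().lstrip(".") for ext in args.ext.split()]
    return args
```

Called as parse_args(["--in=/media/somewhere/CD-Data", "--ext=doc docx"]), the sketch yields the given root folder and the extension list ["doc", "docx"].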

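The script traverses all subfolders below the root directory and converts every matching file, appending '.pdf' to the original filename. For example, report.doc becomes report.doc.pdf and report.docx becomes report.docx.pdf, so two source files that differ only in their extension do not overwrite each other's output. An --out option exists, but it currently has no effect.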
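The traversal and naming rule can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the function name plan_conversions is hypothetical, and the sketch only computes source and target paths without converting anything.

```python
import os


def plan_conversions(root, extensions):
    """Yield (source, target) path pairs for every matching file below root."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    # os.walk visits the root folder and every subfolder beneath it.
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext in wanted:
                source = os.path.join(folder, name)
                # Appending '.pdf' keeps report.doc and report.docx apart.
                yield source, source + ".pdf"
```

Keeping the planning step separate from the conversion itself makes it easy to preview what would be converted before any file is touched.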

Unoconv

Unoconv converts between file formats from the command line, using LibreOffice to do the rendering. PDFs produced this way are layered documents that preserve the text and layout of the source.
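A single conversion could be driven from Python roughly as shown below. The sketch assumes that the unoconv executable is on the PATH and uses its -f (output format) and -o (output file) options. The helper names unoconv_command and convert_to_pdf are hypothetical and do not come from the script in the repository.

```python
import subprocess


def unoconv_command(source, target):
    """Build the unoconv call that converts one file to PDF."""
    # -f selects the output format, -o the output file name.
    return ["unoconv", "-f", "pdf", "-o", target, source]


def convert_to_pdf(source, target):
    """Run unoconv and raise CalledProcessError if it reports a failure."""
    subprocess.run(unoconv_command(source, target), check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with filenames that contain spaces.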

These PDF outputs are ideal for integration with tools like ext-pdf-viewer, an Ext JS package leveraging Mozilla's pdf.js library.

Conclusion

Despite its simplicity, the Python script performed well when converting 4,500 documents. On average, a run completed within 30 minutes on a standard laptop (by 2019 standards). The script waits 20 seconds after starting the unoconv listener so that the listener has time to initialize before the first conversion. When all files have been processed, the listener is terminated and a "Done" message signals the end of the program.
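The listener lifecycle described above (start, wait, convert, terminate, report) can be sketched like this. It is a hedged illustration rather than the script's actual code: the function name run_with_listener and its parameters are assumptions, and the default command relies on unoconv's --listener option.

```python
import subprocess
import time


def run_with_listener(work, listener_cmd=("unoconv", "--listener"), startup_delay=20):
    """Start a listener process, give it time to initialize, then run work()."""
    listener = subprocess.Popen(list(listener_cmd))
    try:
        # Wait so the listener is ready before the first conversion starts.
        time.sleep(startup_delay)
        work()
    finally:
        # Stop the listener even if a conversion raised an exception.
        listener.terminate()
        listener.wait()
    print("Done")
```

Here work is any callable that performs the conversions. The try/finally block ensures that the listener does not keep running in the background after a failed run.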
