Batch conversion of Word documents to PDF with Linux, LibreOffice, and unoconv
Efficient PDF Conversion with unoconv
When faced with the task of converting thousands of documents into PDF format for online portal access, tools like unoconv can streamline the process. This article outlines a step-by-step approach to mass conversion using LibreOffice, unoconv, and basic Python scripting on the Linux platform.
Prerequisites
Before proceeding with the conversion process, ensure the following software is installed:
- LibreOffice
- unoconv
- Python 2.7.12
- py-unoconv-batch-recursive (available on GitHub)
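A quick way to confirm that the prerequisites are reachable from the shell is to look them up on the PATH. The sketch below is only an illustration and is written for Python 3, not the Python 2.7 listed above; the command names `soffice`, `unoconv`, and `python` are assumptions about a typical installation and may differ on your distribution, and `find_missing` is a hypothetical helper name.

```python
import shutil

# Assumed command names for a typical Linux install; adjust as needed.
REQUIRED_COMMANDS = ("soffice", "unoconv", "python")


def find_missing(commands=REQUIRED_COMMANDS):
    """Return the commands from `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


missing = find_missing()
if missing:
    print("Missing from PATH: " + ", ".join(missing))
else:
    print("All required commands were found.")
```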
Installing py-unoconv-batch-recursive
To set up py-unoconv-batch-recursive, clone the repository to a convenient location on your system, such as /<somewhere-easy>/py-unoconv-batch-recursive. Execute the following command in a terminal window:
git clone https://github.com/enovision/py-unoconv-batch-recursive.git
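Navigate to the root folder containing the documents to convert (e.g., /media/somewhere/CD-Data) and run the Python script. The command below assumes the repository was cloned into /tmp; adjust the path to match your own clone location: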
python /tmp/py-unoconv-batch-recursive/recursive-pdf-converter.py --in="/media/somewhere/CD-Data"
If the --in parameter is omitted, the script uses the directory where the Python script is located as the root directory for conversion.
By default, the program processes files with the extensions docx, doc, rtf, otf, and txt. To specify other file extensions, use the --ext parameter:
--ext="doc docx yyy zzz"
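The two parameters described above could be handled along the following lines. This is a minimal Python 3 sketch using argparse, not the code of recursive-pdf-converter.py itself; the function names `build_parser` and `parse_options` and the `default_root` argument are hypothetical and only illustrate the described behavior.

```python
import argparse

# Default extensions as listed in the article.
DEFAULT_EXTENSIONS = "docx doc rtf otf txt"


def build_parser(default_root):
    """Create a parser with the --in and --ext options (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Recursive PDF conversion")
    # "in" is a Python keyword, so the value is stored under the name "root".
    parser.add_argument("--in", dest="root", default=default_root)
    parser.add_argument("--ext", default=DEFAULT_EXTENSIONS)
    return parser


def parse_options(argv, default_root):
    """Return (root_directory, set_of_extensions_with_leading_dot)."""
    args = build_parser(default_root).parse_args(argv)
    extensions = {"." + ext.lower().lstrip(".") for ext in args.ext.split()}
    return args.root, extensions


# Example call with an explicit argument list instead of sys.argv.
root, extensions = parse_options(
    ["--in=/media/somewhere/CD-Data", "--ext=doc docx yyy zzz"],
    default_root="/tmp/py-unoconv-batch-recursive",
)
print(root, sorted(extensions))
```

In a real script, `default_root` would be the directory containing the script, which matches the fallback described for an omitted --in.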
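The script traverses all subfolders below the root directory, converting each matching file and appending '.pdf' to the original filename. This prevents filename clashes: report.doc and report.docx in the same folder become report.doc.pdf and report.docx.pdf rather than both competing for report.pdf. An --out option exists, but it currently has no effect.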
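The traversal and naming scheme described above can be sketched as follows. This is an illustrative Python 3 outline under stated assumptions, not the script's actual source: `plan_conversions`, `build_command`, and `convert_all` are hypothetical names, and the sketch assumes each file is converted with unoconv's `-f` (format) and `-o` (output) options.

```python
import os
import subprocess


def plan_conversions(root, extensions):
    """Yield (source, target) pairs for every matching file below root.

    The target is the source path with '.pdf' appended, so files that
    differ only in extension never map to the same PDF name.
    """
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in extensions:
                source = os.path.join(folder, name)
                yield source, source + ".pdf"


def build_command(source, target):
    """Build the unoconv call for one file (assumed invocation)."""
    return ["unoconv", "-f", "pdf", "-o", target, source]


def convert_all(root, extensions):
    """Convert every planned file; return the sources that failed."""
    failed = []
    for source, target in plan_conversions(root, extensions):
        if subprocess.call(build_command(source, target)) != 0:
            failed.append(source)
    return failed
```

Separating the planning step from the conversion step makes it possible to preview the list of files before any conversion is started.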
Unoconv
unoconv converts between file formats from the command line. Converting with unoconv and LibreOffice ensures that the resulting PDFs are rendered as layered documents, preserving text and layout integrity.
These PDF outputs are ideal for integration with tools like ext-pdf-viewer, an Ext JS package leveraging Mozilla's pdf.js library.
Conclusion
Despite its simplicity, the Python script performed well when converting 4500 documents. On average, the process completed within 30 minutes on a standard laptop (by 2019 standards). The script waits 20 seconds to give the unoconv listener time to initialize before the conversion starts. When the conversion is finished, the listener is terminated and a "Done" message signals the end of the program.
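The listener lifecycle described above (start, wait, convert, terminate) could be expressed as a context manager. The following Python 3 sketch is an assumption about how such a step might look, not the script's actual code: the name `unoconv_listener` is hypothetical, and it assumes the listener is started with `unoconv --listener`.

```python
import subprocess
import time
from contextlib import contextmanager


@contextmanager
def unoconv_listener(startup_delay=20, command=("unoconv", "--listener")):
    """Start a listener process, wait for it to initialize, stop it on exit."""
    process = subprocess.Popen(list(command))
    try:
        # Give the listener time to come up before any conversion runs.
        time.sleep(startup_delay)
        yield process
    finally:
        # Always stop the listener, even if a conversion raised an error.
        process.terminate()
        process.wait()
```

The conversions would run inside a `with unoconv_listener():` block, followed by a `print("Done")` once the block has exited and the listener has been stopped.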