I am converting hundreds of ODT files to PDF files, and it takes a long time doing one after the other. I have a CPU with multiple cores. Is it possible to use bash or python to write a script to do these in parallel? Is there a way to parallelize (not sure if I'm using the right word) batch document conversion using libreoffice from the command line? I have been doing it in python/bash calling the following commands:
libreoffice --headless --convert-to pdf *appsmergeme.odt
OR
subprocess.call(str('cd $HOME; libreoffice --headless --convert-to pdf *appsmergeme.odt'), shell=True);
Thank you!
Tim
You can run libreoffice as a daemon/service. Please check the following link, maybe it helps you too: Daemonize the LibreOffice service
Another possibility is to use unoconv. "unoconv is a command line utility that can convert any file format that OpenOffice can import, to any file format that OpenOffice is capable of exporting."
Untested potentially valid:
You /may/ be able to:
e.g.
By using su - you won't accidentally inherit any environment variables from your real session, so the parallel processes shouldn't interfere with one another (aside from competing for resources).
Keep in mind, these are likely I/O-bound tasks, so running 1 per CPU core will probably not speed you up so very much.
I've written a program in golang to batch convert thousands of doc/xls files.
Sometimes LibreOffice fails to convert some files, so you have to open those and convert them to PDF manually. Luckily, that was only 10 out of my 16,000 documents.
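To find the files that were skipped, something like the following Python sketch might help. It assumes the PDFs are written next to the inputs with the same base name; the function name missing_pdfs and the default suffix list are made up for illustration.

```python
import os


def missing_pdfs(directory, suffixes=(".odt", ".doc", ".xls")):
    """Return input files in `directory` that have no PDF with the same base name.

    Assumption: each PDF sits in the same directory as its source file.
    """
    names = os.listdir(directory)
    # Base names (without extension) of every PDF already present.
    pdf_stems = {os.path.splitext(n)[0] for n in names if n.lower().endswith(".pdf")}
    return sorted(
        n for n in names
        if n.lower().endswith(suffixes) and os.path.splitext(n)[0] not in pdf_stems
    )
```

Running it after a batch gives the list of documents that still need a second attempt or a manual conversion.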
Since the author already introduced Python as a valid answer:
Using a thread pool instead of a process pool via multiprocessing.dummy is sufficient, because the new processes that provide real parallelism are spawned by subprocess.call() anyway.
We can set the command as well as the current working directory cwd directly. There is no need to load a shell for each file just for that. Furthermore, os.path enables cross-platform interoperability.
We had a similar problem with unoconv. unoconv internally makes use of libreoffice. We solved it by sending multiple files to unoconv in one invocation. So, instead of iterating over all files, we just partition the set of files into buckets, each bucket representing the output format. Then we make as many calls as there are buckets.
I am pretty sure libreoffice also has a similar mode.
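For what it's worth, here is a rough Python sketch that combines the two ideas above: a thread pool from multiprocessing.dummy, with each worker handing a batch of files to one libreoffice invocation via subprocess.call() and cwd. It is untested against a real LibreOffice install. The function names, the workers and batch_size defaults, and the use of a separate -env:UserInstallation profile directory per invocation are my own assumptions, not something taken from the answers above.

```python
import os
import subprocess
import tempfile
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool API
from pathlib import Path


def chunk(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_command(files, profile_dir, binary="libreoffice"):
    """Build a single conversion command that covers several input files."""
    # Assumption: giving each invocation its own profile directory keeps
    # concurrent instances from interfering with one another.
    profile_url = Path(profile_dir).as_uri()
    return [binary, "-env:UserInstallation=" + profile_url,
            "--headless", "--convert-to", "pdf"] + list(files)


def convert_batch(files, cwd, runner=subprocess.call):
    """Convert one batch; returns whatever `runner` returns (the exit code)."""
    with tempfile.TemporaryDirectory() as profile_dir:
        # No shell involved: the command is a list and cwd is set directly.
        return runner(build_command(files, profile_dir), cwd=cwd)


def convert_all(directory, suffix="appsmergeme.odt", workers=4,
                batch_size=25, runner=subprocess.call):
    """Convert every matching file in `directory`, `workers` batches at a time."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(suffix))
    batches = chunk(names, batch_size)
    with Pool(workers) as pool:
        # Threads only wait on the child processes, so a thread pool is enough.
        return pool.map(lambda batch: convert_batch(batch, directory, runner), batches)
```

Calling convert_all(os.path.expanduser("~")) would mirror the cd $HOME command from the question and return one exit code per batch. Since the work may well be I/O-bound, it is probably worth timing a few values of workers and batch_size rather than assuming one process per core is best.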
This thread is old. I tested libreoffice 4.4 and can confirm that I can run libreoffice concurrently. See my script.