Parallel Document Conversion ODT > PDF Libreoffice

Posted 2019-03-25 16:45

I am converting hundreds of ODT files to PDF files, and it takes a long time doing one after the other. I have a CPU with multiple cores. Is it possible to use bash or python to write a script to do these in parallel? Is there a way to parallelize (not sure if I'm using the right word) batch document conversion using libreoffice from the command line? I have been doing it in python/bash calling the following commands:

libreoffice --headless --convert-to pdf *appsmergeme.odt

OR

subprocess.call(str('cd $HOME; libreoffice --headless --convert-to pdf *appsmergeme.odt'), shell=True);

Thank you!

Tim

6 answers
放我归山
#2 · 2019-03-25 17:31

You can run libreoffice as a daemon/service. Please check the following link, maybe it helps you too: Daemonize the LibreOffice service

Another possibility is to use unoconv: "unoconv is a command line utility that can convert any file format that OpenOffice can import, to any file format that OpenOffice is capable of exporting."

三岁会撩人
#3 · 2019-03-25 17:33

Untested, but potentially valid:

You /may/ be able to:

  • Divide up the files into a number of parallel batches in some equitable way, e.g. placing them all in folders;
  • Create a distinct local user account to handle each folder;
  • Run LibreOffice serially within each user account, with the accounts working in parallel.

e.g.

 # start one background su session per account, then wait for all of them
 for paralleluser in timlev1 timlev2 timlev3 timlev4 ; do
      su - "$paralleluser" -c \
         "for file in /var/spool/pdfbatches/$paralleluser/* ; do \
            libreoffice --headless --convert-to pdf \"\$file\" ; done" &
 done
 wait

By using su - you won't accidentally inherit any environment variables from your real session, so the parallel processes shouldn't interfere with one another (aside from competing for resources).

Keep in mind that these are likely I/O-bound tasks, so running one per CPU core will probably not speed things up very much.
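A possible variation on the same idea, sketched in Python and just as untested against a real LibreOffice install: instead of separate user accounts, each batch gets its own throwaway profile directory through LibreOffice's -env:UserInstallation option. The function names, the round-robin split and the default of 4 batches are assumptions made for illustration, not part of the answer above.

```python
import subprocess
import tempfile
from pathlib import Path


def split_round_robin(files, n_batches):
    """Deal files into at most n_batches lists of similar size."""
    batches = [[] for _ in range(n_batches)]
    for index, name in enumerate(files):
        batches[index % n_batches].append(name)
    return [batch for batch in batches if batch]


def build_command(batch, profile_dir, outdir, binary="soffice"):
    """One LibreOffice invocation for a whole batch, with a private profile."""
    profile_uri = Path(profile_dir).resolve().as_uri()
    return [
        binary,
        "-env:UserInstallation=" + profile_uri,  # isolates this process
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(outdir),
        *[str(name) for name in batch],
    ]


def convert_in_batches(files, outdir, n_batches=4, binary="soffice"):
    """Start one process per batch, wait for all, return their exit codes."""
    with tempfile.TemporaryDirectory() as scratch:
        procs = []
        for number, batch in enumerate(split_round_robin(files, n_batches)):
            profile = Path(scratch) / f"profile{number}"
            cmd = build_command(batch, profile, outdir, binary)
            procs.append(subprocess.Popen(cmd))
        return [proc.wait() for proc in procs]
```

Something like convert_in_batches(sorted(glob.glob("*appsmergeme.odt")), outdir=".") would then start up to four LibreOffice processes at once. Whether that beats a single process depends on how I/O-bound the job really is.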

欢心
#4 · 2019-03-25 17:36

I've written a program in golang to batch convert thousands of doc/xls files.

  • set the "root" variable to the path of the documents you want to convert
  • documents that already have a PDF are skipped (to reconvert them, comment out the check in the visit() function)
  • here I'm using 4 worker goroutines (I have an Intel i3 with 4 cores). You can change the value in the main() function

Sometimes LibreOffice fails to convert a file, in which case you have to open it and export it to PDF manually. Luckily, that happened for only 10 of my 16,000 documents.

package main

import (
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "strings"
    "sync"
)

// root dir of your documents to convert
// (package-level variables need "var"; ":=" only works inside functions)
var root = "/.../conversion-from-office/"

// queue of conversion commands consumed by the workers
var tasks = make(chan *exec.Cmd, 64)

func visit(path string, f os.FileInfo, err error) error {
    if err != nil {
        // f may be nil here, so report the problem and keep walking
        fmt.Printf("Skipping %s: %v\n", path, err)
        return nil
    }
    if f.IsDir() {
        // fmt.Printf("Entering %s\n", path)
        return nil
    }

    // filepath.Ext includes the leading dot, e.g. ".odt"
    ext := filepath.Ext(path)
    if strings.ToLower(ext) == ".pdf" {
        return nil
    }

    outfile := path[0:len(path)-len(ext)] + ".pdf"

    // only convert documents that do not have a PDF next to them yet
    if _, err := os.Stat(outfile); os.IsNotExist(err) {
        fmt.Printf("Converting %s\n", path)

        outdir := filepath.Dir(path)
        tasks <- exec.Command("soffice", "--headless", "--convert-to", "pdf", path, "--outdir", outdir)
    }
    return nil
}

func main() {
    // spawn four worker goroutines
    var wg sync.WaitGroup

    // the ...; i < 4; ... sets the number of workers
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for cmd := range tasks {
                if err := cmd.Run(); err != nil {
                    fmt.Printf("Conversion failed for %v: %v\n", cmd.Args, err)
                }
            }
        }()
    }

    err := filepath.Walk(root, visit)
    fmt.Printf("filepath.Walk() returned %v\n", err)

    close(tasks)

    // wait for the workers to finish
    wg.Wait()
}
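
For readers who would rather stay in Python, as the question does, the skip-if-already-converted check from visit() could be sketched roughly like this. The function name pending_documents is made up for illustration, and the sketch only lists candidate files; it does not run any conversion.

```python
import os
from pathlib import Path


def pending_documents(root):
    """Yield files under root that are not PDFs and have no PDF sibling yet."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            source = Path(dirpath) / name
            # suffix includes the leading dot, e.g. ".odt"
            if source.suffix.lower() == ".pdf":
                continue
            # same-name file with a .pdf extension means "already converted"
            if not source.with_suffix(".pdf").exists():
                yield source
```

Running it again after a conversion pass is one way to get the list of documents that LibreOffice did not manage to convert.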
迷人小祖宗
#5 · 2019-03-25 17:38

Since the asker already mentioned Python as an option:

import subprocess
import os, glob
from multiprocessing.dummy import Pool    # wrapper around the threading module

def worker(fname, dstdir=os.path.expanduser("~")):
    subprocess.call(["libreoffice", "--headless", "--convert-to", "pdf", fname],
                    cwd=dstdir)

pool = Pool()
pool.map(worker, glob.iglob(
        os.path.join(os.path.expanduser("~"), "*appsmergeme.odt")
    ))

A thread pool (multiprocessing.dummy) is sufficient here instead of a process pool, because the processes that provide the real parallelism are spawned by subprocess.call() anyway.

We can set the command as well as the current working directory cwd directly. No need to load a shell for each file for just doing that. Furthermore, os.path enables cross-platform interoperability.
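
If you also want to know which files failed, a variant along these lines might help. It is only a sketch: it swaps multiprocessing.dummy for concurrent.futures.ThreadPoolExecutor, and the names convert_one and convert_all, the 300-second timeout and the 4 workers are my own assumptions. It also assumes that a non-zero exit code signals a failed conversion, which is worth verifying on your LibreOffice version.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# base command; kept as a parameter so it can be swapped out (assumption)
LIBREOFFICE = ("libreoffice", "--headless", "--convert-to", "pdf")


def convert_one(fname, outdir, base_cmd=LIBREOFFICE, timeout=300):
    """Convert one file; return (fname, exit code), or (fname, None) on timeout."""
    cmd = [*base_cmd, "--outdir", str(outdir), str(fname)]
    try:
        return fname, subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return fname, None


def convert_all(fnames, outdir, workers=4, base_cmd=LIBREOFFICE):
    """Run conversions on a thread pool and return the files that failed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda f: convert_one(f, outdir, base_cmd), fnames)
        return [fname for fname, code in results if code != 0]
```

The list returned by convert_all() can be retried or handled by hand afterwards.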

Emotional °昔
#6 · 2019-03-25 17:43

We had a similar problem with unoconv, which uses LibreOffice internally. We solved it by sending multiple files to unoconv in one invocation. Instead of iterating over all files, we partition the set of files into buckets, one bucket per output format, and then make as many calls as there are buckets.

I am pretty sure libreoffice also has a similar mode.
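
The bucketing step might look roughly like the following in Python. This is a sketch with made-up names (bucket_by_format, commands_for) that only builds the command lines. It relies on soffice accepting several input files after --convert-to, which is what the glob in the question already expands to.

```python
from collections import defaultdict


def bucket_by_format(jobs):
    """Group (filename, target_format) pairs into {target_format: [filenames]}."""
    buckets = defaultdict(list)
    for fname, target in jobs:
        buckets[target].append(fname)
    return dict(buckets)


def commands_for(jobs, binary="soffice"):
    """Build one command line per bucket instead of one per file."""
    return [
        [binary, "--headless", "--convert-to", target, *fnames]
        for target, fnames in bucket_by_format(jobs).items()
    ]
```

Each resulting list can be passed to subprocess.call(), so the number of LibreOffice start-ups equals the number of output formats rather than the number of files.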

Fickle 薄情
#7 · 2019-03-25 17:47

This thread (or answer) is old. I tested LibreOffice 4.4 and can confirm that it can run concurrently. See my script:

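for odt in test*odt ; do
    echo "$odt"
    # start each conversion in the background
    soffice --headless --convert-to pdf "$odt" &
    # show the office processes running so far
    ps -ef | grep ffice
done
# block until every background conversion has finished
wait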
