ImageMagick parallel conversion

Question:

I want to get a screenshot of each page of a PDF as a JPG. To do this I am using ImageMagick's convert command on the command line.

I need to achieve the following:

  1. Get a screenshot of each page of the PDF file.
  2. Resize each screenshot into 3 different sizes (small, med and preview).
  3. Store the different sizes in different folders (small, med and preview).

I am using the following command, which works; however, it is slow. How can I improve its execution time or execute the commands in parallel?

convert -density 400 -quality 100 /input/test.pdf -resize '170x117>' -scene 1 /small/test_%d_small.jpg & convert -density 400 -quality 100 /input/test.pdf -resize '230x160>' -scene 1 /med/test_%d_med.jpg & convert -density 400 -quality 100 /input/test.pdf -resize '1310x650>' -scene 1 /preview/test_%d_preview.jpg

Splitting the command for readability

convert -density 400 -quality 100 /input/test.pdf -resize '170x117>' -scene 1 /small/test_%d_small.jpg

convert -density 400 -quality 100 /input/test.pdf -resize '230x160>' -scene 1 /med/test_%d_med.jpg

convert -density 400 -quality 100 /input/test.pdf -resize '1310x650>' -scene 1 /preview/test_%d_preview.jpg
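
The > in geometries like 170x117> is special to the shell, so it has to be quoted or escaped as above, otherwise the shell treats it as output redirection. Run from a script, something roughly like this sketch (same paths as above) keeps the three conversions running concurrently and blocks until they have all finished:

#!/bin/bash
# Run the three conversions in the background, concurrently
convert -density 400 -quality 100 /input/test.pdf -resize '170x117>'  -scene 1 /small/test_%d_small.jpg &
convert -density 400 -quality 100 /input/test.pdf -resize '230x160>'  -scene 1 /med/test_%d_med.jpg &
convert -density 400 -quality 100 /input/test.pdf -resize '1310x650>' -scene 1 /preview/test_%d_preview.jpg &
# Wait for all three background jobs to finish
wait

Each convert still reads and rasterises the PDF at 400 dpi separately though, which is a large part of why this is slow.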

Answer 1:

Updated Answer

I see you have long, multi-page documents, and while my original answer is good for making multiple sizes of a single page quickly, it doesn't address doing pages in parallel. So, here is a way of doing it using GNU Parallel, which is available for free on OS X (via homebrew), comes installed on most Linux distros, and is also available for Windows - if you really must.

The code looks like this:

#!/bin/bash

# Expand patterns that match nothing to an empty list, and match case-insensitively
shopt -s nullglob
shopt -s nocaseglob

doPage(){
   # Expecting filename as first parameter and page number as second
   # echo DEBUG: File: $1 Page: $2
   # Strip the extension so we can build the output filenames
   noexten=${1%.*}
   # Rasterise the requested page once, then write successively smaller sizes
   convert -density 400 -quality 100 "$1[$2]"     \
      -resize 1310x650 -write "${noexten}-p-$2-large.jpg" \
      -resize 230x160  -write "${noexten}-p-$2-med.jpg"   \
      -resize 170x117  "${noexten}-p-$2-small.jpg"
}

export -f doPage

# First, get list of all PDF documents
for d in *.pdf; do
   # Now get number of pages in this document - "pdfinfo" is probably quicker
   p=$(identify "$d" | wc -l)
   for ((i=0; i<p; i++)); do
      echo "$d:$i"
   done
done | parallel --eta --colsep ':' doPage {1} {2}

If you want to see how it works, remove the | parallel ... from the last line and you will see that the preceding loop just echoes a list of filenames and page numbers into GNU Parallel. It will then run one process per CPU core, unless you specify, say, -j 8 if you want 8 processes to run in parallel. Remove the --eta if you don't want any updates on when the command is likely to finish.
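
For example, the last line could be adjusted like this - assuming a reasonably recent GNU Parallel, where --dry-run just prints the commands it would run without executing them:

# Limit GNU Parallel to 8 simultaneous jobs
... | parallel -j 8 --eta --colsep ':' doPage {1} {2}

# Print the commands GNU Parallel would run, without actually running them
... | parallel --dry-run --colsep ':' doPage {1} {2}

Here ... stands for the for loop in the script above.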

In the comment in the script I allude to pdfinfo being faster than identify. If you have it available (it's part of the poppler package under homebrew on OS X), you can use this to get the number of pages in a PDF:

pdfinfo SomeDocument.pdf | awk '/^Pages:/ {print $2}'
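
So, if pdfinfo is installed, the page-listing loop from the script above could become something like this sketch:

# First, get list of all PDF documents
for d in *.pdf; do
   # Count the pages with pdfinfo instead of identify
   p=$(pdfinfo "$d" | awk '/^Pages:/ {print $2}')
   for ((i=0; i<p; i++)); do
      echo "$d:$i"
   done
done | parallel --eta --colsep ':' doPage {1} {2}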

Original Answer

Untested, but something along these lines, so you only read the PDF in once and then generate successively smaller images from the largest one:

convert -density 400 -quality 100 x.pdf \
   -resize 1310x650 -write large.jpg    \
   -resize 230x160  -write medium.jpg   \
   -resize 170x117  small.jpg

Unless you mean you have, say, a 50-page PDF and you want to do all 50 pages in parallel. If you do, say so, and I'll show you that using GNU Parallel when I get up in 10 hours...
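
For completeness, here is a sketch of the same single-read idea using the question's folder layout and per-page %d filenames - untested, and it assumes -scene and the %d escape apply to the -write outputs the same way they do to the final output filename:

convert -density 400 -quality 100 /input/test.pdf -scene 1 \
   -resize 1310x650 -write /preview/test_%d_preview.jpg    \
   -resize 230x160  -write /med/test_%d_med.jpg            \
   -resize 170x117  /small/test_%d_small.jpg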