I want to get a screenshot of each page of a PDF as a JPG. To do this I am using ImageMagick's convert command on the command line.
I have to achieve the following:
- Get screenshots of each page of the PDF file.
- Resize each screenshot into 3 different sizes (small, med and preview).
- Store the different sizes in different folders (small, med and preview).
I am using the following command, which works, but it is slow. How can I improve its execution time or run the commands in parallel?
convert -density 400 -quality 100 /input/test.pdf -resize "170x117>" -scene 1 /small/test_%d_small.jpg & convert -density 400 -quality 100 /input/test.pdf -resize "230x160>" -scene 1 /med/test_%d_med.jpg & convert -density 400 -quality 100 /input/test.pdf -resize "1310x650>" -scene 1 /preview/test_%d_preview.jpg
Splitting the command for readability:
convert -density 400 -quality 100 /input/test.pdf -resize "170x117>" -scene 1 /small/test_%d_small.jpg
convert -density 400 -quality 100 /input/test.pdf -resize "230x160>" -scene 1 /med/test_%d_med.jpg
convert -density 400 -quality 100 /input/test.pdf -resize "1310x650>" -scene 1 /preview/test_%d_preview.jpg
Updated Answer
I see you have long, multi-page documents, and while my original answer is good for making multiple sizes of a single page quickly, it doesn't address doing pages in parallel. So, here is a way of doing it using GNU Parallel, which is available for free for OS X (via homebrew), is available in the package repositories of most Linux distros, and is also available for Windows - if you really must.
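Installation is typically a one-liner; the commands below are the usual ones, though package names can vary by distro:
brew install parallel          # OS X, using homebrew
sudo apt-get install parallel  # Debian/Ubuntu and derivatives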
The code looks like this:
#!/bin/bash
shopt -s nullglob
shopt -s nocaseglob
doPage(){
   # Expecting filename as first parameter and page number as second
   # echo DEBUG: File: $1 Page: $2
   noexten=${1%%.*}
   convert -density 400 -quality 100 "$1[$2]"             \
      -resize 1310x650 -write "${noexten}-p-$2-large.jpg" \
      -resize 230x160  -write "${noexten}-p-$2-med.jpg"   \
      -resize 170x117          "${noexten}-p-$2-small.jpg"
}
export -f doPage
# First, get list of all PDF documents
for d in *.pdf; do
   # Now get number of pages in this document - "pdfinfo" is probably quicker
   p=$(identify "$d" | wc -l)
   for ((i=0;i<$p;i++)); do
      echo $d:$i
   done
done | parallel --eta --colsep ':' doPage {1} {2}
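To run it, save the script, make it executable and invoke it from the directory containing your PDFs - the name doPages.sh below is just an example; the output JPGs land next to the source files:
chmod +x doPages.sh
./doPages.sh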
If you want to see how it works, remove the | parallel .... from the last line and you will see that the preceding loop just echoes a list of filenames and page-number counters into GNU Parallel. It will then run one process per CPU core, unless you specify, say, -j 8 if you want 8 processes to run in parallel. Remove the --eta if you don't want any updates on when the command is likely to finish.
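For instance, to cap the run at 8 parallel jobs and drop the progress estimate, the last line of the script would become:
done | parallel -j 8 --colsep ':' doPage {1} {2}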
In the comment in the script I allude to pdfinfo being faster than identify. If you have it available (it's part of the poppler package under homebrew on OS X), you can use this to get the number of pages in a PDF:
pdfinfo SomeDocument.pdf | awk '/^Pages:/ {print $2}'
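Swapping that into the script above is a one-line change - an untested sketch:
p=$(pdfinfo "$d" | awk '/^Pages:/ {print $2}')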
Original Answer
Untested, but something along these lines so you only read it in once and then generate successively smaller images from the largest one:
convert -density 400 -quality 100 x.pdf \
   -resize 1310x650 -write large.jpg    \
   -resize 230x160  -write medium.jpg   \
   -resize 170x117          small.jpg
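Applied to the paths and page numbering from your question, that would look something like this - untested, and it assumes the /small, /med and /preview folders already exist:
convert -density 400 -quality 100 /input/test.pdf -scene 1 \
   -resize "1310x650>" -write /preview/test_%d_preview.jpg \
   -resize "230x160>"  -write /med/test_%d_med.jpg         \
   -resize "170x117>"          /small/test_%d_small.jpg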
Unless you mean you have, say, a 50-page PDF and you want to do all 50 pages in parallel. If you do, say so, and I'll show you that using GNU Parallel when I get up in 10 hours...