How can I tell the resolution of scanned PDF from

Posted 2019-03-13 06:27

I have a large collection of documents scanned into PDF format, and I wish to write a shell script that will convert each document to DjVu format. Some documents were scanned at 200dpi, some at 300dpi, and some at 600dpi. Since DjVu is a pixel-based format, I want to be sure I use the same resolution in the target DjVu file as was used for the scan.

Does anyone know what program I can run, or how I can write a program, to determine what resolution was used to produce a scanned PDF? (Number of pixels might work too as almost all documents are 8.5 by 11 inches.)


Clarification after responses: I'm aware of the difficulties highlighted by Breton, and I'm willing to concede that the problem in general is ill-posed, but I'm not asking about general PDF documents. My particular documents came out of a scanner. They contain one scanned image per page, same resolution each page. If I convert the PDF to PostScript I can poke around by hand and find pixel dimensions easily; I could probably find image sizes with more work. And if in desperate need I could modify the dictionary stack that gs is using; long ago, I wrote an interpreter for PostScript Level 1.

All of that is what I'm trying to avoid.


Thanks to help received, I've posted an answer below:

  1. Extract the bounding box from the PDF using identify, taking only the output for the first page, and understanding that the units will be PostScript points, of which there are 72 to an inch.
  2. Extract images from the first page using pdfimages.
  3. Get height and width of image. This time identify will give number of pixels.
  4. Add up the areas of all the images to get the total number of square dots.
  5. To get the resolution, compute the area of the bounding box in square inches, divide square dots by square inches, take the square root, and round to the nearest multiple of 10.
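
The arithmetic in steps 4 and 5 can be sketched as a small shell function. This is only an illustration under assumptions: `dpi_from_areas` is a made-up name, it takes the page size in points followed by any number of image width/height pairs in pixels, and it leans on `awk` for the floating-point work.

```shell
#!/bin/sh
# dpi_from_areas PAGE_W_PT PAGE_H_PT IMG_W_PX IMG_H_PX [IMG_W_PX IMG_H_PX ...]
# Hypothetical helper: prints the resolution rounded to a multiple of 10.
dpi_from_areas() {
  awk 'BEGIN {
    # Page area in square inches (72 PostScript points per inch).
    sq_in = ARGV[1] * ARGV[2] / (72 * 72)
    # Total image area in square dots, summed over all width/height pairs.
    sq_dots = 0
    for (i = 3; i + 1 < ARGC; i += 2) sq_dots += ARGV[i] * ARGV[i + 1]
    dpi = sqrt(sq_dots / sq_in)
    # Round to the nearest multiple of 10.
    printf "%d\n", int(dpi / 10 + 0.5) * 10
  }' "$@"
}
```

For a letter-size page (612 by 792 points, i.e. 93.5 square inches) holding one 2550 by 3300 pixel image, `dpi_from_areas 612 792 2550 3300` prints 300.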

Full answer with script is below. I'm using it in live fire and it works great. Thanks Harlequin for pdfimages and Spiffeah for the alert about multiple images per page (it's rare, but I've found some).

Tags: pdf shell
7 answers
Fickle 薄情
Reply #2 · 2019-03-13 07:14

Here are the elements to this answer:

  • pdfimages will extract images so that the number of dots can be discovered.
  • identify, run on the PDF, will give the size of the page in units of PostScript points (72 to the inch); run on an extracted image, it gives the size in pixels.
  • Because some scanners may split a single page into multiple images of varying sizes and shapes, the key is to add up the areas of all the images. Dividing square dots by square inches and taking the square root produces the answer.
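
For readers who want to see how the measurements are gathered, here is a rough plain-shell sketch of the same elements. The names `sum_areas` and `exact_dpi` are made up for illustration, the `pdfimages` and `identify` invocations mirror the ones used in the script further down, and the sketch assumes both tools plus `mktemp` and `awk` are installed. Error handling is minimal.

```shell
#!/bin/sh
# sum_areas: read "WIDTH HEIGHT" lines on stdin, print the total area.
sum_areas() {
  awk '{ total += $1 * $2 } END { printf "%.0f\n", total }'
}

# exact_dpi PDF: print the unrounded resolution of the first page of PDF.
exact_dpi() {
  pdf=$1
  tmp=$(mktemp -d) || return 1
  # Page area in square points; identify prints one line per page,
  # so keep only the first.
  page_pt2=$(identify -format '%w %h\n' "$pdf" | sed 1q | sum_areas)
  # Extract every image on page 1, then total their areas in pixels.
  pdfimages -f 1 -l 1 "$pdf" "$tmp/img" >&2
  dots2=$(identify -format '%w %h\n' "$tmp"/img* | sum_areas)
  rm -rf "$tmp"
  # 5184 = 72 * 72 converts square points to square inches.
  awk -v d="$dots2" -v p="$page_pt2" 'BEGIN { print sqrt(d / (p / 5184)) }'
}
```

Calling `exact_dpi scan.pdf` (with `scan.pdf` standing in for a real file) prints the unrounded figure; rounding to a multiple of 10 is left to the caller.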

Below is a Lua script that solves the problem. I probably could have used a plain shell, but capturing the width and height would have been a greater nuisance.

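#!/usr/bin/env lua

-- These three are not standard Lua modules. The script relies on them for
-- os.quote, os.capture, posix.isdir, posix.glob, and math.round.
require 'osutil'
require 'posixutil'
require 'mathutil'

-- Format a shell command and run it.
local function runf(...) return os.execute(string.format(...)) end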

assert(arg[1], "no file on command line")

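-- Return the width and height that identify reports for filename
-- (first page or frame only). Units are points for a PDF, pixels for an image.
local function dimens(filename)
  local cmd = [[identify -format "return %w, %h\n" $file | sed 1q]]
  -- Use a function as the replacement so that a '%' in the quoted
  -- filename is not treated as a gsub escape.
  cmd = cmd:gsub('%$file', function() return os.quote(filename) end)
  local w, h = assert(loadstring(os.capture(cmd)))()
  assert(w and h)
  return w, h
end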

assert(#arg == 1, "dpi of just one file")

for _, pdf in ipairs(arg) do
  local w, h = dimens(pdf)  -- units are points
  local insquared = w * h / (72.00 * 72.00)
  local imagedir = os.capture 'mktemp -d'
  assert(posix.isdir(imagedir))
  runf('pdfimages -f 1 -l 1 %s %s 1>&2', os.quote(pdf),
                                         os.quote(imagedir .. '/img'))
  local dotsquared = 0
  for file in posix.glob(imagedir .. '/img*') do
    local w, h = dimens(file)  -- units are pixels
    dotsquared = dotsquared + w * h
  end
  os.execute('rm -rf ' .. os.quote(imagedir))
  local dpi = math.sqrt(dotsquared / insquared)

  if true then
    io.stderr:write(insquared, " square inches\n")
    io.stderr:write(dotsquared, " square dots\n")
    io.stderr:write(dpi, " exact dpi\n")
    io.stderr:write(math.round(dpi, 10), " rounded dpi\n")
  end
  print(math.round(dpi, 10))
end
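
A usage sketch for running the script over a whole collection, assuming it has been saved as an executable file; the name `pdfdpi.lua`, the `DPI_CMD` variable, and the `list_dpi` function are all made up for this example. It relies only on the behaviour visible above: the rounded figure goes to stdout and the diagnostics go to stderr.

```shell
#!/bin/sh
# DPI_CMD is an assumption: wherever the Lua script above was saved.
DPI_CMD=${DPI_CMD:-./pdfdpi.lua}

# list_dpi FILE...: print "FILE<TAB>DPI" for each PDF given.
list_dpi() {
  for pdf in "$@"; do
    # Discard the diagnostics on stderr; stdout carries just the number.
    dpi=$("$DPI_CMD" "$pdf" 2>/dev/null) || {
      echo "failed: $pdf" >&2
      continue
    }
    printf '%s\t%s\n' "$pdf" "$dpi"
  done
}
```

Something like `list_dpi *.pdf` then gives one line per document, which a conversion loop can read to choose the target resolution.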