Extract a region of a PDF page by coordinates

Posted 2019-03-16 18:44

Question:

I am looking for a tool to extract a given rectangular region (by coordinates) of a 1-page PDF file and produce a 1-page PDF file with the specified region:

# file.pdf is a 1-page pdf file
extract file.pdf 0 0 100 100 > out.pdf
# out.pdf is now a 1-page pdf file with a page of size 100x100
# it contains the region (0, 0) to (100, 100) of file.pdf

I could convert the PDF to an image and use convert, but the resulting PDF would then be rasterized rather than vector-based, which is not acceptable (I want to be able to zoom).

I would ideally like to perform this task with a command-line tool or a Python library.

Thanks!

Answer 1:

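The following script, found at http://snipplr.com/view.php?codeview&id=18924, splits each page of a PDF into two pages (left half and right half) by adjusting the page boxes. The same approach works for an arbitrary rectangle.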

#!/usr/bin/env perl
use strict; use warnings;
use PDF::API2;

my $filename = shift;
my $oldpdf = PDF::API2->open($filename);
my $newpdf = PDF::API2->new;

for my $page_nb (1..$oldpdf->pages) {
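  my ($page, @cropdata);

  # Left half: import the page and move the right edge to the midpoint.
  $page = $newpdf->importpage($oldpdf, $page_nb);
  @cropdata = $page->get_mediabox;   # (llx, lly, urx, ury)
  $cropdata[2] = ($cropdata[0] + $cropdata[2]) / 2;
  $page->cropbox(@cropdata);
  $page->trimbox(@cropdata);
  $page->mediabox(@cropdata);

  # Right half: import the same page again and move the left edge to the midpoint.
  $page = $newpdf->importpage($oldpdf, $page_nb);
  @cropdata = $page->get_mediabox;
  $cropdata[0] = ($cropdata[0] + $cropdata[2]) / 2;
  $page->cropbox(@cropdata);
  $page->trimbox(@cropdata);
  $page->mediabox(@cropdata);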
}

# Save next to the input, e.g. file.pdf -> file.clean.pdf
(my $newfilename = $filename) =~ s/(.*)\.(\w+)$/$1.clean.$2/;
$newpdf->saveas($newfilename);
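
The box arithmetic behind the split can be sketched in a few lines of Python. This is only an illustration: `split_left_right` is a hypothetical helper, not part of PDF::API2 or any PDF library. It takes the midpoint between the left and right edges, so it also works when the media box does not start at x = 0.

```python
def split_left_right(box):
    """Split a (llx, lly, urx, ury) box into its left and right halves."""
    llx, lly, urx, ury = box
    mid = (llx + urx) / 2  # midpoint between the left and right edges
    left = (llx, lly, mid, ury)
    right = (mid, lly, urx, ury)
    return left, right


# US Letter page, 612 x 792 points
left, right = split_left_right((0, 0, 612, 792))
print(left)   # (0, 0, 306.0, 792)
print(right)  # (306.0, 0, 612, 792)
```

Each resulting tuple is what would be passed to cropbox, trimbox and mediabox for one of the two output pages.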


Answer 2:

Using pyPdf, you could do something like this:

import sys
import pyPdf

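def extract(in_file, coords, out_file):
    """Write the first page of in_file to out_file, showing only coords.

    coords is [llx, lly, urx, ury]: the lower-left and upper-right corners.
    """
    with open(in_file, 'rb') as infp:
        reader = pyPdf.PdfFileReader(infp)
        page = reader.getPage(0)
        writer = pyPdf.PdfFileWriter()
        # Shrink the media box to the requested region; the content stays vector.
        page.mediaBox.lowerLeft = coords[:2]
        page.mediaBox.upperRight = coords[2:]
        # you could do the same for page.trimBox and page.cropBox
        writer.addPage(page)
        # Write while the input file is still open: the page is read lazily.
        with open(out_file, 'wb') as outfp:
            writer.write(outfp)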

if __name__ == '__main__':
    in_file = sys.argv[1]
    coords = [int(i) for i in sys.argv[2:6]]
    out_file = sys.argv[6]

    extract(in_file, coords, out_file)
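
pyPdf is no longer maintained; its successor is the pypdf package. A rough equivalent might look like the sketch below, assuming pypdf is installed. `normalize_box` and `crop_first_page` are names made up for this example. Note that setting the boxes only hides everything outside the region; the hidden content is still stored in the output file.

```python
def normalize_box(x0, y0, x1, y1):
    """Return (llx, lly, urx, ury) whatever order the corners were given in."""
    llx, urx = sorted((x0, x1))
    lly, ury = sorted((y0, y1))
    if llx == urx or lly == ury:
        raise ValueError("region has zero width or height")
    return (llx, lly, urx, ury)


def crop_first_page(in_path, box, out_path):
    """Write a 1-page PDF that shows only `box` of the first page of in_path."""
    # Third-party dependency (assumed installed); imported here so that
    # normalize_box can be used without it.
    from pypdf import PdfReader, PdfWriter
    from pypdf.generic import RectangleObject

    writer = PdfWriter()
    page = writer.add_page(PdfReader(in_path).pages[0])
    # Set every box a viewer might consult to the same rectangle.
    page.mediabox = RectangleObject(box)
    page.cropbox = RectangleObject(box)
    page.trimbox = RectangleObject(box)
    writer.write(out_path)


# Example call (file names are placeholders):
# crop_first_page("file.pdf", normalize_box(0, 0, 100, 100), "out.pdf")
```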