PDF real cropping

Posted 2019-09-01 07:31

Question:

I need to crop a PDF document from the Linux shell and then extract only the text inside the cropped area.

My idea was to crop the PDF with the pdfcrop Linux tool and then use a txt2pdf text extractor tool to extract only the text in the cropped area. But I've realized that I was thinking in terms of images: when I try this, the result is the same as running the extraction on the original, uncropped PDF.

I guess it's a layer problem. Since the PDF format works with layers, if I don't "crop" all the layers, the result will include the information from every layer, which I don't want.

I would really appreciate any idea of how to do a real "all layers" crop of a PDF. I would also like to know whether this is possible at all, or whether I should start thinking about another solution.

Answer 1:

It's not layers. Cropping a PDF usually just means setting the CropBox, which doesn't alter the actual contents of the PDF (other than the CropBox) at all. Most text extraction code will ignore the CropBox and extract all the text.

You could, with some effort, use Ghostscript to produce a genuinely cropped PDF (though note that partially cropped glyphs will still be included) and then extract the text from that. But that's pretty ugly.

Alternatively, Ghostscript and MuPDF can both extract text with coordinate information, which may be enough for your needs: you can then keep only the text whose coordinates fall inside the area you care about.
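
To make the coordinate-based approach concrete, here is a minimal Python sketch of the filtering step only. It assumes you have already turned the extractor's output into (x0, y0, x1, y1, text) entries; the Word type, the words_in_box function and the sample numbers are all made up for illustration and are not part of Ghostscript or MuPDF. The real output format of those tools differs, so you would need a small parser in front of this.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A piece of text plus its bounding box (hypothetical shape, not a tool format)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def words_in_box(words, box, fully_inside=True):
    """Return the words that fall inside box = (x0, y0, x1, y1).

    With fully_inside=False, words that merely overlap the box are kept too,
    which mimics the "partially cropped glyphs are still included" behaviour.
    """
    bx0, by0, bx1, by1 = box
    kept = []
    for w in words:
        if fully_inside:
            ok = w.x0 >= bx0 and w.y0 >= by0 and w.x1 <= bx1 and w.y1 <= by1
        else:
            # Standard rectangle-overlap test.
            ok = w.x0 < bx1 and w.x1 > bx0 and w.y0 < by1 and w.y1 > by0
        if ok:
            kept.append(w)
    return kept


# Made-up sample data standing in for parsed extractor output.
sample = [
    Word(50, 50, 120, 62, "Header"),
    Word(100, 200, 140, 212, "inside"),
    Word(290, 200, 330, 212, "straddling"),
    Word(400, 700, 450, 712, "Footer"),
]
crop = (90, 150, 300, 400)  # the area you would otherwise have cropped to

print(" ".join(w.text for w in words_in_box(sample, crop)))
print(" ".join(w.text for w in words_in_box(sample, crop, fully_inside=False)))
```

Make sure the crop rectangle uses the same coordinate system as the extractor. PDF user space has its origin at the bottom-left of the page, while some tools report positions measured from the top-left, so the y values may need flipping before you compare them.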