I need to compare the contents of two nearly identical files and highlight the differing portions in the corresponding PDF file. I am using PDFBox. Please help me at least with the logic.
I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).
Of course, you need to install the ImageMagick bindings first:
I have built a JAR on top of Apache PDFBox to compare PDF files. It can compare them pixel by pixel and highlight the differences. See my blog for examples and the download: http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/
To get page count
To get page content as plain text
To extract attached images from PDF
To store PDF pages as images
To compare PDF files in text mode (faster, but it does not compare the formatting, images, etc. in the PDF)
To compare PDF files in binary mode (slower; compares the PDF documents pixel by pixel, highlights the differences and stores the result as an image)
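The pixel-by-pixel idea can be sketched with nothing but the JDK, assuming each PDF page has already been rendered to a `BufferedImage` (for example with PDFBox's renderer, which is not shown here). The class and method names below are hypothetical and are not the API of the JAR mentioned above.

```java
import java.awt.image.BufferedImage;

class PageImageDiff {
    static final int HIGHLIGHT = 0xFFFF0000; // opaque red in ARGB

    // Returns a copy of page 'a' in which every pixel that differs from 'b' is painted red.
    static BufferedImage highlightDifferences(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("page images must have the same dimensions");
        }
        BufferedImage out =
                new BufferedImage(a.getWidth(), a.getHeight(), BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                int pa = a.getRGB(x, y);
                out.setRGB(x, y, pa == b.getRGB(x, y) ? pa : HIGHLIGHT);
            }
        }
        return out;
    }

    // Counts the pixels that differ between two equally sized page images.
    static int countDifferences(BufferedImage a, BufferedImage b) {
        int count = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two tiny stand-in "pages"; real input would be rendered PDF pages.
        BufferedImage a = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        b.setRGB(2, 1, 0xFFFFFF); // simulate one changed pixel
        System.out.println("differing pixels: " + countDifferences(a, b));
    }
}
```

Both pages need to be rendered at the same resolution for this to work, and an exact pixel comparison is sensitive to anti-aliasing, so a tolerance per colour channel may be needed in practice.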
If you prefer a tool with a GUI, you could try this one: diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or at least buildable) on all platforms where Qt runs.
You can do the same thing with a shell script on Linux. The script wraps 3 components:
- ImageMagick's compare command
- the pdftk utility
- Ghostscript
It's rather easy to translate this into a .bat batch file for DOS/Windows...
Here are the building blocks:
pdftk
Use this command to split multi-page PDF files into multiple single-page PDFs:
compare
Use this command to create a "diff" PDF page for each of the pages:
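As a rough sketch, the split and per-page compare steps might be launched from Java (the language of the question) along the following lines. The file names and the `PdfDiffCommands` class are made up for illustration. The arguments shown are pdftk's `burst` operation and ImageMagick's `compare` with `-compose src`, and both tools are assumed to be installed and on the PATH.

```java
import java.io.IOException;
import java.util.List;

class PdfDiffCommands {
    // pdftk "burst" writes one single-page PDF per input page;
    // a printf-style pattern such as %03d receives the page number.
    static List<String> burstCommand(String inputPdf, String pagePattern) {
        return List.of("pdftk", inputPdf, "burst", "output", pagePattern);
    }

    // ImageMagick "compare" writes a diff page for two single-page inputs.
    static List<String> compareCommand(String pageA, String pageB, String diffPage) {
        return List.of("compare", pageA, pageB, "-compose", "src", diffPage);
    }

    // Runs a command, forwarding its output to this process, and returns its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    public static void main(String[] args) {
        // Dry run: only print the command lines; pass a list to run(...) to execute it.
        System.out.println(String.join(" ", burstCommand("first.pdf", "first_page_%03d.pdf")));
        System.out.println(String.join(" ", burstCommand("2nd.pdf", "2nd_page_%03d.pdf")));
        System.out.println(String.join(" ",
                compareCommand("first_page_001.pdf", "2nd_page_001.pdf", "diff_page_001.pdf")));
    }
}
```

Building each command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.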
Note that compare is part of ImageMagick. For PDF processing, however, it needs Ghostscript as a 'delegate', because it cannot handle PDF natively.
Once more, pdftk
Now you can concatenate your "diff" PDF pages again with pdftk:
Ghostscript
Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. Therefore its PDF output does not work well for MD5-hash-based file comparisons.
If you want to automatically discover all cases that consist of purely white pages (that is, there are no visible differences between your input pages), you could also convert to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf) or for the diff PDF pages. Then create an all-white BMP page and compute its MD5 sum for reference; any page whose sum matches it contains no visible differences.
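The same reference-hash idea can be sketched in Java. Note the swap: instead of hashing a BMP file written by Ghostscript, this version hashes the decoded pixels of an in-memory image, so file-level metadata cannot interfere. It assumes the diff page has already been loaded or rendered into a `BufferedImage`, and the `WhitePageCheck` class is hypothetical.

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class WhitePageCheck {
    // MD5 over the ARGB pixel values only, row by row.
    static String md5OfPixels(BufferedImage img) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        ByteBuffer row = ByteBuffer.allocate(img.getWidth() * 4);
        for (int y = 0; y < img.getHeight(); y++) {
            row.clear();
            for (int x = 0; x < img.getWidth(); x++) {
                row.putInt(img.getRGB(x, y));
            }
            md.update(row.array());
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Builds the all-white reference page of the given size.
    static BufferedImage whitePage(int width, int height) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                img.setRGB(x, y, 0xFFFFFF);
            }
        }
        return img;
    }

    // A page is "blank" when its pixel hash equals that of a white page of the same size.
    static boolean isBlank(BufferedImage page) {
        String reference = md5OfPixels(whitePage(page.getWidth(), page.getHeight()));
        return reference.equals(md5OfPixels(page));
    }

    public static void main(String[] args) {
        BufferedImage page = whitePage(8, 8);
        System.out.println("blank before change: " + isBlank(page));
        page.setRGB(3, 3, 0xFF0000); // one red "difference" pixel
        System.out.println("blank after change: " + isBlank(page));
    }
}
```

For a one-off check, scanning the pixels for any non-white value would be simpler; the hash is mainly useful when the reference sum is computed once and compared against many pages of the same size.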