How to compare two PDFs based on visual difference

Posted 2019-07-01 11:05

Question:

I need to compare two PDF files and get all the visual differences between them. I know there are some related questions on Stack Overflow, but they do not meet my needs.

I'm currently using PDFBox to generate an image for each page of the PDFs and comparing the bytes of the images.

With this approach I can tell that a particular page differs.

But I need finer details, such as the font size of some text. For example, I want to know that "The text" differs on, say, page 6 of the PDFs.

This applies not only to text: I need to catch all visual differences, such as images, text in charts, etc.

Please suggest a way to achieve this.

PS: I tried Apache Tika, but my impression is that it can only be used to get structured text in XHTML plus metadata. Fine details such as font size and font weight do not appear in the structured text. Please correct me if I'm wrong.

Answer 1:

PDF to image using Java

Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)

https://www.google.com.br/search?q=PixelGraber&ie=utf-8&oe=utf-8&rls=org.mozilla:pt-BR:official&client=firefox-a&gws_rd=cr&ei=K1PhUqD2Jei0sQTQs4DoAw

A good library for converting PDF to TIFF?

Convert jpeg/png to an array of pixels in java

int pixels array to bmp in java

Finding pixel position

Get Pixel Color around an image

For extraction of text using PDFBox: Extracting text from PDF file using pdfbox

PDFBox has classes for detecting font position, type, size and maybe (I didn't search deeper) other settings (links below). You could then extract the text from both PDFs, compare it to check whether the texts are equal, and, if they are, compare their formatting. If something differs, mark it for display in another text, image or PDF.

http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html

http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
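The "compare the text first, then the formatting" idea can be sketched as below. This is only an illustration under a few assumptions: `TextRun` and `TextFormatDiff` are hypothetical names, not PDFBox classes, and the runs are assumed to have been collected already (with PDFBox, one option would be to subclass `PDFTextStripper`, override `writeString(String, List<TextPosition>)`, and read values such as `getFontSizeInPt()` from each `TextPosition`). The sketch also pairs runs by position in the list, which breaks down as soon as text is inserted or removed; a real comparison would need a proper sequence diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one extracted run of text; NOT a PDFBox class.
final class TextRun {
    final int page;
    final String text;
    final String fontName;
    final float fontSizePt;

    TextRun(int page, String text, String fontName, float fontSizePt) {
        this.page = page;
        this.text = text;
        this.fontName = fontName;
        this.fontSizePt = fontSizePt;
    }
}

// Hypothetical helper: pairs runs by index and reports text or format differences.
final class TextFormatDiff {
    static List<String> compare(List<TextRun> first, List<TextRun> second) {
        List<String> report = new ArrayList<>();
        int n = Math.min(first.size(), second.size());
        for (int i = 0; i < n; i++) {
            TextRun a = first.get(i);
            TextRun b = second.get(i);
            if (!a.text.equals(b.text)) {
                // Text itself differs, so comparing the formatting is not meaningful.
                report.add("page " + a.page + ": text differs: \"" + a.text
                        + "\" vs \"" + b.text + "\"");
                continue;
            }
            // Same text: now compare the formatting attributes.
            if (!a.fontName.equals(b.fontName)) {
                report.add("page " + a.page + ": \"" + a.text + "\" font name "
                        + a.fontName + " vs " + b.fontName);
            }
            if (Math.abs(a.fontSizePt - b.fontSizePt) > 0.01f) {
                report.add("page " + a.page + ": \"" + a.text + "\" font size "
                        + a.fontSizePt + " vs " + b.fontSizePt);
            }
        }
        if (first.size() != second.size()) {
            report.add("run count differs: " + first.size() + " vs " + second.size());
        }
        return report;
    }
}
```

Each entry in the returned list names the page and the text involved, which is the kind of detail the question asks for ("The text" differs on page 6). Differences in images or charts are not covered by this and would still need an image-based comparison.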



Answer 2:

Check out this Java package: https://java.net/projects/pdf-renderer

You can convert each PDF page to an image, then traverse the two images as 2D arrays of pixels and compare them position by position.
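A minimal sketch of that traversal, using only `java.awt.image.BufferedImage` from the JDK. It assumes both pages have already been rendered to images of the same size (by pdf-renderer, PDFBox, or anything else); `PageImageDiff` and `diffBounds` are hypothetical names chosen for this example. Rather than only saying that the page differs, it returns the bounding box of the differing pixels, so the changed region can be located.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper: locates the region where two rendered pages differ.
final class PageImageDiff {
    // Returns the smallest rectangle containing every differing pixel,
    // or null when the two images are pixel-for-pixel identical.
    static Rectangle diffBounds(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("page images must have the same dimensions");
        }
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = -1;
        int maxY = -1;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                // getRGB returns the pixel as a packed ARGB int.
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel was found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }
}
```

An exact pixel comparison like this is sensitive to anti-aliasing and rendering noise, so in practice a small per-channel tolerance may be needed. A single bounding box also merges unrelated changes on the same page; splitting the page into tiles and comparing each tile separately would give finer localisation.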