Question:

I need to compare two PDF files and find all the visual differences between them. I know there are related questions on Stack Overflow, but they do not meet my need.

I'm currently using PDFBox to render each page of the PDFs to an image and comparing the bytes of the images.

With this approach I can tell that a particular page differs.

But I need finer detail, such as the font size of some text; for example, that "The text" differs on page 6 of the PDFs.

This is not only about text: I need to catch all visual differences, such as images, text in charts, etc.

Please suggest a way to achieve this.

PS: I tried Apache Tika, but my impression is that it can only be used to get structured text in XHTML and metadata. Fine details such as font size and font weight do not appear in the structured text. Please correct me if I've got this wrong.

Answer 1:

PDF to image using Java

Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)

https://www.google.com.br/search?q=PixelGraber&ie=utf-8&oe=utf-8&rls=org.mozilla:pt-BR:official&client=firefox-a&gws_rd=cr&ei=K1PhUqD2Jei0sQTQs4DoAw

A good library for converting PDF to TIFF?

Convert jpeg/png to an array of pixels in java

int pixels array to bmp in java

Finding pixel position

Get Pixel Color around an image
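
The pixel-related links above all point toward the same idea: once both pages are rendered to images, compare them pixel by pixel instead of byte by byte, so you learn where on the page the difference is. A minimal sketch of that idea, using only `java.awt.image.BufferedImage`, might look like the following. The class and method names (`PageImageDiff`, `diffBounds`) are made up for illustration, and the sketch assumes both pages were rendered at the same size and resolution.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox or any other library.
final class PageImageDiff {

    /**
     * Returns the smallest rectangle enclosing every pixel that differs
     * between the two images, or null if all pixels are equal.
     * Assumption: both images have the same width and height.
     */
    static Rectangle diffBounds(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("images must have equal dimensions");
        }
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
        int maxX = -1, maxY = -1;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                // getRGB returns the pixel as a packed ARGB int
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }
}
```

This only gives one bounding box per page. If several unrelated regions differ, you would probably want to split the page into tiles and run the same check per tile.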

For extraction of text using PDFBox: Extracting text from PDF file using pdfbox

PDFBox has classes for detecting font position, type, size and maybe other settings (I didn't search deeper); see the links below. You could then extract the text from both PDFs and compare it to check whether the texts are equal; where they are equal, compare their formatting. If something differs, mark it for display in another text, image or PDF.

http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html

http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
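
The "compare text first, then compare format" step described above could be sketched roughly as follows. This is not PDFBox code: `TextRun` is a hypothetical holder that you would fill yourself from the extracted data (for example from the `getFont()` and `getFontSize()` values of `TextPosition`), and the sketch assumes both documents yield their text runs in the same order, which will not hold if text was inserted or removed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class TextFormatDiff {

    /** One run of text plus its formatting, filled in by the caller. */
    static final class TextRun {
        final int page;
        final String text;
        final String fontName;
        final float fontSize;

        TextRun(int page, String text, String fontName, float fontSize) {
            this.page = page;
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /** Compares runs pairwise by index and returns human-readable findings. */
    static List<String> compare(List<TextRun> left, List<TextRun> right) {
        List<String> report = new ArrayList<>();
        int n = Math.min(left.size(), right.size());
        for (int i = 0; i < n; i++) {
            TextRun l = left.get(i);
            TextRun r = right.get(i);
            String where = "page " + l.page + ", \"" + l.text + "\": ";
            if (!l.text.equals(r.text)) {
                report.add(where + "text differs, other is \"" + r.text + "\"");
                continue; // format comparison only makes sense for equal text
            }
            if (!l.fontName.equals(r.fontName)) {
                report.add(where + "font differs: " + l.fontName + " vs " + r.fontName);
            }
            // small tolerance because sizes are floating-point values
            if (Math.abs(l.fontSize - r.fontSize) > 0.01f) {
                report.add(where + "font size differs: " + l.fontSize + " vs " + r.fontSize);
            }
        }
        if (left.size() != right.size()) {
            report.add("number of text runs differs: " + left.size() + " vs " + right.size());
        }
        return report;
    }
}
```

Each entry in the returned list can then be used to mark the difference in whatever output you choose (text, image or PDF).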