I need to compare two PDF files and find all the visual differences between them. I know there are related questions on Stack Overflow, but they do not cover what I need.
I'm currently using PDFBox to render each page of the PDFs to an image and comparing the bytes of the images.
With this approach I can tell that a particular page differs.
But I need finer details, such as the font size of a piece of text: for example, that "The text" differs on page 6 of the PDFs.
This applies not only to text; I need to catch all visual differences, such as images, text in charts, etc.
Please suggest a way to achieve this.
PS: I tried Apache Tika, but as far as I can tell it is meant for extracting structured text as XHTML plus metadata. Fine details such as font size and font weight do not appear in the structured text. Please correct me if I've got this wrong.
PDF to image using Java
Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)
https://www.google.com.br/search?q=PixelGraber&ie=utf-8&oe=utf-8&rls=org.mozilla:pt-BR:official&client=firefox-a&gws_rd=cr&ei=K1PhUqD2Jei0sQTQs4DoAw
A good library for converting PDF to TIFF?
Convert jpeg/png to an array of pixels in java
int pixels array to bmp in java
Finding pixel position
Get Pixel Color around an image
For extraction of text using PDFBox: Extracting text from PDF file using pdfbox
PDFBox has classes for reading the position, font type and size of text, and maybe (I didn't search deeper) other settings (links below). You could then extract the text from both PDFs and compare it; where the text is equal, compare its formatting. If something differs, mark it for display in another text file, image or PDF.
http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html
http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
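The "compare the text first, then the format" step could be sketched roughly as below. This is only an illustration in plain Java: `TextRun` and `FormatDiff` are hypothetical names, not PDFBox classes, and the sketch assumes you have already collected one `TextRun` per piece of text from each PDF (for example from the values a PDFBox `TextPosition` exposes, such as `getFontSize()` and `getFont()`). It also assumes both lists are in the same reading order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one piece of extracted text and its formatting.
final class TextRun {
    final String text;
    final String fontName;
    final float fontSize;

    TextRun(String text, String fontName, float fontSize) {
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
    }
}

// Hypothetical helper: compares two lists of runs position by position.
class FormatDiff {
    static List<String> compare(List<TextRun> left, List<TextRun> right) {
        List<String> diffs = new ArrayList<>();
        int n = Math.min(left.size(), right.size());
        for (int i = 0; i < n; i++) {
            TextRun l = left.get(i);
            TextRun r = right.get(i);
            // If the text itself differs, report that and skip the format check.
            if (!l.text.equals(r.text)) {
                diffs.add("text differs at run " + i + ": \"" + l.text
                        + "\" vs \"" + r.text + "\"");
                continue;
            }
            // Same text: now compare the formatting.
            if (!l.fontName.equals(r.fontName)) {
                diffs.add("font of \"" + l.text + "\" differs: "
                        + l.fontName + " vs " + r.fontName);
            }
            if (Float.compare(l.fontSize, r.fontSize) != 0) {
                diffs.add("font size of \"" + l.text + "\" differs: "
                        + l.fontSize + " vs " + r.fontSize);
            }
        }
        if (left.size() != right.size()) {
            diffs.add("run count differs: " + left.size() + " vs " + right.size());
        }
        return diffs;
    }
}
```

Comparing by index is the simplest choice; if text can be inserted or removed between the two PDFs, a real diff algorithm over the runs would probably be needed instead.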
Check out this Java package: https://java.net/projects/pdf-renderer
You can convert each PDF page to an image, then traverse the two images as 2D arrays of pixels and compare them pixel by pixel to find the differences.
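A minimal sketch of that pixel traversal, using only the JDK's `BufferedImage`. It assumes both pages have already been rendered to images of the same size (by pdf-renderer, PDFBox or anything else); `PageImageDiff` and `diffBounds` are made-up names for illustration.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

class PageImageDiff {
    /**
     * Returns the smallest rectangle containing every pixel that differs
     * between the two images, or null if the images are identical.
     */
    static Rectangle diffBounds(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Pages must be rendered at the same size");
        }
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
        int maxX = -1, maxY = -1;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                // getRGB returns the pixel as a packed ARGB int.
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }
}
```

This tells you where on the page something changed, which is more than a byte comparison gives, but not what changed (font size, image, chart text). An exact pixel match may also be too strict if the two renderings differ slightly in anti-aliasing, so a tolerance per colour channel might be needed.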