Convert image to searchable pdf [closed]

Posted 2019-06-25 04:23

Question:

Hi, I am looking for an open-source Java API that can convert a TIFF image to a searchable PDF (OCR). I have researched this but found nothing so far.

NOTE: I have looked at the post "Java OCR implementation", but the API discussed there does not convert the image to PDF. However, I am still playing with the code a bit.

Answer 1:

You can convert images to PDF using iText. The hard thing here is doing the OCR, not creating the PDF.
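As a rough illustration of why the PDF side is the easy part: most of the work is sizing the page to the scan. PDF page dimensions are given in points (1/72 inch), so pixel dimensions are scaled by the scan resolution. The sketch below uses only the JDK (ImageIO reads TIFF from Java 9 on). The class name `PageSize` and the idea of passing the DPI on the command line are assumptions made for this example, and the actual iText calls that write the PDF are not shown.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class PageSize {

    /** Converts a pixel length to PDF points (1 point = 1/72 inch) at the given scan resolution. */
    static double pixelsToPoints(int pixels, int dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return pixels * 72.0 / dpi;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PageSize <image.tif> <dpi>");
            return;
        }
        // ImageIO.read returns null when no installed reader understands the file.
        BufferedImage image = ImageIO.read(new File(args[0]));
        if (image == null) {
            System.err.println("unsupported image format: " + args[0]);
            return;
        }
        int dpi = Integer.parseInt(args[1]); // assumed known; reading it from TIFF metadata is not shown
        System.out.printf("page size: %.1f x %.1f pt%n",
                pixelsToPoints(image.getWidth(), dpi),
                pixelsToPoints(image.getHeight(), dpi));
    }
}
```

For example, a 2550 x 3300 pixel scan at 300 dpi works out to 612 x 792 pt, which is US Letter.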

I will warn you: any OCR engine worth using is going to cost you a significant amount of money. Free and open-source engines are generally pet projects or proofs of concept for some algorithm or another, and are not suitable for real-world OCR applications. Tesseract is probably the best of the bunch, but even its accuracy is far worse than that of the commercial engines.

We have a commercial OCR application, and I've been down this path while evaluating engines. I'd suggest that you bite the bullet, reach out to the engine providers, and get quotes: ABBYY (best accuracy, most expensive, slower), ExperVision (fast, not as accurate, middle-of-the-road price), Nuance (middle of the road on speed, accuracy, and price). None of these are written in Java, so you should plan some time to develop JNI code around their APIs.
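One way to keep the JNI work contained is to hide it behind a small Java interface, so the rest of the application never touches native code directly. The following is only a sketch of that layout. Every name in it (`OcrEngine`, `NativeOcrEngine`, the `ocrbridge` library, `nativeRecognize`) is hypothetical and does not correspond to any vendor's real API; the native side would have to be written in C or C++ against whichever engine you license.

```java
public class OcrBridgeDemo {

    /** Engine-neutral interface that application code depends on. */
    interface OcrEngine {
        String recognize(byte[] imageBytes);
    }

    /** Hypothetical JNI binding; the native half lives in a C/C++ bridge library you would write. */
    static final class NativeOcrEngine implements OcrEngine {
        private static boolean loaded;

        // Load the bridge lazily so the class can be referenced on machines without the library.
        private static synchronized void ensureLoaded() {
            if (!loaded) {
                System.loadLibrary("ocrbridge"); // hypothetical library name
                loaded = true;
            }
        }

        // Declared here, implemented in native code (hypothetical).
        private native String nativeRecognize(byte[] imageBytes);

        @Override
        public String recognize(byte[] imageBytes) {
            ensureLoaded();
            return nativeRecognize(imageBytes);
        }
    }

    /** Stand-in engine for development and tests; performs no OCR. */
    static final class FakeOcrEngine implements OcrEngine {
        @Override
        public String recognize(byte[] imageBytes) {
            return "fake text (" + imageBytes.length + " bytes)";
        }
    }

    public static void main(String[] args) {
        // Swap in new NativeOcrEngine() once the bridge library exists.
        OcrEngine engine = new FakeOcrEngine();
        System.out.println(engine.recognize(new byte[] {1, 2, 3}));
    }
}
```

With this split, switching vendors later means rewriting only the bridge library and the one binding class.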

Good luck - it's a big project!

Answer 2:

Cuneiform is free and easy to use. It can output hOCR, which the hocr2pdf tool (part of ExactImage) can then use to generate an invisible text layer on a PDF.
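Since the question asks for Java, here is a minimal sketch of driving those two command-line tools from Java with `ProcessBuilder`. It assumes `cuneiform` and `hocr2pdf` are installed and on the PATH, that `cuneiform` accepts `-f hocr -o <file>`, and that `hocr2pdf` takes the image with `-i`, the PDF with `-o`, and reads the hOCR on standard input. Check both tools' help output before relying on this. The class and method names are made up for the example, and depending on how Cuneiform was built, the TIFF may need converting to another format first.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

public class SearchablePdfPipeline {

    /** Command that asks cuneiform to OCR the image and write hOCR (flags assumed, see above). */
    static List<String> cuneiformCommand(String image, String hocr) {
        return List.of("cuneiform", "-f", "hocr", "-o", hocr, image);
    }

    /** Command that asks hocr2pdf to build the PDF; the hOCR itself is fed on stdin. */
    static List<String> hocr2pdfCommand(String image, String pdf) {
        return List.of("hocr2pdf", "-i", image, "-o", pdf);
    }

    /** Runs a command, optionally redirecting a file to its stdin, and fails on a non-zero exit. */
    static void run(List<String> command, File stdin) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        if (stdin != null) {
            pb.redirectInput(stdin);
        }
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException(command.get(0) + " exited with status " + exit);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: SearchablePdfPipeline <input image> <output.pdf>");
            return;
        }
        File hocr = File.createTempFile("page", ".hocr");
        try {
            run(cuneiformCommand(args[0], hocr.getPath()), null); // step 1: image -> hOCR
            run(hocr2pdfCommand(args[0], args[1]), hocr);         // step 2: image + hOCR -> PDF
        } finally {
            hocr.delete();
        }
    }
}
```

This handles one page per run; a multi-page TIFF would need to be split into pages first, which is not shown.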