Convert image to searchable pdf [closed]

Posted 2019-06-25 04:39

Hi, I am looking for an open-source Java API that can convert a TIFF image to a searchable PDF (OCR). I have researched this but found nothing so far.

NOTE: I have looked at the post "Java OCR implementation", but the API suggested there does not convert the image to PDF. However, I am still experimenting with the code.

Tags: java pdf ocr tiff
2 answers
啃猪蹄的小仙女
#2 · 2019-06-25 04:50

Cuneiform is free and easy to use. It can write its output in hOCR format, which the hocr2pdf tool (part of ExactImage) can then use to add an invisible text layer to a PDF.
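Since the question asks for Java, here is a minimal sketch of driving those two tools from Java with ProcessBuilder. It assumes that `cuneiform` and `hocr2pdf` are installed and on the PATH, that your Cuneiform build can read the input image directly (some builds may need the TIFF converted first), and that hocr2pdf reads the hOCR from standard input. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/** Illustrative wrapper (hypothetical name) around the cuneiform and hocr2pdf command-line tools. */
final class SearchablePdfPipeline {

    /** Step 1: OCR the image and write the recognized text as hOCR. */
    static List<String> ocrCommand(Path image, Path hocr) {
        return List.of("cuneiform", "-f", "hocr", "-o", hocr.toString(), image.toString());
    }

    /** Step 2: place the image in a PDF; the hOCR is supplied on stdin. */
    static List<String> pdfCommand(Path image, Path pdf) {
        return List.of("hocr2pdf", "-i", image.toString(), "-o", pdf.toString());
    }

    /** Runs both steps in order and fails if either tool exits with a non-zero code. */
    static void convert(Path image, Path hocr, Path pdf)
            throws IOException, InterruptedException {
        run(new ProcessBuilder(ocrCommand(image, hocr)).inheritIO());
        run(new ProcessBuilder(pdfCommand(image, pdf))
                .redirectInput(hocr.toFile()) // feed the hOCR file to hocr2pdf's stdin
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .redirectError(ProcessBuilder.Redirect.INHERIT));
    }

    private static void run(ProcessBuilder pb) throws IOException, InterruptedException {
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException(pb.command().get(0) + " exited with code " + exit);
        }
    }
}
```

Calling `SearchablePdfPipeline.convert(Paths.get("scan.tif"), Paths.get("scan.hocr"), Paths.get("scan.pdf"))` would then run both steps. Building the command lists in separate methods keeps them easy to inspect or adjust (for example, to add a language option) without running the tools.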

迷人小祖宗
#3 · 2019-06-25 04:58

You can convert images to PDF using iText. The hard thing here is doing the OCR, not creating the PDF.

I will warn you: any OCR engine that is worth using is going to cost you a significant amount of money. Free and/or open-source ones are generally pet projects or proofs of concept for some algorithm or another, and are not suitable for real-world OCR applications. Tesseract is probably the best of the bunch, but even its accuracy is far, far worse than that of commercial engines.

We have a commercial OCR application, and I've been down this path while evaluating engines - I'd suggest that you bite the bullet and reach out to the engine providers and get quotes: Abbyy (best accuracy, most expensive, slower), Expervision (fast, not as accurate, middle of the road price), Nuance (middle of the road speed, accuracy and price). None of these will be written in Java, so you should plan some time to develop JNI code around their APIs.
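To give a rough idea of what the Java side of such a JNI wrapper might look like, here is a sketch. Everything in it is hypothetical: the library name `ocrshim`, the native function names, and the license-path parameter are placeholders, not any vendor's actual API. The native half (a C/C++ shim that calls the vendor SDK) is not shown.

```java
import java.nio.file.Path;

/** Hypothetical Java-side wrapper around a native OCR engine, reached through a JNI shim. */
final class NativeOcrEngine implements AutoCloseable {

    // Placeholder name for the JNI shim library you would write around the vendor SDK.
    private static final String LIBRARY = "ocrshim";

    // Opaque pointer to the native engine instance; 0 means closed.
    private long handle;

    private NativeOcrEngine(long handle) {
        this.handle = handle;
    }

    /** Loads the shim and creates an engine; throws UnsatisfiedLinkError if the shim is missing. */
    static NativeOcrEngine open(String licensePath) {
        System.loadLibrary(LIBRARY);
        long h = nativeCreate(licensePath);
        if (h == 0) {
            throw new IllegalStateException("native engine could not be initialized");
        }
        return new NativeOcrEngine(h);
    }

    /** Asks the native engine to OCR the image and write a searchable PDF. */
    void toSearchablePdf(Path image, Path pdf) {
        if (handle == 0) {
            throw new IllegalStateException("engine is closed");
        }
        int rc = nativeRecognizeToPdf(handle, image.toString(), pdf.toString());
        if (rc != 0) {
            throw new IllegalStateException("OCR failed with native error code " + rc);
        }
    }

    /** Releases the native engine; safe to call more than once. */
    @Override
    public void close() {
        if (handle != 0) {
            nativeDestroy(handle);
            handle = 0;
        }
    }

    // Implemented in the C/C++ shim (hypothetical signatures).
    private static native long nativeCreate(String licensePath);
    private static native int nativeRecognizeToPdf(long handle, String imagePath, String pdfPath);
    private static native void nativeDestroy(long handle);
}
```

Keeping the native handle behind an AutoCloseable lets callers use try-with-resources, so the engine is released even when recognition throws.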

Good luck - it's a big project!
