I need to invoke Tesseract OCR (an open source C++ library that does Optical Character Recognition) from a Java application server. Right now it's easy enough to run the executable using Runtime.exec(). The basic logic would be:
- Save the image that is currently held in memory to a file (a .tif).
- Pass the image file name to the tesseract command line program.
- Read the output text file back into Java using FileReader.
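The three steps above can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: it assumes the `tesseract` executable is on the PATH and is invoked as `tesseract <image> <outputbase>`, which writes `<outputbase>.txt`. The class and method names (`TesseractCli`, `buildCommand`, `recognize`) are hypothetical, the output is assumed to be UTF-8, and it uses ProcessBuilder instead of Runtime.exec() so that the child's output can be discarded (Java 9 or later for `List.of` and `Redirect.DISCARD`).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, for illustration only.
final class TesseractCli {

    // Builds the command line: tesseract <image> <outputBase>
    // (tesseract appends ".txt" to the output base itself).
    static List<String> buildCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    // Writes the in-memory TIFF to a temp file, runs tesseract on it,
    // and returns the recognized text.
    static String recognize(byte[] tiffBytes) throws IOException, InterruptedException {
        Path dir = Files.createTempDirectory("ocr");
        Path image = dir.resolve("page.tif");
        Path outputBase = dir.resolve("page");
        Path textFile = dir.resolve("page.txt");
        try {
            // Step 1: save the image held in memory to a .tif file.
            Files.write(image, tiffBytes);

            // Step 2: run the command line program. Its console output is
            // discarded so a full pipe buffer cannot block the child process.
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }

            // Step 3: read the output text file (UTF-8 is an assumption).
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Clean up the temporary files whether or not OCR succeeded.
            Files.deleteIfExists(image);
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(dir);
        }
    }
}
```

Checking the exit code is what lets the caller detect an abnormal termination of the OCR process without it affecting the application server itself.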
How much of a performance improvement am I likely to get by writing a JNI wrapper for Tesseract? Unfortunately there is no open source JNI wrapper that works on Linux. I would have to write it myself, and I am wondering whether the benefit is worth the development cost.
I agree with tweakt. Do not use JNI unless you have a performance reason to do so. JNI can also put your application's stability at risk: a memory leak or a crash in your JNI layer, or in the OCR library itself, happens inside the application server's process. That cannot happen when you go through the command line interface (all memory is released when the program exits, and any abnormal termination can be detected in the calling code).
If you do pursue your own wrapper, I recommend you check out JNA. It lets you call most "native" libraries while writing only Java code, and it gives you more help than raw JNI does in doing so safely. JNA is available for most platforms.
It's hard to say whether it would be worth it. If you assume that, when run in-process via JNI, the OCR code can access the image data directly without it first being written to a file, then that would certainly eliminate the disk I/O involved in that step.
I'd recommend going with the simpler approach and only undertaking the JNI option if performance is not acceptable. At least then you'll be able to do some benchmarking and estimate the performance gains you might be able to realize.
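For the benchmarking step, a rough timing harness along these lines might be enough to start with. It is only a sketch: the `OcrBenchmark` class and `averageMillis` method are hypothetical names, and wall-clock averages like this are coarse, so treat the numbers as estimates.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, for illustration only.
final class OcrBenchmark {

    // Runs the task `warmup` times untimed, then `runs` times timed,
    // and returns the average wall-clock time per timed run in milliseconds.
    static double averageMillis(Callable<?> task, int warmup, int runs) throws Exception {
        for (int i = 0; i < warmup; i++) {
            task.call();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.call();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / runs;
    }
}
```

Timing the whole command line round trip and, separately, just the temporary file write and read would show how much of the total is file and process overhead. That overhead is the most an in-process wrapper could save, since the recognition itself takes the same time either way.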