I'm working on a project that entails photographing text (from any hard copy of text) and converting that text into a text file. Then I'd like to use that text file to do some different things, such as provide hyperlinks to news articles or allow the user to edit the document.
The tool I've tried so far is Java OCR from sourceforge.net, which works fine on the sample images provided in the package. But when I photograph my own text, it doesn't work at all. Is there some training process I should be implementing? If so, does anybody know how to implement it? Any help will go a long way. Thank you!
I have a Java application where I ended up deciding to use Tesseract OCR, and just call out to it using Runtime.exec(). Perhaps not quite the answer you need, but just in case you hadn't considered it.
Edit: code added in response to a comment.
- On a Windows installation, I think I was able to use an installer or unzip a ready-made binary.
- On a Linux server, I needed to compile Tesseract myself, but it's not too hard if you're used to that kind of thing (gcc); the only gotcha is that there's a dependency on Leptonica, which also needs to be compiled.
// Tesseract can only handle the .tif format, so convert the image first.
// tmpFile[0] is used as the temporary .tif input, tmpFile[1] as the .txt output.
// Note: ImageIO.write returns false if no suitable writer is installed.
ImageIO.write(ImageIO.read(new java.io.File(file.getPath())), "tif", tmpFile[0]);
// Tesseract appends ".txt" to the output base name itself, so strip it here.
String[] tesseractCmd = new String[]{"tesseract", tmpFile[0].getAbsolutePath(), StringUtils.removeEnd(tmpFile[1].getAbsolutePath(), ".txt")};
final Process process = Runtime.getRuntime().exec(tesseractCmd);
try {
    // Block until Tesseract exits; an exit value of 0 means success.
    int exitValue = process.waitFor();
    if (exitValue == 0) {
        final String extractedText = SearchableTextExtractionUtils.extractPlainText(new FileReader(tmpFile[1]));
        return extractedText;
    }
    throw new SearchableTextExtractionException(exitValue, Arrays.toString(tesseractCmd));
} catch (InterruptedException e) {
    throw new SearchableTextExtractionException(e);
} finally {
    process.destroy();
}
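
The snippet above relies on helper classes from my own project, so here is a rough self-contained sketch of the same idea using only the JDK. It is not the code I actually run: the class name TesseractSketch, the method names, the 60-second timeout and the UTF-8 decoding are all assumptions made for illustration, and it expects a `tesseract` executable on the PATH. It uses ProcessBuilder and sends the process output to a log file, so the child process cannot block on a full output pipe while we wait for it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper class, for illustration only.
class TesseractSketch {

    // Builds "tesseract <image> <outputBase>". Tesseract appends ".txt"
    // to the output base itself, so a trailing ".txt" is stripped here.
    static List<String> buildCommand(Path image, Path textFile) {
        String out = textFile.toString();
        String outputBase = out.endsWith(".txt")
                ? out.substring(0, out.length() - ".txt".length())
                : out;
        return Arrays.asList("tesseract", image.toString(), outputBase);
    }

    // Runs Tesseract on the image and returns the recognised text.
    static String recognize(Path image) throws IOException, InterruptedException {
        Path textFile = Files.createTempFile("ocr", ".txt");
        Path log = Files.createTempFile("ocr", ".log");
        try {
            Process process = new ProcessBuilder(buildCommand(image, textFile))
                    .redirectErrorStream(true)     // merge stderr into stdout
                    .redirectOutput(log.toFile())  // write both to the log file
                    .start();
            // Assumed timeout; adjust for large images.
            if (!process.waitFor(60, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out");
            }
            if (process.exitValue() != 0) {
                String details = new String(Files.readAllBytes(log), StandardCharsets.UTF_8);
                throw new IOException("tesseract failed with exit value "
                        + process.exitValue() + ": " + details);
            }
            // Assumes the output file is UTF-8 text.
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(log);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TesseractSketch <image>");
            return;
        }
        System.out.println(recognize(Paths.get(args[0])));
    }
}
```

You would call it as TesseractSketch.recognize(Paths.get("scan.tif")), where scan.tif is a placeholder file name. Converting the photograph to .tif beforehand is left out to keep the sketch short.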