Tess4j on Windows 64-bit: exception on multiple th

Posted 2019-04-11 08:48

Question:

I am using Tesseract 3 with Java 8 on 64-bit Windows to OCR scanned PDFs. I followed the instructions on the Tess4j page, used the 64-bit versions of the required DLLs, and installed 64-bit Ghostscript.

When I run my unit test with the plain @Test annotation (no arguments), the code runs correctly, so I assume everything is installed correctly.

When I run it with 2 threads in parallel (see below) I get an exception.

I have read the relevant thread here, but it suggests using Tesseract1, which I am already using (I have tried both classes).

Any ideas?

This is the code:

//  @Test // works
@Test(invocationCount = 2, threadPoolSize = 2)
public void testOcr() throws OcrException, TesseractException {
    File scannedPdf = new File(this.getClass().getClassLoader().getResource("scanned.pdf").getFile());
//  Tesseract instance = Tesseract.getInstance();  // JNA Interface Mapping
    Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
    String str = instance.doOCR(scannedPdf);
    System.out.println("OCR Result: " + str);
}

This is the exception:

log4j:WARN No appenders could be found for logger (org.ghost4j.Ghostscript).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Ιουλ 16, 2014 6:22:23 ΜΜ net.sourceforge.vietocr.PdfUtilities convertPdf2Png
SEVERE: Cannot initialize Ghostscript interpreter. Error code is -21
org.ghost4j.GhostscriptException: Cannot initialize Ghostscript interpreter. Error code is -21
    at org.ghost4j.Ghostscript.initialize(Ghostscript.java:365)
    at net.sourceforge.vietocr.PdfUtilities.convertPdf2Png(Unknown Source)
    at net.sourceforge.vietocr.PdfUtilities.convertPdf2Tiff(Unknown Source)
    at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

java.lang.Error: Invalid memory access
    at com.sun.jna.Native.invokeInt(Native Method)
    at com.sun.jna.Function.invoke(Function.java:383)
    at com.sun.jna.Function.invoke(Function.java:315)
    at com.sun.jna.Library$Handler.invoke(Library.java:212)
    at com.sun.proxy.$Proxy3.gsapi_init_with_args(Unknown Source)
    at org.ghost4j.Ghostscript.initialize(Ghostscript.java:350)
    at net.sourceforge.vietocr.PdfUtilities.convertPdf2Png(Unknown Source)
    at net.sourceforge.vietocr.PdfUtilities.convertPdf2Tiff(Unknown Source)
    at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
net.sourceforge.tess4j.TesseractException: javax.imageio.IIOException: I/O error reading header!
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Unknown Source)
    at OcrUtilsTest.testOcr(OcrUtilsTest.java:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    at org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    at org.testng.internal.thread.ThreadUtil$2.call(ThreadUtil.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.imageio.IIOException: I/O error reading header!
    at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.readHeader(TIFFImageReader.java:224)
    at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.locateImage(TIFFImageReader.java:231)
    at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.getNumImages(TIFFImageReader.java:279)
    at net.sourceforge.vietocr.ImageIOHelper.getIIOImageList(Unknown Source)
    ... 18 more
Caused by: java.io.EOFException
    at javax.imageio.stream.ImageInputStreamImpl.readShort(ImageInputStreamImpl.java:229)
    at javax.imageio.stream.ImageInputStreamImpl.readUnsignedShort(ImageInputStreamImpl.java:242)
    at com.sun.media.imageioimpl.plugins.tiff.TIFFImageReader.readHeader(TIFFImageReader.java:199)
    ... 21 more

UPDATE: It seems related to this.

Answer 1:

Tesseract on its own can only convert images to text. It cannot read PDFs, even scanned ones.

Under the hood, Tess4j uses Ghostscript (through ghost4j) to convert each page to a single image file, which it then feeds to Tesseract for OCR. It concatenates the resulting strings into a single string, which it returns.

The exception occurs because Tess4j uses ghost4j in a way that does not support multithreading. As described here, ghost4j does support multithreading through its high-level API (it actually runs separate Ghostscript instances, each invoked from a different JVM). Tess4j, however, uses the low-level API, where only a single Ghostscript instance may be used.
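
One workaround that follows from this explanation is to make sure only one thread at a time enters the PDF conversion. The sketch below is a minimal illustration using only the Java standard library. It is not part of Tess4j or ghost4j: GhostscriptGate and GateDemo are hypothetical names, and the sleeping task merely stands in for a call such as instance.doOCR(scannedPdf).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not a Tess4j or ghost4j class). */
final class GhostscriptGate {
    // A single lock for the whole JVM, on the assumption that the
    // native Ghostscript interpreter is shared process-wide.
    private static final Object LOCK = new Object();

    private GhostscriptGate() {}

    /** Runs the task while holding the JVM-wide lock, so gated calls never overlap. */
    static <T> T runExclusively(Callable<T> task) throws Exception {
        synchronized (LOCK) {
            return task.call();
        }
    }
}

/** Hypothetical demo: measures how many gated tasks ever run at the same time. */
final class GateDemo {
    static int peakConcurrency(int threads) throws Exception {
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        // Stand-in for the non-thread-safe PDF-to-image conversion.
        Callable<Integer> conversion = () -> {
            int now = active.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            Thread.sleep(50);
            active.decrementAndGet();
            return now;
        };
        // The same work, but routed through the gate.
        Callable<Integer> gated = () -> GhostscriptGate.runExclusively(conversion);

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(gated);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("peak concurrency: " + peakConcurrency(4));
    }
}
```

In the test from the question, the call would be wrapped along the lines of GhostscriptGate.runExclusively(() -> instance.doOCR(scannedPdf)). Note that this serializes the whole call, including the OCR step, so it trades parallelism for stability. Converting the PDF pages to images under the lock and running OCR on the images outside it may recover some of that parallelism, although whether concurrent OCR on images is safe is not established here.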