I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned pieces of paper, which means each page is just an image. My goal is to extract the text of the PDF files anyway.
My Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (never mind the missing exception handling):
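import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public String extractText(InputStream stream) {
    AutoDetectParser parser = new AutoDetectParser();
    // Integer.MAX_VALUE effectively lifts the handler's default write limit
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    // parse() throws checked exceptions; handling is omitted here
    parser.parse(stream, handler, metadata, context);
    return handler.toString();
}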
I searched a lot but didn't find any solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but this didn't change a thing.
Extracting embedded documents using a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of a doc file, but not those of my PDF files.
It would be awesome if any of you could provide some help :)
Tim Allison provided the solution:
This works for me :)
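A minimal sketch of this kind of setup, not necessarily the exact code from the answer. It assumes the fix comes down to registering the PDF config, the OCR config and the parser itself on the ParseContext, so that images embedded in the PDF are handed back to the parser (and from there to Tesseract). The class name OcrPdfExtractor and the method buildContext are hypothetical names made up for illustration. The Tika classes and setters are the ones that ship with Tika 1.x.

```java
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class; the name is illustrative only.
class OcrPdfExtractor {

    // Builds a ParseContext that asks the PDF parser to extract inline
    // images and makes the given parser available for embedded content.
    static ParseContext buildContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        // Default OCR settings; assumes tesseract is reachable on the PATH.
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Without a Parser in the context, embedded documents (here: the
        // page images) are not parsed recursively.
        context.set(Parser.class, parser);
        return context;
    }

    static String extractText(InputStream stream) throws Exception {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        parser.parse(stream, handler, new Metadata(), buildContext(parser));
        return handler.toString();
    }
}
```

Calling OcrPdfExtractor.extractText(new FileInputStream("scan.pdf")) would then run the OCR path, where scan.pdf is a placeholder file name.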
EDIT: Here is the complete solution:
Maven Dependencies: