Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work.
My use case is that I want some code that will extract the content and, separately, the images from any document (not necessarily a PDF). The result then gets passed into an Apache UIMA pipeline.
I've been able to extract images from other document types by using a custom parser (built on an AutoDetectParser) to convert the documents to HTML and then save the images out separately. When I try with PDFs, though, the image tags don't even appear in the HTML, let alone give me access to the files.
Could someone suggest how I might achieve the above, preferably with some code examples of how to do inline image extraction from PDFs with Tika 1.6?
Try the code below. The returned ContentHandler holds your XML content.
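import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public ContentHandler convertPdf(byte[] content, final String path, String filename)
        throws IOException, SAXException, TikaException {
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    ContentHandler handler = new ToXMLContentHandler();

    // Inline image extraction is off by default, so enable it explicitly.
    PDFParser parser = new PDFParser();
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(true);
    parser.setPDFParserConfig(config);

    // Called once per embedded resource; here each one is written to disk.
    EmbeddedDocumentExtractor embeddedDocumentExtractor =
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            // path is used as a plain prefix, so it should end with a file separator.
            Path outputFile = new File(path + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            // Files.copy throws FileAlreadyExistsException if the target already exists.
            Files.copy(stream, outputFile);
        }
    };

    context.set(PDFParser.class, parser);
    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);

    try (InputStream stream = new ByteArrayInputStream(content)) {
        parser.parse(stream, handler, metadata, context);
    }
    return handler;
}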
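One thing to watch when saving the embedded files: the output name comes straight from the resource's metadata, which may be missing or contain path separators, and Files.copy fails if the target already exists. Here is a minimal, JDK-only sketch of a more defensive save step. The class name EmbeddedImageWriter, the methods safeName and save, and the "embedded-N.bin" fallback name are all made up for illustration and are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class EmbeddedImageWriter {

    // Hypothetical helper: turns the value of
    // metadata.get(Metadata.RESOURCE_NAME_KEY), which may be null,
    // into a plain file name without any directory part.
    public static String safeName(String resourceName, int index) {
        String fallback = "embedded-" + index + ".bin";
        if (resourceName == null || resourceName.trim().isEmpty()) {
            return fallback;
        }
        // Keep only the last path segment so a name such as "../../x.png"
        // cannot point outside the output directory.
        String normalized = resourceName.replace('\\', '/');
        String last = normalized.substring(normalized.lastIndexOf('/') + 1);
        if (last.isEmpty() || last.equals(".") || last.equals("..")) {
            return fallback;
        }
        return last;
    }

    // Copies the stream into outputDir, creating the directory if needed
    // and replacing any file left over from an earlier run.
    public static Path save(InputStream stream, Path outputDir,
            String resourceName, int index) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(safeName(resourceName, index));
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tika-images");
        byte[] fakeImage = {1, 2, 3}; // stand-in for an embedded image stream
        Path saved = save(new ByteArrayInputStream(fakeImage), dir, "../image0.jpg", 0);
        System.out.println(saved);
    }
}
```

Inside parseEmbedded you could then call something like save(stream, outputDir, metadata.get(Metadata.RESOURCE_NAME_KEY), counter), with a counter you maintain yourself.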
It is possible to use an AutoDetectParser to extract images, without relying on PDFParser directly. This code works just as well for extracting images from docx, pptx, etc.
Here I have a parseDocument() and a setPdfConfig() function, which make use of an AutoDetectParser.
- I create an AutoDetectParser.
- Attach an EmbeddedDocumentExtractor onto a ParseContext.
- Attach the AutoDetectParser onto the same ParseContext.
- Attach a PDFParserConfig onto the same ParseContext.
- Then give that ParseContext to AutoDetectParser.parse().
The images are saved to a folder in the same location as the source file, with the name <sourceFile>_/.
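import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Enables inline image extraction for any PDF handled through this context.
private static void setPdfConfig(ParseContext context) {
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    pdfConfig.setExtractUniqueInlineImagesOnly(true);
    context.set(PDFParserConfig.class, pdfConfig);
}

// Returns the document as XHTML and saves embedded images to "<path>_/".
private static String parseDocument(final String path) {
    String xhtmlContents = "";
    AutoDetectParser parser = new AutoDetectParser();
    ContentHandler handler = new ToXMLContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    EmbeddedDocumentExtractor embeddedDocumentExtractor =
            new EmbeddedDocumentExtractor() {
        @Override
        public boolean shouldParseEmbedded(Metadata metadata) {
            return true;
        }

        @Override
        public void parseEmbedded(InputStream stream, ContentHandler handler,
                Metadata metadata, boolean outputHtml)
                throws SAXException, IOException {
            Path outputDir = new File(path + "_").toPath();
            Files.createDirectories(outputDir);
            Path outputPath = new File(outputDir.toString() + "/"
                    + metadata.get(Metadata.RESOURCE_NAME_KEY)).toPath();
            // Overwrite any image left over from a previous run.
            Files.deleteIfExists(outputPath);
            Files.copy(stream, outputPath);
        }
    };

    context.set(EmbeddedDocumentExtractor.class, embeddedDocumentExtractor);
    context.set(AutoDetectParser.class, parser);
    setPdfConfig(context);

    try (InputStream stream = new FileInputStream(path)) {
        parser.parse(stream, handler, metadata, context);
        xhtmlContents = handler.toString();
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }
    return xhtmlContents;
}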
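Since the images end up in the <sourceFile>_ folder rather than in the returned XHTML, the calling code has to pick them up from disk afterwards, for example before handing them to a later processing step. Here is a rough JDK-only sketch of that lookup. The class name ExtractedImageLister and the method listExtracted are hypothetical names of my own; the only thing taken from the code above is the "append an underscore to the source path" folder convention.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ExtractedImageLister {

    // Returns the regular files found in "<sourcePath>_", sorted by path.
    // The list is empty if the folder does not exist, i.e. nothing was extracted.
    public static List<Path> listExtracted(String sourcePath) throws IOException {
        Path outputDir = Paths.get(sourcePath + "_");
        List<Path> files = new ArrayList<>();
        if (!Files.isDirectory(outputDir)) {
            return files;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(outputDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ExtractedImageLister <sourceFile>");
            return;
        }
        for (Path image : listExtracted(args[0])) {
            System.out.println(image);
        }
    }
}
```

You would call listExtracted(path) right after parseDocument(path), using the same path string for both.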