Remove PDFont caching with Apache tika

2019-03-04 06:19发布

I am trying to extract text only from a number of different coduments (rtf doc pdf). I naturally turned to Apache Tika because it can autodetect the document and extract text accordingly. I am only interested in the text and not formatting etc.

My application ends up with a big memory leak and on investigating it, this is coming from caching from PDFFont class from the PDFBox dependency. I am not interesting in caching Fontmetrics and other Font formatting issues from pdfs as I want to only extract the text.

I am using tika 1.12. Does anyone know how to get around this cahcing issue. This is how I am using Autodetect:

        AutoDetectParser parser = new AutoDetectParser();

        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(new File(child.getPath()));
        ParseContext context = new ParseContext();              
        parser.parse(inputstream, handler, metadata, context);
        String s=null;
        s =handler.toString();
        handler=null;
        context=null;
        inputstream.close();
        PDFont.clearResources();

标签： pdfbox apache-tika

1条回答

Summer. ? 凉城

2楼-- · 2019-03-04 06:34

So I fudged a workaround and just called System.gc(); everytime the file had finished being processed which works a treat but doesn't really answer the question.

0人赞添加讨论(0) 举报

Remove PDFont caching with Apache tika

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间