Advanced PDF parser for Java

I want to extract different content from a PDF file in Java:

The complete visible text
images
links

Is it also possible to get the following?

document meta tags like title, description or author
only headlines
input elements if the document contains a form

I do not need to manipulate or render PDF files. Which library would be the best fit for that kind of purpose?

UPDATE

OK, I tried PDFBox:

Document luceneDocument = LucenePDFDocument.getDocument(new File(path));
Field contents = luceneDocument.getField("contents");
System.out.println(contents.stringValue());

But the output is null. The field "summary" is OK though.

The next snippet works fine.

PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
System.out.println(text);
doc.close();

But then, I have no clue how to extract the images, links, etc.

UPDATE 2

I found an example of how to extract the images, but I still have no answer on how to extract:

links
document meta tags like title, description or author
only headlines
input elements if the document contains a form

Tags: java parsing pdf

5 answers

戒情不戒烟

Reply #2 · 2019-01-08 16:19

iText is my PDF tool of choice these days.

The complete visible text

"Visible" is a tough one. You can parse out all the parsable text with the com.itextpdf.text.pdf.parse package's classes... but those classes don't know about CLIPPING. You can constrain the parser to the page size easily enough.

// all text on the page, regardless of position
PdfTextExtractor.getTextFromPage(reader, pageNum);

You'd actually need the overload that takes a TextExtractionStrategy, and you'd pass it a filtered strategy. It gets interesting fairly quickly, but I think you can get everything you want here "out of the box".
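To make the idea of a filtered strategy concrete, here is a minimal sketch in plain Java, with no iText on the classpath. It assumes each piece of text arrives together with the coordinates of its baseline start. `RegionFilterSketch`, `Chunk`, and `textInside` are hypothetical names for illustration, not iText API; as far as I recall, iText 5 covers this role with a region filter wrapped around an extraction strategy.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RegionFilterSketch {
    // Hypothetical stand-in for one piece of text reported by a PDF parser.
    static class Chunk {
        final String text;
        final float x; // baseline start, in PDF user-space units
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keep only chunks whose baseline start lies inside the given region.
    static List<String> textInside(Rectangle2D region, List<Chunk> chunks) {
        List<String> kept = new ArrayList<String>();
        for (Chunk c : chunks) {
            if (region.contains(c.x, c.y)) {
                kept.add(c.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // US Letter page: 612 x 792 points, origin at the lower-left corner.
        Rectangle2D page = new Rectangle2D.Float(0, 0, 612, 792);
        List<Chunk> chunks = Arrays.asList(
            new Chunk("on the page", 72, 700),
            new Chunk("off the page", 700, 700));
        System.out.println(textInside(page, chunks)); // prints [on the page]
    }
}
```

This only tests where a chunk starts. Text that begins on the page and runs off the edge would still be kept, so it's a rough cut rather than real clipping.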

images

Yep, via the same package classes. Image listeners aren't as well supported as text listeners, but do exist.

links

Yes. Links are "annotations" to various PDF pages. Finding them is a simple matter of looping through each page's "annotations array" and picking out the link annotations.

PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
ArrayList<String> dests = new ArrayList<String>();
if (annots != null) {
  for (int i = 0; i < annots.size(); ++i) {
    PdfDictionary annotDict = annots.getAsDict(i);
    PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
    if (subType != null && PdfName.LINK.equals(subType)) {
      PdfDictionary action = annotDict.getAsDict(PdfName.A);
      if (action != null && PdfName.URI.equals(action.getAsName(PdfName.S))) {
        dests.add(action.getAsString(PdfName.URI).toString());
      } // else { it's an internal link, meh }
    }
  }
}

You can find the PDF Spec here.

input elements

Definitely. For either XFA (LiveCycle Designer) or the older-tech "AcroForm" forms, iText can find all the fields, and their values.

AcroFields fields = myReader.getAcroFields();

Set<String> fieldNames = fields.getFields().keySet();
for (String fldName : fieldNames) {
  System.out.println( fldName + ": " + fields.getField( fldName ) );
}

Multi-select lists wouldn't be handled all that well. You'll get a blank space after the colon for empty text fields and for buttons. None too informative... but that'll get you started.

document meta tags like title, description or author

Pretty trivial. Yes.

Map<String, String> info = myReader.getInfo();
System.out.println( info );

In addition to the basic author/title/etc., there's a fairly involved XML schema (XMP) you can access via reader.getMetadata(), which hands back the raw XML as a byte array.
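On the "description" part of the question: the PDF Info dictionary has no Description key. The standard entries are Title, Author, Subject, Keywords and a few others, so Subject is probably the closest match. A minimal sketch, assuming a map shaped like the one getInfo() returns; `InfoSketch` and `describe` are hypothetical names and the sample values are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class InfoSketch {
    // Standard Info dictionary keys relevant to the question.
    static final String[] KEYS = { "Title", "Subject", "Author" };

    // Build "Key: value" lines, marking absent entries instead of printing null.
    static String describe(Map<String, String> info) {
        StringBuilder sb = new StringBuilder();
        for (String key : KEYS) {
            String value = info.get(key);
            sb.append(key).append(": ")
              .append(value == null ? "(not set)" : value)
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data; a real map would come from PdfReader.getInfo().
        Map<String, String> info = new LinkedHashMap<String, String>();
        info.put("Title", "Quarterly report");
        info.put("Author", "Jane Doe");
        System.out.print(describe(info));
    }
}
```

Plenty of PDFs in the wild leave these entries empty, so treat every key as optional.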

only headlines

A render filter (a RenderFilter subclass from the same parse package) can ignore text based on whatever criteria you wish. Font size sounds about right based on your comment.
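A rough sketch of the font-size idea in plain Java, with no iText dependency. It assumes you've already collected each text run along with a height; in iText 5, one way to estimate that, I believe, is the vertical distance between a run's ascent and descent lines. `HeadlineSketch`, `Run`, `headlines`, and the 1.2 threshold factor are all made up for illustration. The heuristic: treat the most common height as body text, and anything noticeably taller as a headline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeadlineSketch {
    // Hypothetical stand-in for one run of text and its rendered height.
    static class Run {
        final String text;
        final int height; // rounded to whole points to make counting stable

        Run(String text, int height) {
            this.text = text;
            this.height = height;
        }
    }

    // The most frequent height is assumed to be the body text size.
    static int bodyHeight(List<Run> runs) {
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        int best = 0;
        int bestCount = 0;
        for (Run r : runs) {
            Integer old = counts.get(r.height);
            int n = (old == null ? 0 : old) + 1;
            counts.put(r.height, n);
            if (n > bestCount) {
                best = r.height;
                bestCount = n;
            }
        }
        return best;
    }

    // Keep runs at least 20% taller than body text (arbitrary threshold).
    static List<String> headlines(List<Run> runs) {
        double cutoff = bodyHeight(runs) * 1.2;
        List<String> result = new ArrayList<String>();
        for (Run r : runs) {
            if (r.height >= cutoff) {
                result.add(r.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
            new Run("Introduction", 18),
            new Run("First paragraph.", 11),
            new Run("Second paragraph.", 11),
            new Run("Details", 14),
            new Run("Third paragraph.", 11));
        System.out.println(headlines(runs)); // prints [Introduction, Details]
    }
}
```

Counting runs rather than characters is a shortcut. A document with many short headlines and few long paragraphs would fool it, so weighting by text length may work better on real files.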

老娘就宠你

Reply #3 · 2019-01-08 16:25

Yes Alp, iText does offer the functionality you mentioned.

READING PDFS

iText isn't a PDF viewer: it can't convert a PDF to an image, and it can't be used to print a PDF. But the PdfReader class gives you access to the objects that form a PDF document and to the content stream of each page. This content stream can be parsed, and if the content wasn't added as rasterized text, you can convert a page to plain text. Note that iText doesn't do OCR.

Use the com.itextpdf.text.pdf.PdfReader class.

成全新的幸福

Reply #4 · 2019-01-08 16:26

Apache comes to the rescue, once again.

啃猪蹄的小仙女

Reply #5 · 2019-01-08 16:26

You can also use JPedal for all these extraction tasks.

放我归山

Reply #6 · 2019-01-08 16:31

Most of this you can do with our PDF Library extended edition as well.

Whichever solution you go for, bear in mind that for certain PDF documents, text extraction is impossible due to the way the PDF is constructed (the glyphs on the page sometimes don't have any semantic meaning associated with them).

The quick way to check this is to open the document in Acrobat and try copying/pasting the text. If it comes up as gibberish there, chances are it will come up as gibberish in any other PDF extractor.
