Getting text fonts from a PDF file using iText

Posted 2019-03-02 11:20

I have been trying to extract the attributes (font, font size, color, etc.) of each word in a PDF document using the iText library. I can extract the text from every page, but not the attributes, and I haven't found anything in the library that provides them. Please help.

Answer 1:

I'm not a Java person, so I can't give you working code, but hopefully I can get you 95% of the way there.

First, you'll need to create a class that implements the interface com.itextpdf.text.pdf.parser.TextExtractionStrategy.

Then you can pass an instance of this class as the third parameter to:

PdfTextExtractor.getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy)

One of the methods of that interface is renderText, which gets called for every text block that is processed. Each call receives a TextRenderInfo, which has a method called getFont that should give you what you're looking for. Store what it returns in a buffer of some sort, and after getTextFromPage returns you can inspect that buffer to see each font. If you want to see an example of implementing that interface, look up the code for SimpleTextExtractionStrategy online. Otherwise, here's a C# version that pretty much does what you're looking for.
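
Separately from the C# version mentioned above, the "buffer of some sort" could be sketched in Java roughly as follows. This is only an illustration of the buffering idea, not iText code: FontRunBuffer and Run are hypothetical names, and the snippet is kept free of iText dependencies so it compiles on its own. The assumption is that your renderText implementation would call add with info.getText() and a font name such as info.getFont().getPostscriptFontName(); that wiring is shown only in a comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper (not part of iText): buffers text chunks together with
 * the name of the font they were drawn in, merging consecutive chunks that
 * share the same font into one run.
 */
final class FontRunBuffer {

    /** One stretch of consecutive text drawn in a single font. */
    static final class Run {
        final String fontName;
        final StringBuilder text = new StringBuilder();

        Run(String fontName) {
            this.fontName = fontName;
        }

        @Override
        public String toString() {
            return fontName + ": \"" + text + "\"";
        }
    }

    private final List<Run> runs = new ArrayList<>();

    /**
     * Records one chunk of text. Assumed wiring inside an iText 5
     * renderText(TextRenderInfo info) callback:
     *   buffer.add(info.getText(), info.getFont().getPostscriptFontName());
     */
    void add(String text, String fontName) {
        Run last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
        // Start a new run whenever the font differs from the previous chunk's.
        if (last == null || !last.fontName.equals(fontName)) {
            last = new Run(fontName);
            runs.add(last);
        }
        last.text.append(text);
    }

    /** Read-only view of the runs collected so far, in document order. */
    List<Run> getRuns() {
        return Collections.unmodifiableList(runs);
    }
}
```

Once getTextFromPage has returned, iterating over getRuns() would give each stretch of text alongside its font name. Merging by font keeps the buffer small, but if you need per-word attributes you would probably want to split each run's text on whitespace afterwards.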