iText PdfTextExtractor getTextFromPage exception

Posted 2019-08-13 04:11

Question:

I am using iText's PdfTextExtractor to extract text from a PdfReader, where the PdfReader is created from a byte array:

    byte[] pdfbytes = outputStream.toByteArray();

    PdfReader reader = new PdfReader(pdfbytes);

    int pagenumber = reader.getNumberOfPages();
    PdfTextExtractor extractor = new PdfTextExtractor(reader);

    for(int i = 1; i<= pagenumber; i++) {
        System.out.println("============PAGE NUMBER " + i + "=============" );
        String line = extractor.getTextFromPage(i);
        System.out.println(line);
    }

The first test PDF is from http://www.gnostice.com/downloads/Gnostice_PathQuest.pdf. I can print out the first page, but I get the following exception on the second page:

Exception:

Exception in thread "main" ExceptionConverter: java.io.IOException: Error reading string at file pointer 238291
at com.lowagie.text.pdf.PRTokeniser.throwError(Unknown Source)
at com.lowagie.text.pdf.PRTokeniser.nextToken(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.nextValidToken(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.readPRObject(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.parse(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
at org.xxx.services.pdfparser.xxxExtensionPdfParser.main(xxxExtensionPdfParser.java:114)

where xxxExtensionPdfParser.java:114 is String line = extractor.getTextFromPage(i);

With a second test PDF, http://www.irs.gov/pub/irs-pdf/fw4.pdf, I can get the text content without any exception. So I think a format issue in the first PDF must be causing the exception.

So my question is: what is this format issue, and is there any way to avoid it? Thanks.

Answer 1:

I am getting the same error. After some investigation, it seems that the problem with my PDF documents is that they contain a header or footer, unlike the IRS document you linked. I indexed a 900-page PDF document, and about 70 of the pages fail to extract. Apparently, all of these pages have copyright information in the footer. Any ideas on how to resolve this issue?

------ EDIT ------ I used the following approach to get the text out of the aforementioned PDF. I hope it works for you as well.


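    // iText 5.x classes (com.itextpdf.text.pdf and com.itextpdf.text.pdf.parser)
    PdfReader pdfReader = new PdfReader(file);
    PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);

    // processContent returns the strategy it was given, filled with the page's text
    SimpleTextExtractionStrategy strategy =
            parser.processContent(currentPage, new SimpleTextExtractionStrategy());
    String content = strategy.getResultantText();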
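
When only some pages fail (about 70 of 900 in my case), it can help to keep going past a bad page and record which page numbers failed, instead of letting the first exception stop the loop. Below is a minimal sketch of that idea. PageScan, PageExtractor and Result are hypothetical names made up for illustration and are not part of iText; the extractor is passed in as a lambda so the sketch runs without the library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PageScan {

    /** Hypothetical hook: wraps whatever call extracts the text of one page. */
    @FunctionalInterface
    interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Text of the pages that parsed, plus the numbers of the pages that did not. */
    static final class Result {
        final Map<Integer, String> textByPage = new LinkedHashMap<>();
        final List<Integer> failedPages = new ArrayList<>();
    }

    /** Tries every page from 1 to pageCount and never aborts on a single bad page. */
    static Result scan(int pageCount, PageExtractor extractor) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.textByPage.put(page, extractor.extract(page));
            } catch (Exception e) {
                // Catch broadly: iText's ExceptionConverter is a RuntimeException
                // wrapping the underlying IOException.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in extractor for the demo: page 2 fails, the others succeed.
        Result r = scan(3, page -> {
            if (page == 2) {
                throw new java.io.IOException("Error reading string");
            }
            return "text of page " + page;
        });
        System.out.println("failed pages: " + r.failedPages); // prints "failed pages: [2]"
    }
}
```

With iText 5.x on the classpath, the lambda could presumably be replaced by something like page -> PdfTextExtractor.getTextFromPage(reader, page), and the failed page numbers then show which pages to inspect more closely.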


Answer 2:

    byte[] pdfbytes = outputStream.toByteArray();

    PdfReader reader = new PdfReader(pdfbytes);

    int pagenumber = reader.getNumberOfPages();
    // No PdfTextExtractor instance is needed: getTextFromPage is called statically below (iText 5.x).

    for(int i = 1; i<= pagenumber; i++) {
        System.out.println("============PAGE NUMBER " + i + "=============" );
        String line = PdfTextExtractor.getTextFromPage(reader,i);
        System.out.println(line);
    }

Replace your code with this and it should work fine.