I am using iText PdfTextExtractor to extract text from the PdfReader, where the PdfReader is created from a byte array,
byte[] pdfbytes = outputStream.toByteArray();
PdfReader reader = new PdfReader(pdfbytes);
int pagenumber = reader.getNumberOfPages();
PdfTextExtractor extractor = new PdfTextExtractor(reader);
for(int i = 1; i<= pagenumber; i++) {
System.out.println("============PAGE NUMBER " + i + "=============" );
String line = extractor.getTextFromPage(i);
System.out.println(line);
}
The first test pdf is from: http://www.gnostice.com/downloads/Gnostice_PathQuest.pdf I can print out the first page, but get the follow exception at the second page
Exception:
Exception in thread "main" ExceptionConverter: java.io.IOException: Error reading string at file pointer 238291
at com.lowagie.text.pdf.PRTokeniser.throwError(Unknown Source)
at com.lowagie.text.pdf.PRTokeniser.nextToken(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.nextValidToken(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.readPRObject(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.parse(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
at org.xxx.services.pdfparser.xxxExtensionPdfParser.main(xxxExtensionPdfParser.java:114)
where xxxExtensionPdfParser.java:114 is String line = extractor.getTextFromPage(i);
But at second test at http://www.irs.gov/pub/irs-pdf/fw4.pdf, I can get text content without exception. So i think it must be the format issue of first pdf that causes the exception.
So my question is, what is this format issue and is there anyway to avoid it? Thanks.