Strange characters reading PDF with iTextSharp

Posted 2019-07-25 02:31

Question:

I'm using iTextSharp to read a PDF file. I try to read the full text in the first page with this simple code:

var pdfReader = new PdfReader("<fileName>");
var pageText = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new SimpleTextExtractionStrategy());

It returns a string like this:

"\0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 !\n\" \0 \0 \0 \0 \0 \0 # \0 $ \0 % \0 & $ \0 ’ \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 !\n\" \0 \0 \0 (\n\0 \0 \0 ) \0 \0 * \0 + , \0 , \0 \0 & , \0 - \0 . # \0 \0 \0 & $ \0 , \0 /\n+ \0 & & \0 * 0 \0 1 .\n2 \0 3\n4 - \0 5 \0 \0 $ \0 \0 # \0 \0 \0 & $ \0 , \0 * & \0 \0 ’ \0 .\n6\n\0 \0 \0 - \0 \0 \0 \0 & \0 \0 \0 \0 \0 \0 \0 , \0 # \0 \0 \0 & $ \0 , \0 \0 \0 & \0 # \0 \0 & $ ’ ) & \0 \0 \0 \0 # \0 ’ ’ \0 7 - \0 $ \0 \0 7 \0 ’ \0 , \0 8\n9 5 \0 \0 , \0 \0 $ $ \0 \0 \0 \0 \0 ’ \0 \0 3\n\0 \0 \0 ) \0 \0 \0 \0 4 - \0 5 \0 \0 $ \0 \0 * & \0 \0 ’ \0 .\n\0 \0 \0 \0 # \0 $ \0 $ \0 \0 ) \0 \0 \0 : 0 ; \0 ; < ; : 1 ; + \0 = < 9 = < < > \0 ? \0 ? \0 3 \0 (\n@\n\0 \0 # \0 $ \0 % \0 & $ \0 ’ \0 ! 3\n\0 ......"

I can read the original PDF with Acrobat Reader and browsers. The file seems to be a PDF/A.

The code I use works with other PDFs.

Does iText have a problem with this standard?

Can someone point me in the right direction?

Update

Copy/paste from Acrobat gives me broken text. I don't think it's an iTextSharp (5.5.10) problem.

Update

You can try with this file: PDF Example

Answer 1:

The file does not contain information required for text extraction. Furthermore, the file is invalid as a PDF/A file.

Information for text extraction

The sample file contains a background (located in a form XObject resource) showing the empty form and a foreground (immediately in the page content stream) of filled-in values.

The text in the form XObject is drawn using a Type 3 font that has neither a standard encoding nor standard glyph names in its encoding. It has no ToUnicode map either.

This means that the text-drawing instructions in that form XObject take sequences of bytes as arguments, and for each byte value the Type 3 font object provides a stream of simple drawing instructions (path definitions using lines and curves, and path-filling instructions). Nothing, however, says which Unicode value corresponds to a given byte value or set of drawing instructions.

Thus, PDF viewers can draw the page but they cannot correctly put a Unicode string of characters into the clipboard which we as humans would read from that drawing, and neither can iTextSharp.

Short of OCR there is no reasonable way to extract text from the form.
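
To check for this situation in other files, one can list the fonts a page uses and see which of them carry an Encoding or ToUnicode entry. The following is only a minimal sketch against the iTextSharp 5.x object model; the FontInspector class and its Describe method are hypothetical names introduced for illustration, not part of iTextSharp, and the sketch assumes that form XObjects do not reference each other cyclically.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
// Possible usage, with "<fileName>" as in the question:
//   var reader = new PdfReader("<fileName>");
//   var resources = reader.GetPageN(1).GetAsDict(PdfName.RESOURCES);
//   foreach (var line in FontInspector.Describe(resources)) Console.WriteLine(line);
//   reader.Close();
static class FontInspector
{
    // Returns one line per font found in the given resources dictionary and,
    // recursively, in the resources of the form XObjects it references.
    public static List<string> Describe(PdfDictionary resources, string indent = "")
    {
        var lines = new List<string>();
        if (resources == null) return lines;

        PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
        if (fonts != null)
        {
            foreach (PdfName key in fonts.Keys)
            {
                PdfDictionary font = fonts.GetAsDict(key);
                if (font == null) continue;
                // Encoding may be a name such as /WinAnsiEncoding or a dictionary;
                // a dictionary still needs a manual look at its /Differences names.
                PdfObject encoding = font.GetDirectObject(PdfName.ENCODING);
                lines.Add(string.Format("{0}{1}: Subtype={2}, Encoding={3}, ToUnicode={4}",
                    indent, key, font.GetAsName(PdfName.SUBTYPE),
                    encoding == null ? "(none)" : encoding.ToString(),
                    font.Contains(PdfName.TOUNICODE)));
            }
        }

        PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects != null)
        {
            foreach (PdfName key in xobjects.Keys)
            {
                PdfStream xobject = xobjects.GetAsStream(key);
                // Only form XObjects carry their own resources; skip images.
                if (xobject == null || !PdfName.FORM.Equals(xobject.GetAsName(PdfName.SUBTYPE)))
                    continue;
                lines.Add(indent + "Form XObject " + key + ":");
                lines.AddRange(Describe(xobject.GetAsDict(PdfName.RESOURCES), indent + "  "));
            }
        }
        return lines;
    }
}
```

A Type 3 font reported without a ToUnicode entry, and with an Encoding that turns out to use non-standard glyph names, is a strong hint that text drawn with it cannot be extracted.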


The text immediately in the foreground, on the other hand, is drawn using a font with a standard encoding (WinAnsiEncoding) and, therefore, can be extracted. Thus, at the end of the output of the OP's code you'll find

\u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000

 ...

\u0000 \u0000 \u0000 x s \u0000 l t n q o x m l \u0000 z \u0000 ~ { \u0000 } } \u0000 l w x
2016
14874587948 DITTA PROVA SRL
CREMA CR 26013 VIA DANTE 17
011110
LPRGCM82T26D150H LEOPARDI GIACOMO
M 26 12 1982 CREMONA CR
MILANO MI F205
28 02 2017
DITTAP0101 / LEOGIA01001

i.e. the filled-in values of the form.
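
If only these filled-in values are of interest, the unusable background lines can be dropped after extraction. The sketch below is merely a heuristic: it assumes, as in the output above, that background text comes out as U+0000 characters while foreground lines contain none. The ExtractedTextFilter class is a hypothetical helper, not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of iTextSharp.
// Possible usage: ExtractedTextFilter.ReadableLines(pageText), where pageText
// is the string returned by PdfTextExtractor.GetTextFromPage.
static class ExtractedTextFilter
{
    // Keeps the non-empty lines that contain no U+0000 character.
    public static List<string> ReadableLines(string pageText)
    {
        return pageText
            .Split('\n')
            .Select(line => line.Trim())   // Trim removes whitespace only, not U+0000
            .Where(line => line.Length > 0 && line.IndexOf('\0') < 0)
            .ToList();
    }
}
```

Short background lines that happen to contain no U+0000 at all will still slip through, so the result needs a sanity check before it is used.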

PDF/A conformance

The file indeed claims to be PDF/A-1a, but inspecting it one quickly sees that this is a blatant lie. E.g. Adobe Acrobat Preflight says:

[Screenshot of the Preflight report not preserved.]

These entries indicate that the document does not even try to be PDF/A-1a conformant; it merely claims to be.



Tags: c# pdf itext