How can I read a PDF with iText?

Posted 2019-06-09 20:47

Question:

Now I get this error: May 08, 2018 12:27:47 PM toUnicode

WARNING: No Unicode mapping for CID+88 (88) in font 404198E5f54TimesNewRoman

The result is empty. I can provide the file if needed.

Answer 1:

Your sample PDF does not contain the information required for text extraction.

The document uses subset fonts with ad-hoc encodings: the first glyph of a font used on a page is encoded as some start value n, the next distinct glyph as n+1, the one after that as n+2, and so on.

For example, a two-word phrase ending in a colon is encoded in hexadecimal as 000a 000b 000c 000d 000e 000f 0010 for the first word and 0011 0012 0013 000c 000d 0010 0014 0015 0016 for the second word plus the colon. In the second word you can recognize the codes 000c, 000d, and 0010, which correspond to glyphs already used in the first word.

Obviously, without any extra information this encoding does not allow text extraction: how should a program map those values to Unicode?
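To make the problem concrete, here is a minimal sketch in plain Java of such a first-use encoding. It is not iText code and not how any particular PDF producer works; the class name `AdHocEncodingSketch`, its methods, and the sample words are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration: hands out glyph codes in order of first use,
// starting at some value n, the way the subset fonts described above do.
class AdHocEncodingSketch {

    // Encodes text; each character not seen before gets the next free code.
    // The 'codes' map is shared between calls so later words reuse earlier codes.
    static List<Integer> encode(String text, int start, Map<Character, Integer> codes) {
        List<Integer> out = new ArrayList<>();
        for (char c : text.toCharArray()) {
            Integer code = codes.get(c);
            if (code == null) {
                code = start + codes.size();
                codes.put(c, code);
            }
            out.add(code);
        }
        return out;
    }

    // Formats codes as four-digit hexadecimal values separated by spaces.
    static String toHex(List<Integer> codes) {
        return codes.stream()
                .map(c -> String.format("%04x", c))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        System.out.println(toHex(encode("banana", 0x0a, codes)));
        System.out.println(toHex(encode("bandit", 0x0a, codes)));
    }
}
```

The hexadecimal output alone says nothing about which letters were used: any other pair of words with the same repetition pattern would produce exactly the same codes. Only the `codes` map, which exists here but is missing from the PDF, lets you go back to the text.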

The PDF format does have options to include a map from those encoding values to Unicode, but unfortunately the fonts in your file don't include such mappings.

Thus, your file does not allow text extraction; you need to use OCR instead.



Answer 2:

A PDF with text contains syntax that draws glyphs on a page. The shapes of these glyphs are stored in a font. The syntax used for the page uses characters to refer to the glyphs.

For instance:

12334 54637

Is a possible representation of:

Hello World

Where you have the following mapping:

`1` = `H`
`2` = `e`
`3` = `l`
`4` = `o`
` ` = ` `
`5` = `W`
`6` = `r`
`7` = `d`

When you look at the page as a human, you see "Hello World", but when a machine looks at the syntax of the page, it sees "12334 54637" and that's also what you get if you extract the content without using a toUnicode mapping.

The mapping that I just described (1 = H, 2 = e, 3 = l, ...) is stored in an object that maps the character codes used on a page to Unicode characters. If that map is missing, there is no way to extract the content correctly.
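The "Hello World" example can be sketched in a few lines of plain Java. This is only an illustration of the idea, not iText's implementation: the class `ToUnicodeSketch`, its methods, and the choice to emit U+FFFD for unmapped codes are assumptions made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of decoding page codes through a code-to-Unicode map.
class ToUnicodeSketch {

    // Builds the example mapping from the answer (1 = H, 2 = e, 3 = l, ...).
    static Map<Character, Character> exampleMap() {
        Map<Character, Character> map = new HashMap<>();
        map.put('1', 'H');
        map.put('2', 'e');
        map.put('3', 'l');
        map.put('4', 'o');
        map.put(' ', ' ');
        map.put('5', 'W');
        map.put('6', 'r');
        map.put('7', 'd');
        return map;
    }

    // Decodes each code via the map; a code without an entry becomes U+FFFD,
    // the Unicode replacement character, because its meaning is unknown.
    static String decode(String pageCodes, Map<Character, Character> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (char code : pageCodes.toCharArray()) {
            Character mapped = toUnicode.get(code);
            sb.append(mapped != null ? mapped : '\uFFFD');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // With the map, the codes resolve to readable text.
        System.out.println(decode("12334 54637", exampleMap()));
        // Without the map, every code is unresolved.
        System.out.println(decode("12334 54637", new HashMap<>()));
    }
}
```

With the map present, the first call yields the readable string; with an empty map, every code stays unresolved, which is the situation a missing mapping in the PDF leaves the extractor in.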

The error you mention, `No Unicode mapping for CID+88 (88) in font 404198E5f54TimesNewRoman`, tells you that this information is missing from your PDF, so you can't get a reliable result. You can read the text with your own eyes, but a machine can't resolve it to a useful string.

If this answer doesn't satisfy you, please share the PDF so that we can prove that this answer is correct. Also, you don't mention which version of iText you're using. Older versions are usually worse at extracting text than more recent ones (iText 7.1.2 being the most recent release).



Tags: java pdf itext