What to do with CIDs in text extracted by PDFMiner

Posted 2019-03-04 18:28

I have some PDFs in Hindi that contain extractable text. I used pdfminer.six with Python 3.6 to do the extraction. The output looks like this:
[screenshot of the extracted text]

As one can see, a number of characters are converted into the form "(cid:number)".

On further analysis, I found out that a PDF contains CMaps which map character codes to glyph indices. So, a CID is a character identifier for the glyph it maps to, inside the CMap table.

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?

Moreover, according to a comment to this similar question, this process may not be legal. But I'm not trying to steal someone's font. I want the text. How does this process become illegal?

Since there are many questions like this one, I want to clarify that I am not trying to solve the "cid" problem. I want to understand the reasons for the problem and the reasons for its illegality.

EDIT: An issue on the pdfminer issue tracker discusses this problem, and the author clearly says there that there seems to be no reliable workaround. Is there some general, basic limitation (like no access to the font) that makes this issue persist?

1 answer
劫难
Reply #2 · 2019-03-04 19:00

But how are these character codes related to Unicode values? Basically, how is a PDF viewer able to show the glyph using this mapping?

The character codes one finds in the PDF content streams do not need to be related to Unicode values in any obvious way. In particular, a PDF viewer does not at all need a Unicode code point for a character code to show the matching glyph.

In a PDF a font has a mapping (or a sequence of mappings) from character code to glyph ID in the font program, and this mapping may be completely arbitrary.

E.g. in the case of embedded font subsets, the subset font program is often created by giving the first glyph used on a page a starting glyph id n, the second, different glyph on that page id n+1, the next, different glyph id n+2, and so on. The character codes are then often identical to the glyph ids, i.e. the mapping above is the identity mapping. If no additional information is available, a text extractor has no chance to do its job properly.
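The subsetting scheme described above can be illustrated with a small sketch. This is only a toy model, not how any particular PDF producer works: the function name `build_subset` and the starting id are made up for illustration, and it treats each character as one glyph, which real fonts (especially for Devanagari) do not do.

```python
def build_subset(text, first_glyph_id=1):
    """Assign glyph ids in order of first use, as a subsetter might (toy model)."""
    glyph_ids = {}
    for ch in text:
        if ch not in glyph_ids:
            # Each new, different glyph gets the next free id: n, n+1, n+2, ...
            glyph_ids[ch] = first_glyph_id + len(glyph_ids)
    # With an identity mapping, the character codes written to the
    # content stream are simply the glyph ids.
    codes = [glyph_ids[ch] for ch in text]
    return glyph_ids, codes


glyph_ids, codes = build_subset("hello")
print(codes)  # [1, 2, 3, 3, 4]
```

The codes depend only on the order in which glyphs first appear, so the same letter can get a different code in every document. Nothing in them points back to a Unicode value.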

I want to clarify the reasons for the problem

Regular text extraction usually has the following options to find the Unicode value for a character code:

  • A PDF font may include a ToUnicode map (mapping from character code to Unicode) to support operations like searching strings or copy & paste in a PDF viewer. This map immediately provides the mapping the text extractor needs.

    Beware, though: these ToUnicode maps can be incomplete and sometimes even contain intentionally incorrect mappings!

  • The PDF font encoding definition may contain the name of a pre-defined standard encoding (e.g. WinAnsiEncoding or GBpc-EUC-H) or a standardized character name (e.g. space, seven, or ntilde) for a given code. A text extractor merely needs to know the encoding represented by that encoding name or the code represented by that character name.

    But the Encoding may also be an identity (Identity-H and Identity-V with character code = glyph code) which doesn't give away anything, and a character name may also be non-standardized (e.g. g17).

The PDF specification says: "If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing."
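The lookup order described above can be sketched roughly as follows. This is a simplified illustration, not pdfminer's actual code: the dictionaries stand in for data a real extractor would parse from the PDF font objects, and the names `resolve` and `GLYPH_NAME_TO_UNICODE` are hypothetical.

```python
# Tiny stand-in for a table of standardized character names (assumption:
# a real extractor would use a much larger glyph name list).
GLYPH_NAME_TO_UNICODE = {"space": " ", "seven": "7", "ntilde": "\u00f1"}


def resolve(code, to_unicode, code_to_glyph_name):
    """Return the text for a character code, or a (cid:N) placeholder."""
    # 1. A ToUnicode map, if present, gives the answer directly.
    if code in to_unicode:
        return to_unicode[code]
    # 2. Otherwise try a standardized character name from the encoding.
    name = code_to_glyph_name.get(code)
    if name in GLYPH_NAME_TO_UNICODE:
        return GLYPH_NAME_TO_UNICODE[name]
    # 3. Nothing left to try: the meaning of the code is unknown.
    return "(cid:%d)" % code


to_unicode = {1: "\u0928"}             # incomplete ToUnicode map (only code 1)
glyph_names = {2: "seven", 3: "g17"}   # g17 is a non-standardized name
result = [resolve(c, to_unicode, glyph_names) for c in (1, 2, 3, 4)]
print(result)
```

Here codes 3 and 4 fall through to the placeholder: one has only a non-standardized name, the other has no information at all.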

In case of your text extraction output I would guess the PDF font has an incomplete ToUnicode map.
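To get a feeling for how incomplete the mapping is, one could count the placeholders in the extracted text. The following is a minimal sketch, assuming the placeholders have exactly the "(cid:number)" form shown in the question; the helper name `unmapped_cids` and the sample string are made up for illustration.

```python
import re
from collections import Counter

# Matches placeholders such as "(cid:12)" and captures the number.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def unmapped_cids(text):
    """Count how often each unmapped CID placeholder occurs in the text."""
    return Counter(int(number) for number in CID_PATTERN.findall(text))


sample = "(cid:12)(cid:7)\u0915(cid:12)"
counts = unmapped_cids(sample)
print(counts.most_common())  # [(12, 2), (7, 1)]
```

If only a handful of distinct CIDs show up, and properly decoded characters appear alongside them, that is consistent with a ToUnicode map that exists but misses some entries.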

Actually there are some more locations to look for additional information, e.g. the font program might include its own mapping of its glyphs to Unicode, but that additional information is also optional.

... and reasons for its illegality.

For all the above options I don't see any sensible font license being violated, in particular as most of those options don't even look into the font program (e.g. the *.ttf) itself, merely into the PDF metadata wrapping it.

On the other hand, if e.g. you had the idea to construct ToUnicode maps for those fonts missing such a map by drawing each glyph of the font onto a bitmap, nicely separated from anything else, and applying OCR to it, you as the recipient of the PDF would suddenly use the font program to draw something other than the original document, and this might be considered usage not covered by the license.
