CIDFonts and mapping

Posted 2019-09-06 00:12

OK, I've done some research on the subject, but as the title indicates I'm no expert. Here's the problem: I'm extracting text from PDFs using Python and the pdfminer library.

I've only tried documents with Latin characters, and it works well in most cases, except when the font is not Latin/Western. The document that bugs me now uses Latin characters from a Japanese font. Adobe tells me the encoding is Adobe-Identity. All I get is the CID of each character, and I can't find the related CID map.

I know I'm not using the right terms. What I mean is that the PDF tells me cid=3 and I know the character is a space. I've manually written a map for the characters in the range 0x00-0xFF. Some sources say it matches the "mac-roman" encoding; others disagree. Other sources say it matches the OpenType mapping, but I couldn't find anything beyond 0xFF, and I've got CIDs above 3000.

You can tell I'm very confused, so you're welcome to correct my terminology, but what I want is a map that matches my own, extended to the range 0x0100-0xFFFF.

ETA: the link to the PDF that bugs me: http://www.sas.upenn.edu/~jtigay/JapanVol.pdf
ETA2: I found this: ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/aj14.tar.Z The cid2code.txt within the archive is the kind of map I'm looking for, but for all those fonts the CID column seems "shifted" by two: cid1 maps to space.
ETA3: corrected the encoding.

1 answer
对你真心纯属浪费
Reply #2 · 2019-09-06 00:16

You might be searching for the tables provided in the Adobe Developer Support Technical Note #5078

Adobe-Japan1-6 Character Collection for CID-Keyed Fonts

in combination with the background knowledge provided by the Technical Note #5014

Adobe CMap and CIDFont Files Specification.

Unfortunately you have not provided the document that bugs you; thus, I cannot check whether the link really is appropriate.

EDIT

As you have corrected your question and are now asking about the special-purpose Adobe-Identity-0 ROS (“ROS” stands for /Registry, /Ordering, and /Supplement, the three /CIDSystemInfo dictionary entries present in CIDFont and CMap resources) instead of Adobe-Japan1-?, the links above are no longer of interest to you. Unfortunately, Adobe-Identity seems to be the ROS of choice whenever none of the public ROSes is applicable. Thus, there seems to be no generic answer to your question, i.e. no general map from CID to Unicode.
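To make the ROS idea concrete, here is a minimal Python sketch. It assumes the /CIDSystemInfo entries have already been read into a plain dict; the function names and the PUBLIC_ORDERINGS set are illustrative assumptions, not part of pdfminer or any other library. The set lists only the well-known public Adobe character collections and is not meant to be exhaustive.

```python
# Hypothetical helpers: the names below are illustrative, not a library API.
# Orderings of well-known public Adobe character collections (not exhaustive).
PUBLIC_ORDERINGS = {"Japan1", "GB1", "CNS1", "Korea1"}


def ros_name(info):
    """Join Registry, Ordering and Supplement into the usual ROS string."""
    return "{Registry}-{Ordering}-{Supplement}".format(**info)


def has_public_cid_table(info):
    """True if the ROS names a collection with a published CID table."""
    return info["Registry"] == "Adobe" and info["Ordering"] in PUBLIC_ORDERINGS


# The ROS reported for the fonts in the question.
identity = {"Registry": "Adobe", "Ordering": "Identity", "Supplement": 0}
print(ros_name(identity), has_public_cid_table(identity))
```

For Adobe-Identity-0 the check returns False, which is the situation described above: the CIDs are private to the font, so a table such as cid2code.txt for Adobe-Japan1 does not apply.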

Furthermore, the /ToUnicode maps of the Times fonts in your PDF all look like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end

(Interestingly, the CIDSystemInfo here differs from that in the PDF font object itself, which is Adobe-Identity-0.)

According to the PDF specification ISO 32000-1:2008 section 9.10.3, though,

it shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.

Thus, no usable mapping is defined, which, according to the same specification and in combination with other aspects of those fonts, implies that

there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
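The point can be checked with a rough Python sketch that collects bfchar and bfrange entries from the text of a /ToUnicode CMap. It is a simplified regex-based reader, not a full CMap parser, and parse_tounicode is a hypothetical helper rather than a pdfminer function. PDF_TOUNICODE is an abridged copy of the CMap shown above; SAMPLE is a made-up CMap, included only to show what a usable mapping would look like.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"
# One bfrange entry: <lo> <hi> followed by either <start> or [<d1> <d2> ...].
BFRANGE_ENTRY = re.compile(
    HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])", re.S
)


def utf16be(hex_digits):
    """Decode a hex string such as '0020' as UTF-16BE text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_tounicode(cmap_text):
    """Return {character code: Unicode string} from bfchar/bfrange sections."""
    mapping = {}
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            mapping[int(src, 16)] = utf16be(dst)
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, start, array in BFRANGE_ENTRY.findall(body):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if start:
                # <lo> <hi> <start>: consecutive codes get consecutive values.
                width = len(start) // 2
                for offset, code in enumerate(codes):
                    value = int(start, 16) + offset
                    mapping[code] = value.to_bytes(width, "big").decode("utf-16-be")
            else:
                # <lo> <hi> [<d1> <d2> ...]: one destination per code.
                for code, dst in zip(codes, re.findall(HEX, array)):
                    mapping[code] = utf16be(dst)
    return mapping


# Abridged copy of the /ToUnicode CMap found in the PDF.
PDF_TOUNICODE = """
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
"""

# Made-up example of a CMap that does define mappings.
SAMPLE = """
1 beginbfchar
<0003> <0020>
endbfchar
1 beginbfrange
<0024> <0026> <0041>
endbfrange
"""

print(parse_tounicode(PDF_TOUNICODE))  # {}
print(parse_tounicode(SAMPLE))
```

The CMap from the PDF declares only a codespace range, so the result is empty, whereas the made-up sample yields entries such as code 3 mapping to a space.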
