OK, I've done some research on the subject, but as the title indicates, I'm no expert. So here's the problem: I'm extracting some text from PDFs using Python and the pdfminer library.
I've only tried documents with Latin characters, and it works well in most cases, except when the font is not Latin/Western. The document that bugs me now uses Latin characters from a Japanese font. Adobe tells me the encoding is Adobe-Identity. All I get is the CID of the char, and I can't find the related cidmap.
I know I'm not using the right terms; I mean the PDF tells me cid=3 and I know the char is a space. I've manually written a map for the chars in the range 0x00-0xFF. Some sources say it matches the "mac-roman" encoding, others disagree. Other sources say it matches the OpenType mapping, but I couldn't find anything beyond 0xFF. And I've got CIDs >3000.
You can tell I'm very confused, so you're invited to correct my terminology, but what I want is a map that matches my own, extended to the range 0x0100-0xFFFF.
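To make the idea of such a map concrete, here is a minimal sketch of the lookup described above. It assumes the CIDs have already been extracted as integers; the names CID_TO_CHAR and decode_cids are hypothetical, and the only entry taken from the document is cid=3 for a space.

```python
# Hypothetical hand-written CID -> character map.
# Only "cid 3 is a space" comes from the document; a real map would
# need many more entries.
CID_TO_CHAR = {3: " "}


def decode_cids(cids, cid_map, missing="\ufffd"):
    """Translate a sequence of integer CIDs to text.

    CIDs without an entry in cid_map are replaced by `missing`
    (U+FFFD by default) so that gaps in the map stay visible.
    """
    return "".join(cid_map.get(cid, missing) for cid in cids)


text = decode_cids([3, 3001], CID_TO_CHAR)
```

Using a visible replacement character rather than silently dropping unknown CIDs makes it easier to see how much of the range beyond 0xFF is still unmapped.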
ETA: the link to the bugging pdf http://www.sas.upenn.edu/~jtigay/JapanVol.pdf
ETA2: I found this: ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/adobe/aj14.tar.Z. The cid2code.txt within the archive is the kind of map I'm looking for. But for all those fonts the CID column seems "shifted" by two: cid1 maps to space.
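As a rough sketch of how a table like that could be loaded with the observed shift applied: the code below assumes a tab-separated layout with "#" comment lines, a header row naming the columns, hexadecimal code values, and "*" marking a missing mapping. The column name "Unicode", the function load_cid_map, and the sample rows are all assumptions for illustration and are not copied from the archive.

```python
def load_cid_map(lines, column, shift=0):
    """Build {cid + shift: character} from a cid2code-style table.

    Assumed layout: tab-separated fields, '#' comment lines, a header
    row whose first field is the CID column, hex code values, and '*'
    for "no mapping".
    """
    cid_map = {}
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-comment row names the columns
            col = header.index(column)
            continue
        value = fields[col]
        if value == "*":
            continue  # no mapping listed for this CID
        # If several codes are listed (comma-separated), keep the first.
        first = value.split(",")[0]
        try:
            char = chr(int(first, 16))
        except ValueError:
            continue  # skip values that are not plain hex
        cid_map[int(fields[0]) + shift] = char
    return cid_map


# Made-up excerpt for illustration only.
SAMPLE = [
    "# made-up excerpt, not real data",
    "CID\tUnicode",
    "0\t*",
    "1\t0020",
    "2\t0021",
]

# shift=2 reflects the observation that the table's cid1 (space)
# shows up as cid=3 in the PDF.
cid_map = load_cid_map(SAMPLE, "Unicode", shift=2)
```

Whether a constant shift of two holds for every CID in the document is untested here; it only encodes the single observation about the space character.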
ETA3: corrected encoding
You might be searching for the tables provided in the Adobe Developer Support Technical Note #5078 in combination with the background knowledge provided by the Technical Note #5014.
Unfortunately you have not provided the document that bugs you; thus, I cannot check whether the link really is appropriate.
EDIT
As you corrected your question and are now asking about the special-purpose Adobe-Identity-0 ROS (“ROS” is an abbreviation for /Registry, /Ordering, and /Supplement, the three /CIDSystemInfo dictionary elements that are present in CIDFont and CMap resources) instead of Adobe-Japan1-?, the links above aren't of interest to you anymore. Unfortunately, though, Adobe-Identity seems to be the ROS of choice whenever none of the public ROSes is applicable. Thus, there seems to be no generic answer to your question about a map from CID to Unicode.
Furthermore, the /ToUnicode maps of the Times fonts in your PDF all look like this:
(Here the CIDSystemInfo interestingly differs from that in the PDF font object itself, Adobe-Identity-0.)
According to the PDF specification ISO 32000-1:2008 section 9.10.3, though,
Thus, there is no usable mapping defined, which, according to the same spec and in combination with other aspects of those fonts, implies that