I finally got some sort of pdf scanner to work. It reads into the callback functions without a problem, but when I try to NSLog the result from a CGPDFScannerPopString I get a result like this:
ˆ ˛˝ # ˜˜˜ #˜' ˜˜˜ "˜ '˜˜ " ' ˜˜
No string to be found here...
Any ideas of what it can be?
This is my callback function:
static void op_Tj (CGPDFScannerRef s, void *info)
{
CGPDFStringRef string;
if (!CGPDFScannerPopString(s, &string))
return;
NSLog(@"string: %@", (__bridge NSString *)CGPDFStringCopyTextString(string));
}
Thanks already!
Edit: Example PDF
You should be aware that the CGPDFStringRef is not a ASCII string or something similar at all. Cf. http://developer.apple.com/library/mac/documentation/graphicsimaging/Reference/CGPDFString/Reference/reference.html --- it is a "series of bytes—unsigned integer values in the range 0 to 255" which have to be interpreted according to the latest PDF reference.
The PDF reference in turn will tell you that the interpretation of the bytes depends on the font used, and while ASCII-like interpretations are common in case of European languages, they are not mandatory, and in case of Asian languages where font subset embedding is very common, the interpretation may look random.
CGPDFStringCopyTextString tries to interpret those bytes accordingly, but there does not have to be a sensible interpretation as a regular string.
EDIT Inspection of the sample PDF Ron supplied showed that in case of this sample indeed the encoding of the font in object 3 0 (which is dominant on most pages of the document) is not a standard encoding but instead:
<</Type/Encoding
/Differences[0/.notdef/C/O/V/E/R/space/slash/H/L/F/underscore/W/B/five/eight/four
/zero/two/six/D/one/period/three/Z/I/N/G/U/S/T/colon/seven/A/M/P/Y
/plus/nine/X/hyphen/i/s/p/a/t/c/h/n/f/o/K/greater/equal/l/m/y/J/Q
/parenleft/parenright/comma/dollar/ampersand/d/r/v/b/e/u/w/k/g/x/bar
/quotesingle/asterisk/q/question/percent]
>>
Looking at the top of the first document page
COVER / HLF_CWEB_58408485 / 58408485 / 26DEC12 10.30.22Z
BRIEFING INCLUDES FOLLOWING FLIGHTS:
26DEC12 OR0337 EHAM0630 MUVR1710 PHOYE VSM+2/8 179
NEXT FLIGHTS OF AIRCRAFT:
26DEC12 OR0338 MUVR1830 MMUN1940 PHOYE VSM+2/8 213
26DEC12 OR0338 MMUN2105 EHAM0655 PHOYE GPT+2/7 263
27DEC12 OR0365 EHAM0900 TNCB1930 PHOYE BAH+1/8 272
27DEC12 OR0366 TNCB2030 TNCC2110 PHOYE BAH+1/8 250
27DEC12 OR0366 TNCC2250 EHAM0835 PHOYE ASD+1/8 199
that encoding seems to have been created by dealing out the next number starting from one for the next required glyph. This obviously results in a highly individualistic encoding...
That being said the font object does include both an /Encoding entry and a /ToUnicode entry. Thus, if the method CGPDFStringCopyTextString was given a reference to the font here and really tried, it would easily be able to correctly translate those bytes into the corresponding text. That it doesn't achieve anything decent, seems to indicate that it simply does not have the information which font to interpret the bytes for --- I don't assume it doesn't try...
For accurate text extraction, therefore, you have to interpret the bytes in the CGPDFStringRef yourself using the information of the the font in the content stream. If you don't want to do that from scratch, you might be interested in PDFKitten, a framework for extracting data from PDFs in iOS. While it is not yet perfect (some font structures can baffle it), it is a good starting point.