For some reason iTextSharp is now reading a PDF that contains numbers such as 4123 as 4*23, where the * is actually an arrow pointing up. I'm not sure why this is happening. Please help.
Thanks.
Sample file is located here: https://dl.dropboxusercontent.com/u/116833/SAMPLE%20PDF.pdf
The reason for the arrows is that the file deliberately tries to mislead text extractors that follow Section 9.10.2, "Mapping Character Codes to Unicode Values", of the PDF specification ISO 32000-1, while not confusing those that prefer ActualText marked-content sequence entries: the former method is led to believe the '3's are arrows, while the latter is told the '3's are threes.
Most likely this is done to prevent automated text extraction while still allowing manual copy & paste: Adobe Reader prefers the ActualText marked-content sequence entries (so manual extraction works fine), whereas many programmatic extractors prefer the former method.
As far as I can tell from the relevant sections of the specification, it does not prefer either way over the other.
Details
E.g. look at the first part number:
BT
/T1_1 1 Tf
10 0 0 10 69.1456 750.2834 Tm
(1 )Tj
ET
EMC
/Span <</MCID 14 >>BDC
BT
/T1_1 1 Tf
10 0 0 10 89.5488 750.2834 Tm
(2)Tj
/Span<</ActualText<FEFF0033>>> BDC
(3)Tj
EMC
(412109 )Tj
ET
EMC
As you can see, the '3' is marked with an ActualText entry indicating that it is indeed a three (<FEFF0033> is a long way of writing the Unicode digit three: the UTF-16BE byte order mark FEFF followed by U+0033).
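To make the ActualText value concrete, here is a minimal Python sketch that decodes such a hex string. The helper name decode_pdf_hex_text_string is hypothetical, and the Latin-1 fallback is only an assumption standing in for PDFDocEncoding; this is not how iTextSharp does it internally.

```python
def decode_pdf_hex_text_string(hex_literal: str) -> str:
    """Decode a PDF hex string such as <FEFF0033> into text."""
    # Drop the surrounding angle brackets and turn the hex digits into bytes.
    raw = bytes.fromhex(hex_literal.strip("<>"))
    # A leading FE FF byte order mark signals UTF-16BE text.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Assumption: approximate PDFDocEncoding with Latin-1 for everything else.
    return raw.decode("latin-1")


actual_text = decode_pdf_hex_text_string("<FEFF0033>")
print(repr(actual_text))
```

The two bytes after the byte order mark, 00 33, are the code point U+0033, i.e. the digit three.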
The font T1_1, on the other hand, offers a ToUnicode stream containing the mapping
...
<30> <0030>
<31> <0031>
<32> <0032>
<33> <0018>
<34> <0034>
<35> <0035>
...
As you can see, while the other digits (0x30 is '0', 0x31 is '1', ..., 0x39 is '9') are mapped identically, the '3', i.e. 0x33, is mapped to the Unicode code point U+0018, and
U+0018 is the Unicode hex value of the character <control>, which is categorized as "control character" in the Unicode 6.0 character table.
"<control>" was previously named "CANCEL" in older versions of Unicode.
(cf. http://www.marathon-studios.com/unicode/U0018/Control)
In some contexts this control character is displayed as an upward-pointing arrow; for example, code page 437 renders the byte 0x18 as ↑.
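The difference between the two extraction strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SEGMENTS is a hand-written, hypothetical model of the second text object above rather than the output of a real PDF parser, and character codes not listed in the quoted ToUnicode excerpt are assumed to map to themselves.

```python
# Excerpt of the ToUnicode mapping quoted above: character code -> code point.
TO_UNICODE = {
    0x30: 0x0030, 0x31: 0x0031, 0x32: 0x0032,
    0x33: 0x0018,  # the manipulated entry: '3' -> U+0018
    0x34: 0x0034, 0x35: 0x0035,
}

# Hypothetical model of the text object: (character codes, ActualText or None).
SEGMENTS = [
    (b"2", None),
    (b"3", "3"),  # wrapped in /Span<</ActualText<FEFF0033>>> BDC ... EMC
    (b"412109 ", None),
]


def map_codes(codes: bytes) -> str:
    # Assumption: codes missing from the excerpt map to themselves.
    return "".join(chr(TO_UNICODE.get(code, code)) for code in codes)


def extract_via_to_unicode(segments) -> str:
    # Strategy 1: ignore ActualText and trust the font's ToUnicode map.
    return "".join(map_codes(codes) for codes, _ in segments)


def extract_preferring_actual_text(segments) -> str:
    # Strategy 2: use ActualText where present, ToUnicode otherwise.
    return "".join(
        actual_text if actual_text is not None else map_codes(codes)
        for codes, actual_text in segments
    )


print(repr(extract_via_to_unicode(SEGMENTS)))
print(repr(extract_preferring_actual_text(SEGMENTS)))
```

The first strategy yields the part number with U+0018 in place of the '3', which is what shows up as an arrow; the second yields the digits as a human reader sees them.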