I'm trying to extract the text from a PDF file, and I need its position, width, height, and font.
I have tried many tools, but the most useful and complete solution looks to be PDFMiner, and in this case, more precisely, pdf2txt.py.
I followed the docs and the examples and tried to extract the text "Learn More" from my PDF using this command:
pdf2txt.py -Y normal -t xml -o buttons.xml buttons.pdf
And the output, buttons.xml, looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<pages>
<page id="1" bbox="0.000,0.000,799.900,449.944" rotate="0">
<textbox id="0" bbox="164.979,213.240,247.680,235.944">
<textline bbox="164.979,213.240,247.680,235.944">
<text font="KZNUUP+HelveticaNeue-Bold" bbox="164.979,213.240,178.978,235.944" size="22.704">(cid:51)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="173.280,213.240,187.278,235.944" size="22.704">(cid:76)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="181.315,213.240,195.313,235.944" size="22.704">(cid:72)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="189.350,213.240,203.348,235.944" size="22.704">(cid:89)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="194.795,213.240,208.793,235.944" size="22.704">(cid:85)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="203.096,213.240,217.094,235.944" size="22.704">(cid:3)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="206.987,213.240,220.986,235.944" size="22.704">(cid:52)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="219.684,213.240,233.682,235.944" size="22.704">(cid:86)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="228.237,213.240,242.235,235.944" size="22.704">(cid:89)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="233.682,213.240,247.680,235.944" size="22.704">(cid:76)</text>
<text></text>
</textline>
</textbox>
<textgroup bbox="164.979,213.240,419.659,235.944">
<textbox id="0" bbox="164.979,213.240,247.680,235.944" />
</textgroup>
</page>
</pages>
The first character should be an L, and 51 (cid:51) doesn't seem to match any of the characters in my sentence, according to either the ASCII table or the UTF-8 table.
So, as the title says, I wonder what these are, and how to use these (cid:51)... values?
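To make the mismatch concrete, here is a minimal sketch (Python 3, standard library only) that pulls the CID number, font, and bounding box out of each `<text>` element and compares the CID with the character that has the same ASCII code. The `SAMPLE` string is an abbreviated copy of the output above, and `extract_glyphs` is a hypothetical helper name, not part of PDFMiner.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the pdf2txt.py output above (first three glyphs only).
SAMPLE = """<pages>
<page id="1" bbox="0.000,0.000,799.900,449.944" rotate="0">
<textbox id="0" bbox="164.979,213.240,247.680,235.944">
<textline bbox="164.979,213.240,247.680,235.944">
<text font="KZNUUP+HelveticaNeue-Bold" bbox="164.979,213.240,178.978,235.944" size="22.704">(cid:51)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="173.280,213.240,187.278,235.944" size="22.704">(cid:76)</text>
<text font="KZNUUP+HelveticaNeue-Bold" bbox="181.315,213.240,195.313,235.944" size="22.704">(cid:72)</text>
<text></text>
</textline>
</textbox>
</page>
</pages>"""

CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def extract_glyphs(xml_text):
    """Return (cid, font, bbox) for every <text> element holding a (cid:N) token."""
    glyphs = []
    for node in ET.fromstring(xml_text).iter("text"):
        match = CID_TOKEN.fullmatch(node.text or "")
        if match:
            # bbox is "x0,y0,x1,y1"; width and height follow from the corners.
            bbox = tuple(float(v) for v in node.get("bbox").split(","))
            glyphs.append((int(match.group(1)), node.get("font"), bbox))
    return glyphs


glyphs = extract_glyphs(SAMPLE)
for (cid, font, bbox), expected in zip(glyphs, "Lea"):
    # chr(cid) is what the CID would be if it were a plain ASCII code.
    print(cid, repr(chr(cid)), "expected", repr(expected), font, bbox)
```

Read as ASCII, 51 would be the digit 3 rather than L, so the CIDs here are clearly not character codes.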
EDIT
So I found that instead of emitting the real character, the program writes (cid:%d) because it cannot get a Unicode string for it.
It first calls this function to write the char:
def render_char(self, matrix, font, fontsize, scaling, rise, cid):
    try:
        text = font.to_unichr(cid)
        assert isinstance(text, unicode), text
    except PDFUnicodeNotDefined:
        text = self.handle_undefined_char(font, cid)
But the try block fails with PDFUnicodeNotDefined, which is caught and calls:
def handle_undefined_char(self, font, cid):
    if self.debug:
        print >>sys.stderr, 'undefined: %r, %r' % (font, cid)
    return '(cid:%d)' % cid
And that's how I end up with a file containing all these (cid:%d).
I'm fairly new to Python and I'm trying to figure out a way to recognize these chars. There should be one, no? Does anyone have any idea?
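One detail about that control flow: in Python, a failed assert raises AssertionError, which an `except PDFUnicodeNotDefined` clause would not catch. So the fallback is most likely triggered by `font.to_unichr(cid)` itself raising the exception. The sketch below (Python 3) reproduces only that try/except shape. Every class in it is a hypothetical stand-in written for illustration, not PDFMiner's actual code.

```python
class UnicodeNotDefined(Exception):
    """Stand-in for PDFMiner's PDFUnicodeNotDefined (illustration only)."""


class FontWithoutUnicodeMap:
    """Stand-in font whose CID lookup always fails."""

    def to_unichr(self, cid):
        raise UnicodeNotDefined(cid)


class FontReturningBytes:
    """Stand-in font that returns a non-text value, so the assert fails."""

    def to_unichr(self, cid):
        return b"L"


def render_char(font, cid):
    # Same try/except shape as the excerpt above, reduced to the text lookup.
    try:
        text = font.to_unichr(cid)
        assert isinstance(text, str), text
    except UnicodeNotDefined:
        text = "(cid:%d)" % cid
    return text


# The lookup failure is caught and turned into the placeholder.
print(render_char(FontWithoutUnicodeMap(), 51))

# The assert failure is NOT caught by the except clause; it propagates.
try:
    render_char(FontReturningBytes(), 51)
except AssertionError:
    print("AssertionError propagated")
```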
To understand how to interpret the CIDs, you need to know a couple of things:
The Registry-Ordering-Supplement (ROS) information for the font in question. It's usually something like 'Adobe-Japan1-5' and is an informational property stored in the font. The ROS determines how the CIDs are to be interpreted.
Armed with the ROS info, select a compatible CMap and decode through that. You can find CMap files for the Adobe-defined ROSes at http://sourceforge.net/projects/cmap.adobe/files/
More information on CID and CMaps direct from the inventors is available at http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf
See "decode CID font codes to equivalent ASCII characters" for more information.
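Once a CID-to-character table is available, substituting the `(cid:N)` placeholders is mechanical. The sketch below (Python 3) shows only that last step. The table in it is not a real CMap: it is inferred from the sample in the question, where the ten glyphs are known to spell "Learn More". `decode_cid_tokens` and `cid_to_char` are hypothetical names, and for a real document the table would have to come from the font's CMap instead.

```python
import re

# CIDs in the order they appear in the question's XML output.
cids = [51, 76, 72, 89, 85, 3, 52, 86, 89, 76]
known_text = "Learn More"

# Assumption: one glyph per character, in reading order.
cid_to_char = dict(zip(cids, known_text))

CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def decode_cid_tokens(s, table):
    """Replace each (cid:N) token with table[N]; leave unknown CIDs untouched."""

    def repl(match):
        return table.get(int(match.group(1)), match.group(0))

    return CID_TOKEN.sub(repl, s)


raw = "".join("(cid:%d)" % c for c in cids)
print(decode_cid_tokens(raw, cid_to_char))  # prints: Learn More
```

A table built this way only covers glyphs that have been seen alongside known text, and CIDs are specific to the font, so it should not be reused across fonts.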