I am extracting the details of AcroForm fields from PDF forms with Apache PDFBox 2.0.1. For a radio button field I dig up the appearance dictionary; I'm interested in the /N and /D entries (the normal and "down" appearances). Like this (interactive BeanShell):
field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
    ap = annot.getAppearance();
    keys = ap.getCOSObject().getDictionaryObject("N").keySet();
    keyList = new ArrayList(keys.size());
    for (cosKey : keys) { keyList.add(cosKey.getName()); }
    print(String.join("|", keyList));
}
The output is
Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off
The question mark blotches should be Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.
Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?
This sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf
Changing the assumed encoding
PDFBox's interpretation of the encoding of bytes in names (only names can be used as dictionary keys in PDFs) takes place in BaseParser.parseCOSName() when reading the name from the source PDF. After reading the name bytes and interpreting the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8 encoded. To change this, therefore, you have to patch this PDFBox class and replace the charset used in that final decoding step.
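The decoding step described above can be illustrated without PDFBox. The following is only a minimal sketch, not the PDFBox source: decodeName is a hypothetical helper, and the raw byte 0xE5 for "å" is an assumption based on the ISO-8859-1 encoding RUPS shows for the keys.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NameDecodingSketch {

    // Hypothetical helper, not PDFBox code: expands #xx escapes in the raw
    // name bytes, then decodes the result with the charset the caller picks.
    static String decodeName(byte[] raw, Charset charset) {
        ByteArrayOutputStream expanded = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length; i++) {
            int current = raw[i] & 0xFF;
            if (current == '#' && i + 2 < raw.length) {
                int high = Character.digit(raw[i + 1] & 0xFF, 16);
                int low = Character.digit(raw[i + 2] & 0xFF, 16);
                if (high >= 0 && low >= 0) {
                    // Valid escape: emit the single byte it stands for.
                    expanded.write(high * 16 + low);
                    i += 2;
                    continue;
                }
            }
            expanded.write(current);
        }
        return new String(expanded.toByteArray(), charset);
    }

    public static void main(String[] args) {
        // Assumed raw name bytes: "R", 0xE5 (ISO-8859-1 "å"), "cksta".
        byte[] raw = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'};

        String asUtf8 = decodeName(raw, StandardCharsets.UTF_8);
        String asLatin1 = decodeName(raw, StandardCharsets.ISO_8859_1);

        // 0xE5 alone is not valid UTF-8, so it becomes U+FFFD.
        System.out.println(asUtf8.indexOf('\uFFFD') >= 0);
        // ISO-8859-1 maps every byte to a character, so "å" survives.
        System.out.println(asLatin1.equals("R\u00e5cksta"));
    }
}
```

Passing the charset as a parameter is merely the design choice of this sketch; in PDFBox itself the charset is fixed inside the parser, which is why patching is needed.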
Is PDFBox correct here?
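According to the specification (section 7.3.5 Name Objects, ISO 32000-1), when treating a name object as text, the sequence of bytes should be interpreted as UTF-8. BaseParser.parseCOSName() implements just that. PDFBox's implementation is not completely correct, though, as the very act of interpreting the name as a string without need is wrong: according to the same section, the bytes making up a name are ordinarily never treated as text.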
Thus, PDF libraries should handle names as byte arrays as long as possible and only derive a string representation when it is explicitly required; only then should the recommendation above (to assume UTF-8) play a role. The specification even indicates where this may cause trouble: names defined by different byte sequences are different names, even if their UTF-8 interpretations represent the same text.
Another situation becomes apparent in the document at hand: if the sequence of bytes constitutes no valid UTF-8, it still is a valid name. But such names are changed by the method above, as any unparsable byte or subsequence is replaced by the Unicode Replacement Character '�'. Thus, different names may collapse into a single one.
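The collapse of different names can be reproduced with plain Java charset decoding. This is a sketch under the assumption that the names are decoded the way new String(bytes, UTF_8) does; the two byte sequences are made-up examples, not taken from the sample PDF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NameCollapseSketch {

    // Decodes name bytes as UTF-8; invalid input becomes U+FFFD.
    static String asUtf8(byte[] nameBytes) {
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two distinct names: "R" followed by ISO-8859-1 "å" or "ä".
        byte[] first = {'R', (byte) 0xE5};
        byte[] second = {'R', (byte) 0xE4};

        // As byte arrays the names differ.
        System.out.println(Arrays.equals(first, second)); // false
        // After UTF-8 decoding both are "R" plus U+FFFD.
        System.out.println(asUtf8(first).equals(asUtf8(second))); // true
    }
}
```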
Another issue is that when writing back a PDF, PDFBox is not acting symmetrically but instead interprets the String representation of the name (which has been retrieved as a UTF-8 interpretation if read from a PDF) using pure US_ASCII, cf. COSName.writePDF(OutputStream). Thus, any interesting Unicode character is replaced with the US_ASCII default replacement character, which I assume to be '?'.
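What US_ASCII encoding does to a non-ASCII name can be checked in isolation. The sketch below only exercises the standard Java charset API, not COSName.writePDF itself; roundTripThroughAscii is a hypothetical helper and the sample name is taken from the question.

```java
import java.nio.charset.StandardCharsets;

class AsciiWriteSketch {

    // Encodes the name as US_ASCII and decodes it again, showing what
    // survives a write that uses pure US_ASCII.
    static String roundTripThroughAscii(String name) {
        byte[] written = name.getBytes(StandardCharsets.US_ASCII);
        return new String(written, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "ä" (U+00E4) has no US_ASCII mapping and is replaced by '?'.
        System.out.println(roundTripThroughAscii("Nyn\u00e4shamn")); // Nyn?shamn
    }
}
```

For String.getBytes(Charset), at least, unmappable characters do end up as '?'.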
So it is quite fortunate that PDF names most often contain only ASCII characters... ;)
Historically
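According to the implementation notes from the PDF 1.4 reference, Acrobat 4.0 and earlier versions typically interpreted a name object treated as text in a host platform encoding.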
Thus, the sample document at hand seems to follow conventions from Acrobat 4, i.e. from the last century.
Source code excerpts are from PDFBox 2.0.0 but at first glance do not seem to have been changed in 2.0.1 or the development trunk.