How To read control characters in a pdf using java

2020-07-30 02:53发布

问题:

I'm using PDFBox to read PDF files. But some characters are not printing well and printing like control characters. Some one help to read the values from the control characters. I've attached the image Kindly have a look at that image Sample PDF:

Screenshot:

Sample Code

class PDFManager {

   private PDFParser parser;
   private PDFTextStripper pdfStripper;
   private PDDocument pdDoc ;
   private COSDocument cosDoc ;

   private String Text ;
   private String filePath;
   private File file;

   public PDFManager() {

   }

   public String ToText() throws IOException {
       this.pdfStripper = null;
       this.pdDoc = null;
       this.cosDoc = null;
       file = new File(filePath);
       parser = new PDFParser(new FileInputStream(file));

       parser.parse();
       cosDoc = parser.getDocument();
       pdfStripper = new PDFTextStripper(); 
       pdDoc = new PDDocument(cosDoc);  
       pdDoc.getNumberOfPages();
       pdfStripper.setStartPage(3);
       pdfStripper.setEndPage(4); 
       Text = pdfStripper.getText(pdDoc);

       return Text;
   }

   public void setFilePath(String filePath) {
       this.filePath = filePath;
   }
}

回答1:

The reason why you get both correct Tamil letters and incorrect control sequences, is that the respective fonts

  • don't have a ToUnicode map and
  • have an Encoding entry using non-standard names for some glyphs.

In such a situation PDFBox cannot properly extract associated characters without help.

To help PDFBox you have to check whether in all the documents (or at least in a sufficiently big subset to be of interest) for each non-standard name the drawn glyphs are identical. If this is the case, you can tell PDFBox to add mappings from each of these names to the Unicode value of the respectively drawn letter to its reservoir of known glyph mappings.

In more detail:

The issue

I'll illustrate the issue here with an example.

On the page 3 mentioned by the OP the first text is drawn using instructions equivalent to these:

/R9 8.04 Tf
0.999418 0 0 1 519.6 791.721 Tm
[<01>6.75242<0C>-0.371893<0D>4.89295<07>3.77727<14>-6.13989<35>-4.51376<02>-5.00233<0F>187.988]TJ 

(I merely changed the representation of the strings to hexadecimal as the individual codes mostly are in the control character range and, therefore, would not display properly here.)

The font R9 of this page does not have a ToUnicode map. Neither are there any ActualText entries. Thus, PDFBox can merely use the Encoding entry of the font:

<<
  /BaseEncoding/WinAnsiEncoding
  /Differences[1
    /u0BC6/u0B9A/g125/u0BC8/u0BA9/g121/u0B9F/u0BAE
    /u0BB1/g123/space/u0BA4/u0BBE/g148/u0BBF/u0B8E
    /g122/u0BAA/u0BAF/g129/g130/g178/g127/u0B92
    /g162/g116/u0B95/u0BC0/g158/u0BA8/u0BB2/colon
    /u0B85/g117/g173/g132/u0BB3/g182/g142/one
    /period/g175/u0BB5/u0BB0/g126/u0B86/u0BC7/g186
    /g156/g131/g143/two/g118/g133/g190/hyphen
    /zero/five/g171/g120/g146/g169/g152/parenleft
    /seven/parenright/three/g180/u0BA3/eight/g136/u0BB4
    /u0B9C/four/six/g124/nine/g135/slash/g172
    /comma/u0B87/numbersign/g128/g147/g160/u0B9E/u0B89
    /u0BB7/g119/g157/g167/g191/g188/g170/g145
    /g181/u0BB8/u0B90/uni25CC/u0BCD/u0BB9/u0BC1/u0B88
    /g163/u0BD7/g184/u0B8F/g174/g153/g138/g185
    /g134/g149/g176]
  /Type/Encoding
>> 

As you see it first claims a base encoding WinAnsiEncoding which can be ignored because more or less all the mappings in the range of the codes the font uses are then replaced in the Differences array.

In the Differences array you can find

  • a number of standard names like comma and two;
  • many names representing unicode code points using the uXXXX scheme, like u0BC6 and u0B9A;
  • one name representing a unicode code point using the uniXXXX scheme: uni25CC; and
  • many completely non-standard names using a gXXX scheme like g121 and g176.

PDFBox supports standard names (obviously) and additionally both used unicode code point naming variants (which are often found and whose interpretation is very straight forward).

It does not support other names out of the box.

Thus, the text extracted for that first text drawing instruction is:

<01> - /u0BC6 - 0BC6 - ெ
<0C> - /u0BA4 - 0BA4 - த
<0D> - /u0BBE - 0BBE - ா
<07> - /u0B9F - 0B9F - ட
<14> - /g129 ?? 0014 - <DEVICE CONTROL FOUR>
<35> - /g118 ?? 0035 - 5
<02> - /u0B9A - 0B9A - ச
<0F> - /u0BBF - 0BBF - ி

resulting in your first line of extracted text:

By the way, this corresponds to this section from the actual PDF:

A possible way to allow proper extraction

PDFBox provides mechanisms allowing you to add names to its map of known names. If those gXXX names regularly represent the same respective character in your documents, therefore, you can tweak PDFBox text extraction to meet your requirements.

The stable PDFBox version 1.8.X use a different mechanism than the 2.0.0 release candidates. Thus:

For PDFBox 1.8.X you have to create a glyph list text file. For each glyph it contains one linewith 2 semicolon-delimited fields, the glyph name and the Unicode scalar value, e.g.

A;0041
AE;00C6

Then you define a system property glyphlist_ext pointing towards that list, e.g. when starting your program

java -Dglyphlist_ext=/path/to/my/extra/glyphs ...

For PDFBox 2.0.0 this mechanism has been replaced and moved multiple times, I have no idea which is the current one.

While working on PDFBOX-2379, an exception has been introduced to be thrown if the above mentioned system property is found:

throw new UnsupportedOperationException("glyphlist_ext is no longer supported, "
    + "use GlyphList.DEFAULT.addGlyphs(Properties) instead");

Unfortunately GlyphList does not have that method addGlyphs anymore.

While working on PDFBOX-2380 it has been removed and replaced:

I've replaced the static DEFAULT glyph list with a getAdobeGlyphList() method, as some PDFBox font internals require this to be the AGL and not some other additional glyph list. The loading and use of the additional glyphlist is application specific and so has been moved to PDFStreamEngine, where the getGlyphList() method can be overridden to pass custom glyph lists to fonts.

Unfortunately, PDFStreamEngine does not have that getGlyphList method anymore.

And I'm not currently in a mood to continue hunting high and low to find that feature again. Arg.

The making-of

In a comment the OP asked how I retrieved the information above from the PDF file in question.

First of all I used a PDF internals browsing application, e.g. iText RUPS or PDFBox PDFDebugger, to inspect the PDF and the PDF specification ISO 32000-1 to understand what I'm inspecting.

The OP in particular pointed to page 3 of his document, so I looked for the first text showing operators (cf. ISO 32000-1 section 9.4.3) in the content stream (cf. ISO 32000-1 section 7.8.2) of that page (cf. ISO 32000-1 section 7.7.3.3):

These almost are the instructions I quoted above. As you see, though, the character strings unfortunately cannot be inspected here because their contents are mostly in the Unicode control character range. Thus, I saved the contents stream (right-click, context menu) and inspected a hex view of those instructions

Using these information I created the instruction quote above.

The font (cf. ISO 32000-1 section 9.6) selected in those instructions is R9 (cf. ISO 32000-1 section 9.3.1), so I continued by looking at the font resource (cf. ISO 32000-1 section 7.8.3) with that name on page 3, first unsuccessfully searching a ToUnicode entry (cf. ISO 32000-1 section 9.10.3), then succeeding in finding an Encoding (cf. ISO 32000-1 section 9.6.6):

This I copied and prettied-up somewhat to get the encoding quote above.

From these information I manually created the table with the glyph ids (from the text showing operation in the instruction block), the corresponding name (from the encoding differences), the assumed unicode code point (derived from uXXXX names, for gXXX names the glyph id again), and the character (from one of the many Unicode table sites on the internet).

To find the corresponding section from the actual PDF page I took the final two arguments of the Tm text matrix setting operation (cf. ISO 32000-1 section 9.4.2) taking the aggregated transformation matrix changes (cf. ISO 32000-1 section 8.4) into account. These are the coordinates of the start of the base line of the text drawn by the following text showing instruction.