Text extraction is empty and unknown for text has

Posted 2019-03-03 20:34

Question:

I have a PDF file in Arabic whose text uses a Type 3 font. When I extract the text using PDFBox, some characters are empty and their font is null. What is the problem?

code:

  protected void processTextPosition(TextPosition text) {
    String character = text.getCharacter();      // is empty
    String font = text.getFont().getBaseFont();  // is null
  }

stream produced with iText: ( dJ� v{d W�cG�)Tj

I am talking about these question marks. Why do I get the characters in this format?

These question marks appeared in my stream as the control characters "SOH-STX-ETX-EOT", not as a single character. The characters inside the PDF are shown as 'd' and 'J'!

Answer 1:

A Type 3 font is a user-defined font. For instance, a user can define that the character 'P' corresponds to the symbol for "The Artist Formerly Known As Prince" (TAFKAP), which is a glyph but not a letter from any known alphabet.

A glyph in a Type 3 font is a series of lines and shapes, and there's no way for a program such as iText or PDFBox to determine which character was meant. It is only normal that you get a question mark. For instance, which character would you use for the TAFKAP symbol?
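To make this concrete, here is a minimal toy model in plain Java. It is not the PDFBox or iText API; the class `Type3FontModel` and its methods are hypothetical names invented for illustration. It assumes the only route from a character code to text is an explicit lookup table (in a real PDF, an optional /ToUnicode entry in the font plays that role). When no such table entry exists, the extractor has a code and a drawing procedure, but nothing it could return as text.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a Type 3 font. Hypothetical; not PDFBox's or iText's API. */
class Type3FontModel {
    // Character code -> glyph name. In a Type 3 font the name is only an
    // arbitrary label for a drawing procedure (lines and shapes).
    private final Map<Integer, String> glyphNames = new HashMap<>();

    // Optional character code -> Unicode text (assumption: stands in for
    // the mapping a /ToUnicode entry would provide).
    private final Map<Integer, String> toUnicode = new HashMap<>();

    void defineGlyph(int code, String glyphName) {
        glyphNames.put(code, glyphName);
    }

    void mapToUnicode(int code, String text) {
        toUnicode.put(code, text);
    }

    /** Returns the text for the raw bytes of a string operand; unmapped codes yield "". */
    String extract(byte[] codes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned character code
            sb.append(toUnicode.getOrDefault(code, ""));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Type3FontModel font = new Type3FontModel();
        // Codes 1..4 are the SOH, STX, ETX, EOT bytes seen in the stream.
        for (int code = 1; code <= 4; code++) {
            font.defineGlyph(code, "g" + code);
        }
        byte[] shown = {1, 2, 3, 4};

        // No Unicode mapping: the glyphs can be drawn, but no text comes out.
        System.out.println(font.extract(shown).isEmpty());

        // With a mapping for code 1 (here the Arabic letter dal, chosen arbitrarily).
        font.mapToUnicode(1, "\u062F");
        System.out.println(font.extract(shown).length());
    }
}
```

The glyph names are deliberately never consulted in `extract`: for a user-defined font they carry no reliable meaning, which is exactly why the extracted character comes back empty.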

One of the following reasons applies to a PDF that contains Type 3 fonts:

The font was used to introduce symbols that don't exist in any font.
The font was used to obfuscate the content of the PDF so that its content can't be extracted.
The PDF wasn't created in an elegant way.

If the Type 3 font was used for normal characters, you'll need to use OCR to convert the content to normal text.
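Before falling back to OCR, it can help to check whether the extracted text is actually unusable. The following is only a heuristic sketch in plain Java, independent of any PDF library. The class `ExtractionCheck`, its method names, and the threshold value are assumptions made for illustration. It counts control characters (such as SOH, STX, ETX, EOT) and U+FFFD replacement characters, and it treats an empty result as fully suspicious.

```java
/** Heuristic check on extracted text. Hypothetical helper, not part of PDFBox or iText. */
class ExtractionCheck {

    /**
     * Returns the fraction of characters that are non-whitespace control
     * characters or U+FFFD. An empty string counts as 1.0 (assumption:
     * empty output means extraction failed).
     */
    static double suspiciousRatio(String extracted) {
        if (extracted.isEmpty()) {
            return 1.0;
        }
        int bad = 0;
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            boolean control = Character.isISOControl(c) && !Character.isWhitespace(c);
            if (control || c == '\uFFFD') {
                bad++;
            }
        }
        return (double) bad / extracted.length();
    }

    /** True when the suspicious fraction exceeds the caller-chosen threshold. */
    static boolean looksLikeItNeedsOcr(String extracted, double threshold) {
        return suspiciousRatio(extracted) > threshold;
    }

    public static void main(String[] args) {
        String fromType3 = "\u0001\u0002\u0003\u0004"; // SOH, STX, ETX, EOT
        String normal = "Hello, world";
        // The 0.3 threshold is an arbitrary illustrative value.
        System.out.println(looksLikeItNeedsOcr(fromType3, 0.3));
        System.out.println(looksLikeItNeedsOcr(normal, 0.3));
    }
}
```

A page flagged this way would then be rendered to an image and passed to an OCR engine; that step depends on the libraries you use and is not shown here.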