I have a PDF file in Arabic whose text uses a Type 3 font. When I extract the text with PDFBox, some characters come back empty and their font name is null. What is the problem?
Code:
protected void processTextPosition(TextPosition text) {
    String character = text.getCharacter();      // comes back as an empty string
    String font = text.getFont().getBaseFont();  // comes back as null
}
stream produced with iText: ( dJ� v{d W�cG�)Tj
I am asking about these question marks: why do I get the characters in this form?
In my stream the question marks actually appear as "SOH-STX-ETX-EOT", not as a single character. The characters inside the PDF are shown as 'd' and 'J'!
A Type 3 font is a user-defined font. For instance, a user can define that the character 'P' corresponds to the symbol for "The Artist Formerly Known As Prince" (TAFKAP), which is a glyph but not a letter from any known alphabet.
A glyph in a Type 3 font is a series of lines and shapes, and there's no way for a program such as iText or PDFBox to determine which character was meant. It is only normal that you get a question mark. For instance: which character would you use for this symbol?
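To make this concrete, here is a minimal Java sketch. It is illustrative only and is not PDFBox or iText code: the class name, the method name and the code-to-Unicode table are all hypothetical. The bytes inside a string shown with Tj are character codes that select glyph drawings in the font, so control characters such as SOH, STX, ETX and EOT are simply the codes 1 to 4. Unless something tells the extractor which Unicode character each code stands for, it has nothing meaningful to return.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a toy model of text extraction, not a real PDF library API.
class Type3DecodeSketch {

    // Translates each character code of a shown string to text using the given
    // (hypothetical) code-to-Unicode table. Codes without an entry become "?".
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // treat the byte as an unsigned character code
            out.append(toUnicode.getOrDefault(code, "?"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] shown = {1, 2, 3, 4}; // SOH, STX, ETX, EOT used as glyph codes

        // No mapping information at all: every code is unknown.
        System.out.println(decode(shown, new LinkedHashMap<>()));

        // Assumption for illustration: code 1 is known to draw the Arabic letter beh.
        Map<Integer, String> partial = new LinkedHashMap<>();
        partial.put(1, "\u0628");
        System.out.println(decode(shown, partial));
    }
}
```

With an empty table every code falls back to a question mark, and with a partial table only the mapped codes come out as real characters.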
When a PDF contains Type 3 fonts, one of the following reasons usually applies:
- The font was used to introduce symbols that don't exist in any font.
- The font was used to obfuscate the content of the PDF so that its content can't be extracted.
- The PDF wasn't created in an elegant way.
If the Type 3 font was used for normal characters, you'll need to use OCR to convert the content to normal text.
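One way to decide whether a page needs OCR is to check how much of the extracted text is unusable. The sketch below is only a heuristic under stated assumptions: the class and method names are hypothetical, "unusable" is taken to mean non-whitespace control characters and the Unicode replacement character U+FFFD, and the threshold is an arbitrary value you would have to tune for your documents.

```java
// Illustrative heuristic only: not part of PDFBox or iText.
class ExtractionCheckSketch {

    // Returns the fraction of code points that are non-whitespace control
    // characters or U+FFFD. An empty string counts as fully unusable.
    static double unusableRatio(String extracted) {
        if (extracted.isEmpty()) {
            return 1.0;
        }
        int bad = 0;
        int total = 0;
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            boolean control = Character.isISOControl(cp) && !Character.isWhitespace(cp);
            if (cp == 0xFFFD || control) {
                bad++;
            }
        }
        return (double) bad / total;
    }

    // Hypothetical decision rule: fall back to OCR when the ratio exceeds the threshold.
    static boolean needsOcr(String extracted, double threshold) {
        return unusableRatio(extracted) > threshold;
    }

    public static void main(String[] args) {
        String garbage = "\u0001\u0002\u0003\u0004"; // SOH, STX, ETX, EOT
        System.out.println(needsOcr(garbage, 0.3));
        System.out.println(needsOcr("plain readable text", 0.3));
    }
}
```

You would feed this the text returned by your extractor for each page and send only the pages that fail the check to OCR.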