How to get text extraction from PDF to work?

Posted 2020-02-07 04:55

I need to extract the text from PDFs in the Romanian language. The characters ȚțȘșĂăÎîÂâ are not extracted correctly with pdfBox or Snowtide.

Here is a sample file that is not working: ftp://ftp.logos.md/Biblioteca/_Colectie_RO/2nefon.pdf

Any suggestions?

Tags: java pdf unicode
2 answers
劫难
#2 · 2020-02-07 05:05

I'm afraid the PDF the OP pointed at (2nefon.pdf) does not provide the information required for text extraction according to the spec.

Copying and pasting from Adobe Reader also exports the special characters incorrectly. Since Adobe Reader has quite good text extraction capabilities, this is already a bad sign.

Inspecting the file shows the problem. For example, let's look at the title:

[Screenshot of the title of 2nefon.pdf]

The corresponding segment of the content stream is:

/F1 24 Tf
-148.44 -26.16 TD
(VIA}A  {I  ~NV|}|TURILE) Tj
296.88 0 TD
( ) Tj
-308.16 -29.28 TD
(SFANTULUI  IERARH  NIFON) Tj

Let's check the used font F1:

469 0 obj
<< 
/Type /Font 
/Subtype /TrueType 
/Name /F1 
/BaseFont /TimesR 
/FirstChar 32 
/LastChar 255 
/Widths [ 250 333 444 722 500 833 778 [...] 500 500 500 500 500 500 500 ] 
/Encoding /WinAnsiEncoding 
/FontDescriptor 468 0 R 
>> 
endobj 

Thus, the font claims to use WinAnsiEncoding without changes (no Differences).

A last look at the font descriptor:

468 0 obj
<< 
/Type /FontDescriptor 
/FontName /TimesR 
/Flags 34 
/FontBBox [ -167 -307 1009 913 ] 
/StemV 90 
/ItalicAngle 0 
/CapHeight 913 
/Ascent 913 
/Descent -307 
/FontFile2 474 0 R 
>> 
endobj

There is no hint here either that the aforementioned WinAnsiEncoding might not be the whole truth.

According to the PDF specification ISO 32000-1

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font [... cut short because the font F1 is no composite font ...]

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(section 9.10.2 Mapping Character Codes to Unicode Values)
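To make the priority order concrete for a simple font like F1, here is a minimal Java sketch. It is not PDFBox code: `SimpleFontDecoder` and `decode` are hypothetical names, Java's `windows-1252` charset stands in for WinAnsiEncoding (an assumption; the two agree on the printable ASCII codes that matter here), and the ToUnicode CMap is modeled as a plain map from character code to string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper illustrating the lookup order of section 9.10.2
// for a simple (single-byte) font. Not part of any PDF library.
final class SimpleFontDecoder {
    // Assumption: windows-1252 approximates WinAnsiEncoding.
    private static final Charset WIN_ANSI = Charset.forName("windows-1252");

    // toUnicode models a ToUnicode CMap: character code -> Unicode string.
    // Pass an empty map when the font dictionary has no ToUnicode entry.
    static String decode(byte[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                // Method 1: the ToUnicode CMap wins when present.
                out.append(mapped);
            } else {
                // Method 2: fall back to the predefined encoding.
                out.append(new String(new byte[] {b}, WIN_ANSI));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] title = "VIA}A".getBytes(StandardCharsets.ISO_8859_1);
        // No ToUnicode, as in 2nefon.pdf: code 0x7D stays "}".
        System.out.println(decode(title, Map.of()));
        // With a hypothetical ToUnicode entry, 0x7D would become U+021A.
        System.out.println(decode(title, Map.of(0x7D, "\u021A")));
    }
}
```

With an empty map the first call yields `VIA}A`, which is what the extractors report. The second call only shows what a ToUnicode entry would have changed; the sample file contains no such entry.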

So text extraction and copy and paste follow the specification exactly when they report that, according to the document, those two lines say:

VIA}A {I ~NV|}|TURILE
SFANTULUI IERARH NIFON

You might want to check, though, whether each special character is always extracted as the same symbol, e.g. whether Ă (capital A with breve) always comes out as |. That is not unlikely: mapping special characters onto the character codes of symbols was quite common for a time in the last century. If that is indeed the case, a global search and replace after text extraction gives you the desired text.
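A minimal sketch of such a post-processing step in Java follows. The class name `RomanianGlyphFixer` is hypothetical, and the substitution table is an assumption inferred only from the title line above (} as Ț, { as Ș, ~ as Î, | as Ă). The codes used for the lowercase letters and for Â would have to be read off other parts of the document, and any genuine braces, bars or tildes in the text would be replaced as well.

```java
import java.util.Map;

// Hypothetical post-processor for text extracted from 2nefon.pdf.
final class RomanianGlyphFixer {
    // Assumed mapping, inferred from the title only; extend after
    // checking more of the document.
    private static final Map<Character, Character> SUBSTITUTIONS = Map.of(
        '}', '\u021A', // Ț (T with comma below)
        '{', '\u0218', // Ș (S with comma below)
        '~', '\u00CE', // Î (I with circumflex)
        '|', '\u0102'  // Ă (A with breve)
    );

    // Replaces every mapped symbol; all other characters pass through.
    static String fix(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            char fixed = SUBSTITUTIONS.getOrDefault(c, c);
            out.append(fixed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Console output depends on the terminal's encoding.
        System.out.println(fix("VIA}A  {I  ~NV|}|TURILE"));
    }
}
```

The input to `fix` would be whatever string the extractor of your choice returns, so the approach does not depend on a particular PDF library.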

chillily
#3 · 2020-02-07 05:24

How about iText: http://itextpdf.com/

"iText® is an open source library that allows you to create and manipulate PDF documents. It enables developers looking to enhance web- and other applications with dynamic PDF document generation and/or manipulation."
