How to get text extraction from PDF to work?

I need to extract the text from PDFs in Romanian. The characters ȚțȘșĂăÎîÂâ are not extracted correctly with PDFBox or Snowtide.

Here is a sample file that is not working: ftp://ftp.logos.md/Biblioteca/_Colectie_RO/2nefon.pdf

Any suggestions?

Tags: java pdf unicode

2 answers

劫难

#2 · 2020-02-07 05:05

I'm afraid the PDF the OP pointed at (2nefon.pdf) does not provide the information that the specification requires for text extraction.

Copying and pasting from Adobe Reader also exports the special characters incorrectly. Since Adobe Reader has quite good text extraction capabilities, this is already a bad sign.

Inspecting the file shows the problem. For example, let's look at the title:

[Screenshot of the title of 2nefon.pdf]

The corresponding segment of the content stream is:

/F1 24 Tf
-148.44 -26.16 TD
(VIA}A  {I  ~NV|}|TURILE) Tj
296.88 0 TD
( ) Tj
-308.16 -29.28 TD
(SFANTULUI  IERARH  NIFON) Tj

Let's check the used font F1:

469 0 obj
<< 
/Type /Font 
/Subtype /TrueType 
/Name /F1 
/BaseFont /TimesR 
/FirstChar 32 
/LastChar 255 
/Widths [ 250 333 444 722 500 833 778 [...] 500 500 500 500 500 500 500 ] 
/Encoding /WinAnsiEncoding 
/FontDescriptor 468 0 R 
>> 
endobj

Thus, the font claims to use WinAnsiEncoding without changes (no Differences).

A last look at the font descriptor:

468 0 obj
<< 
/Type /FontDescriptor 
/FontName /TimesR 
/Flags 34 
/FontBBox [ -167 -307 1009 913 ] 
/StemV 90 
/ItalicAngle 0 
/CapHeight 913 
/Ascent 913 
/Descent -307 
/FontFile2 474 0 R 
>> 
endobj

There is no hint here that the aforementioned WinAnsiEncoding might not be the whole truth.

According to the PDF specification ISO 32000-1:

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

If the font is a composite font [... cut short because the font F1 is no composite font ...]

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

(section 9.10.2 Mapping Character Codes to Unicode Values)
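
The first two lookup steps quoted above can be sketched roughly as follows for a simple font such as F1. This is only an illustrative model, not how PDFBox or any other library implements it: the class and method names are made up, and the two tables are tiny hand-copied excerpts of WinAnsiEncoding and the Adobe Glyph List covering just the codes seen in the title.

```java
import java.util.Map;

final class UnicodeLookupSketch {
    // Excerpt of WinAnsiEncoding: character code -> glyph name (hand-copied, incomplete).
    static final Map<Integer, String> WIN_ANSI_EXCERPT = Map.of(
            0x41, "A",
            0x7B, "braceleft",
            0x7C, "bar",
            0x7D, "braceright",
            0x7E, "asciitilde");

    // Excerpt of the Adobe Glyph List: glyph name -> Unicode text (hand-copied, incomplete).
    static final Map<String, String> AGL_EXCERPT = Map.of(
            "A", "A",
            "braceleft", "{",
            "bar", "|",
            "braceright", "}",
            "asciitilde", "~");

    /**
     * Maps a character code to Unicode in the priority order of the quote.
     * toUnicode models the optional ToUnicode CMap; pass null if the font has none (as with F1).
     */
    static String map(int code, Map<Integer, String> toUnicode) {
        // Method 1: a ToUnicode CMap, if present, wins.
        if (toUnicode != null && toUnicode.containsKey(code)) {
            return toUnicode.get(code);
        }
        // Method 2: predefined encoding -> glyph name -> Adobe Glyph List.
        String glyphName = WIN_ANSI_EXCERPT.get(code);
        if (glyphName != null && AGL_EXCERPT.containsKey(glyphName)) {
            return AGL_EXCERPT.get(glyphName);
        }
        // Nothing matched: use the Unicode replacement character as a placeholder.
        return "\uFFFD";
    }

    public static void main(String[] args) {
        // F1 has no ToUnicode CMap, so code 0x7D can only resolve to a right brace.
        System.out.println(map(0x7D, null));
        // Had the producer embedded a ToUnicode CMap, the intended letter could be recovered.
        System.out.println(map(0x7D, Map.of(0x7D, "\u021A")));
    }
}
```

With no ToUnicode CMap, a conforming extractor has nothing better than the WinAnsiEncoding result, which is why the braces and bars end up in the extracted text.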

So text extraction and copy-and-paste are fully following the specification when they report that, according to the document, those two lines say:

VIA}A {I ~NV|}|TURILE
SFANTULUI IERARH NIFON

You might want to check, though, whether e.g. Ă (capital A with breve) is always exported as |. This is not unlikely: mapping special characters to the character codes of symbols was quite common for a time in the last century. If that is indeed the case, a global search and replace after text extraction gives you the desired text.
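
A minimal sketch of such a search-and-replace step, assuming the substitution really is consistent throughout the file. The four pairs below are inferred only from the title shown above (} for Ț, { for Ș, ~ for Î, | for Ă). The lowercase letters and Â do not appear there, so their codes would have to be added after inspecting more of the extracted text. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class RomanianGlyphFix {
    // Substitutions inferred from the title only; extend after checking more text.
    private static final Map<Character, Character> SUBSTITUTIONS = new HashMap<>();
    static {
        SUBSTITUTIONS.put('}', '\u021A'); // T with comma below (U+021A)
        SUBSTITUTIONS.put('{', '\u0218'); // S with comma below (U+0218)
        SUBSTITUTIONS.put('~', '\u00CE'); // I with circumflex (U+00CE)
        SUBSTITUTIONS.put('|', '\u0102'); // A with breve (U+0102)
    }

    /** Replaces every mapped symbol in the extracted text; other characters pass through. */
    static String fix(String extracted) {
        StringBuilder result = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            Character replacement = SUBSTITUTIONS.get(c);
            result.append(replacement != null ? replacement : c);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The string is what the extractor returns for the first title line.
        System.out.println(fix("VIA}A {I ~NV|}|TURILE"));
    }
}
```

Note that this replaces every occurrence, so any genuine brace, bar, or tilde in the document would be converted as well. It is worth checking a few pages by eye before applying it to the whole text.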


chillily

#3 · 2020-02-07 05:24

How about iText: http://itextpdf.com/

"iText® is an open source library that allows you to create and manipulate PDF documents. It enables developers looking to enhance web- and other applications with dynamic PDF document generation and/or manipulation."
