PDFBox 2.0.7 ExtractText not working but 1.8.13 does

Posted 2019-02-20 14:59

Question:

I hope someone has an idea of what is going wrong when extracting text from a PDF with PDFBox 2.0.7. The result is very strange:

Using 1.8.13, the command java -jar pdfbox-app-1.8.13.jar ExtractText -sort -nonSeq test.pdf leads to

Deutsche Bank Privat- und Geschäftskunden AG

Bruttoertrag 43,80 USD 37,15 EUR
Kapitalertragsteuer (KESt) - 5,36 USD - 4,55 EUR
Solidaritätszuschlag auf KESt - 0,29 USD - 0,25 EUR
Umrechnungskurs USD zu EUR 1,1791000000
Gutschrift mit Wert 15.08.2017 32,35 EUR

Using 2.0.7, the command java -jar pdfbox-app-2.0.7.jar ExtractText -sort test.pdf leads to

aeutsche Bank mrivat- und deschäftskunden Ad

Bruttoertrag QPIUM rpa PTINR bro
hapitaäertragsteuer EhbptF - RIPS rpa - QIRR bro
poäidaritätszuschäag auf hbpt - MIOV rpa - MIOR bro
rmrechnungskurs rpa zu bro NINTVNMMMMMM
dutschrift mit tert NRKMUKOMNT POIPR bro

The debugger with java -jar pdfbox-app-2.0.7.jar PDFDebugger test.pdf shows the correct text in Root/Pages/Kids/[1]/Contents/[1] so somehow the text is read correctly but not exported correctly.

I have tried to compare the information shown in the two PDFDebugger applications but they seem rather identical to me (although I don't know where/what to look for exactly). Unfortunately, I cannot share the PDF document.

I would be grateful for any hint on how to solve, or even just approach, this problem, as otherwise I cannot use the newer version of PDFBox. Thanks in advance for your time!

Here is a screenshot of the Font which is used in the document (extracted with 2.0.7). This is exactly the translation of the letters that apparently is not performed:

The entry ToUnicode says

%!PS-Adobe-3.0 Resource-CMap
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /AdHoc-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
68 beginbfchar
<0004> <0021>
<0009> <0026>
<000b> <0028>
<000c> <0029>
<000f> <002c>
<0010> <002d>
<0011> <002e>
<0012> <002f>
<0013> <0030>
<0014> <0031>
<0015> <0032>
<0016> <0033>
<0017> <0034>
<0018> <0035>
<0019> <0036>
<001a> <0037>
<001b> <0038>
<001c> <0039>
<001d> <003a>
<001e> <003b>
<0024> <0041>
<0025> <0042>
<0026> <0043>
<0027> <0044>
<0028> <0045>
<0029> <0046>
<002a> <0047>
<002b> <0048>
<002c> <0049>
<002e> <004b>
<0030> <004d>
<0031> <004e>
<0032> <004f>
<0033> <0050>
<0034> <0051>
<0035> <0052>
<0036> <0053>
<0037> <0054>
<0038> <0055>
<0039> <0056>
<003a> <0057>
<003d> <005a>
<0044> <0061>
<0045> <0062>
<0046> <0063>
<0047> <0064>
<0048> <0065>
<0049> <0066>
<004a> <0067>
<004b> <0068>
<004c> <0069>
<004d> <006a>
<004e> <006b>
<004f> <006c>
<0050> <006d>
<0051> <006e>
<0052> <006f>
<0053> <0070>
<0055> <0072>
<0056> <0073>
<0057> <0074>
<0058> <0075>
<0059> <0076>
<005a> <0077>
<005d> <007a>
<006c> <00e4>
<0081> <00fc>
<0089> <00df>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

The TextView of page 2 of the PDF already shows the correct text, but the replacement table shown above then seems to modify the text incorrectly before PDFBox exports it:

Root/Pages/Kids/[1]/Contents/[1]:
=================================
0 Tw
0 Tc
0 0 0 rg
0 0 0 RG
BT
  /F1 10 Tf
  1 0 0 1 69.449 697.11 Tm
  (Wir) Tj
  1 0 0 1 87.199 697.11 Tm
  (\374berweisen) Tj
  1 0 0 1 141.099 697.11 Tm
  (den) Tj
  1 0 0 1 160.549 697.11 Tm
  (Betrag) Tj
  1 0 0 1 192.759 697.11 Tm
  (von) Tj
  1 0 0 1 211.649 697.11 Tm
  (32,35) Tj
  1 0 0 1 239.429 697.11 Tm
  (EUR) Tj
  1 0 0 1 263.299 697.11 Tm
  (auf) Tj
  1 0 0 1 279.959 697.11 Tm
  (Ihr) Tj
  1 0 0 1 294.389 697.11 Tm
  (Konto) Tj
  1 0 0 1 323.269 697.11 Tm
  (XXXXXXX) Tj
  1 0 0 1 364.959 697.11 Tm
  (XX) Tj
  1 0 0 1 376.079 697.11 Tm
  (.) Tj
  0 G
  0 g
ET
69.449 669.448 m
69.449 669.698 l
549.921 669.698 l
549.921 669.448 l
549.921 669.198 l
69.449 669.198 l
h
f
0 0 0 rg
0 0 0 RG
BT
  /F1 6 Tf
  1 0 0 1 249.022 658.948 Tm
  (Kapitalertr\344ge) Tj
  1 0 0 1 288.016 658.948 Tm
  (sind) Tj
  1 0 0 1 300.682 658.948 Tm
  (einkommensteuerpflichtig!) Tj
  1 0 0 1 213.865 652.783 Tm
  (Diese) Tj
  1 0 0 1 230.863 652.783 Tm
  (Mitteilung) Tj
  1 0 0 1 258.187 652.783 Tm
  (wurde) Tj
  1 0 0 1 276.187 652.783 Tm
  (maschinell) Tj
  1 0 0 1 306.187 652.783 Tm
  (erstellt) Tj
  1 0 0 1 325.507 652.783 Tm
  (und) Tj
  1 0 0 1 337.177 652.783 Tm
  (wird) Tj
  1 0 0 1 349.837 652.783 Tm
  (nicht) Tj
  1 0 0 1 364.165 652.783 Tm
  (unterschrieben.) Tj
  0 G
  0 g
ET
q
  1 0 0 1 504.562 772.646 cm
  1 0 0 1 0 0 cm
  q
    0 Tw
    0 Tc
    45.36 0 0 45.36 0 0 cm
    /I0 Do
  Q
Q
0 0 0 rg
0 0 0 RG
BT
  /F1 10.5 Tf
  1 0 0 1 552.756 23.464 Tm
  (2) Tj
  1 0 0 1 558.594 23.464 Tm
  (/) Tj
  1 0 0 1 561.503 23.464 Tm
  (2) Tj
ET
Q
q
0 0 m
0 841.89 l
595.276 841.89 l
595.276 0 l
h
0 0 m
595.276 0 l
595.276 841.89 l
0 841.89 l
h
W
n
Q

1.8.13 shows:

Wir überweisen den Betrag von 32,35 EUR auf Ihr Konto XXXXXXX XX.
Kapitalerträge sind einkommensteuerpflichtig!
Diese Mitteilung wurde maschinell erstellt und wird nicht unterschrieben.
2/2

2.0.7 shows:

tir überweisen den Betrag von POIPR bro auf fhr honto XXXXXXX XX
hapitaäerträge sind einkommensteuerpfäichtig!
aiese jitteiäung wurde maschineää ersteäät und wird nicht unterschriebenK
O/O

This is the file that you were asking for: https://wetransfer.com/downloads/214674449c23713ee481c5a8f529418320170827201941/b2bea6

Answer 1:

The information about the font in question in your PDF is contradictory and partially broken. Depending on how a given piece of software reacts to that, it may or may not extract the text correctly.


On the one hand the font has an Encoding value WinAnsiEncoding. This is ok and matches what we see in the content stream, a one-byte encoding covering many of the ANSI codes.

On the other hand we have a ToUnicode map which implies that the underlying encoding is some two-byte encoding (it has a code space range <0000> <ffff>), and even if one ignores the two-byte nature, it has mappings which in particular map digit ANSI codes to uppercase letters, uppercase letter ANSI codes to other lowercase letters, and the lowercase 'l' ANSI code to the Unicode value of 'ä'.
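The two-byte claim can be read directly off the codespace range in the ToUnicode stream. The following is only a small sketch using plain Java string handling, not PDFBox; the class and method names are made up for illustration. It derives the declared code length from the `<0000> <FFFF>` line and contrasts it with the one byte per code that a simple font with WinAnsiEncoding uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class CodeSpaceCheck {

    // Matches one codespace range line such as "<0000> <FFFF>".
    private static final Pattern RANGE =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Returns the code length in bytes that a codespace range line declares. */
    static int codeLengthInBytes(String rangeLine) {
        Matcher m = RANGE.matcher(rangeLine);
        if (!m.find()
                || m.group(1).length() != m.group(2).length()
                || m.group(1).length() % 2 != 0) {
            throw new IllegalArgumentException("not a codespace range: " + rangeLine);
        }
        // two hex digits per byte
        return m.group(1).length() / 2;
    }

    public static void main(String[] args) {
        int declared = codeLengthInBytes("<0000> <FFFF>"); // line from the ToUnicode map above
        int used = 1; // a simple font with WinAnsiEncoding uses one byte per character code
        System.out.println("ToUnicode map declares " + declared
                + "-byte codes, content stream uses " + used + "-byte codes");
    }
}
```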

When extracting text, PDFBox 2.0.x seems to follow the broken ToUnicode map where possible (treating the two-byte codes in the table as one-byte codes by ignoring the leading 00 byte), which results in garbage, and otherwise interprets the character code as ANSI, which results in proper text. PDFBox 1.8.x seems to have ignored the ToUnicode map, and so does Adobe Reader.
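To make this concrete, here is a rough simulation in plain Java (no PDFBox involved; the class and method names are invented for this sketch). It takes a handful of bfchar entries copied from the ToUnicode map in the question, applies them to the single-byte codes of the correct text, and falls back to the code itself where the map has no entry. ISO-8859-1 is used as a stand-in for WinAnsiEncoding, which is an assumption that holds for the characters in these samples.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of PDFBox.
final class BrokenToUnicodeDemo {

    // Subset of the bfchar entries from the question, read as one-byte codes
    // (the leading 00 byte dropped): {character code, Unicode value}.
    private static final Map<Integer, Integer> TO_UNICODE = new HashMap<>();
    static {
        int[][] entries = {
            {0x2c, 0x49}, {0x30, 0x4d}, {0x31, 0x4e}, {0x33, 0x50},
            {0x34, 0x51}, {0x35, 0x52}, {0x37, 0x54}, {0x38, 0x55},
            {0x44, 0x61}, {0x45, 0x62}, {0x4b, 0x68}, {0x52, 0x6f},
            {0x53, 0x70}, {0x55, 0x72}, {0x6c, 0xe4}
        };
        for (int[] entry : entries) {
            TO_UNICODE.put(entry[0], entry[1]);
        }
    }

    /** Decodes the text the way the observed 2.0.7 output suggests. */
    static String decode(String correctText) {
        // The content stream stores one byte per character.
        byte[] codes = correctText.getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder out = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF;
            Integer unicode = TO_UNICODE.get(code);
            // Use the ToUnicode entry if there is one, else the plain single-byte code.
            out.append((char) (unicode != null ? unicode : code));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Bruttoertrag QPIUM rpa PTINR bro
        System.out.println(decode("Bruttoertrag 43,80 USD 37,15 EUR"));
    }
}
```

The output matches the garbled line from the question: digits and the comma have entries in the map and turn into uppercase letters, uppercase letters with an entry turn into lowercase ones, and characters such as 'B' or 'r' have no entry and survive unchanged.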


Actually it looks like the ToUnicode map has been made for a font using Identity-H encoding.


If you are confronted with such a PDF and need to extract its text, you can pre-process it and remove the ToUnicode entries; thereafter text extraction should return proper text. E.g.

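import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.text.PDFTextStripper;

// SOURCE is the PDF to read, e.g. a File or an InputStream
PDDocument document = PDDocument.load(SOURCE);

// strip the broken ToUnicode maps before anything reads the fonts
for (int pageNr = 0; pageNr < document.getNumberOfPages(); pageNr++)
{
    PDPage page = document.getPage(pageNr);
    PDResources resources = page.getResources();
    removeToUnicodeMaps(resources);
}

PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
document.close();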

(ExtractText test method testNoToUnicodeTest2)

using helper methods

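import java.io.IOException;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;

// Removes the ToUnicode entry from every font in the given resources and,
// recursively, from the fonts used by form XObjects referenced there.
void removeToUnicodeMaps(PDResources pdResources) throws IOException
{
    // a page or form XObject may have no resources of its own
    if (pdResources == null)
        return;

    COSDictionary resources = pdResources.getCOSObject();

    COSDictionary fonts = asDictionary(resources, COSName.FONT);
    if (fonts != null)
    {
        for (COSBase object : fonts.getValues())
        {
            // resolve indirect references to the actual font dictionary
            while (object instanceof COSObject)
                object = ((COSObject)object).getObject();
            if (object instanceof COSDictionary)
            {
                COSDictionary font = (COSDictionary)object;
                font.removeItem(COSName.TO_UNICODE);
            }
        }
    }

    // form XObjects carry their own resources, so descend into them
    for (COSName name : pdResources.getXObjectNames())
    {
        PDXObject xobject = pdResources.getXObject(name);
        if (xobject instanceof PDFormXObject)
        {
            PDResources xobjectPdResources = ((PDFormXObject)xobject).getResources();
            removeToUnicodeMaps(xobjectPdResources);
        }
    }
}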

COSDictionary asDictionary(COSDictionary dictionary, COSName name)
{
    COSBase object = dictionary.getDictionaryObject(name);
    return object instanceof COSDictionary ? (COSDictionary) object : null;
}

(from ExtractText)

You should execute this pre-processing as early as possible after loading the document, to prevent the fonts with the wrong ToUnicode mappings from being read into the document font cache.



Tags: java pdf pdfbox