Extract Stream-Dump from PDF-Body with PDFBox

Published 2019-09-06 08:45

Question:

I want to extract a stream dump from a PDF with PDFBox. Is this possible?

I want to get the raw content stream of a PDF (the original operator code), like this:

BT /F19 8.9664 Tf 96.197 606.119 Td [(Kommunikation)]TJ
ET
q
1 0 0 1 85.238 594.35 cm
[]0 d 0 J 0.398 w 0 0 m 0 7.352 l S
Q
BT
/F19 8.9664 Tf 133.856 595.758 Td [(Erster)-600(Testuebertrag)-600(auf)-600(die)-600(Neuentwicklung)-600(fuer)-600(die)-600(PSA)-600(Direktbank)-600(ma)]TJ
ET
q
1 0 0 1 85.238 583.989 cm
[]0 d 0 J 0.398 w 0 0 m 0 7.352 l S
Q
BT
/F19 8.9664 Tf 133.856 585.397 Td [(l)-600(mit)-600(sehr)-600(langen)-600(Verwendungszweck)-600(gleich)-600(zum)-600(testen)-600(wann)-600(dieser)-600(cuted)]TJ
ET

Thanks.

Answer 1:

For one-off use, open the file in PDFDebugger and look for the "Contents" entry.

For repeated use, this code dumps the content stream of the first page:

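import java.io.File;
import java.io.InputStream;
import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 2.x API; getPage() takes a zero-based page index.
try (PDDocument doc = PDDocument.load(new File("XXX.pdf"));
        InputStream contents = doc.getPage(0).getContents())
{
    IOUtils.copy(contents, System.out);
}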

Note that this dumps only the page content stream. Other content streams may exist in form XObjects, patterns, soft masks, and annotation appearance streams. PDF is quite complex.
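To go one step beyond the first page, the following is a minimal sketch that walks every page and also dumps the form XObjects referenced directly from each page's resources. The class name ContentStreamDumper and the method dumpAll are hypothetical helpers, not part of PDFBox. The sketch does not follow nested forms, patterns, soft masks, or annotation appearance streams, so it is still not a complete dump.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;

// Hypothetical helper class for illustration; not part of PDFBox.
public class ContentStreamDumper {

    // Writes the decoded content stream of every page to 'out', followed by
    // the streams of form XObjects listed in that page's resources.
    public static void dumpAll(PDDocument doc, OutputStream out) throws IOException {
        int pageNo = 0;
        for (PDPage page : doc.getPages()) {
            pageNo++;
            writeHeader(out, "% page " + pageNo);
            try (InputStream contents = page.getContents()) {
                IOUtils.copy(contents, out);
            }

            PDResources resources = page.getResources();
            if (resources == null) {
                continue; // page has no resource dictionary
            }
            for (COSName name : resources.getXObjectNames()) {
                PDXObject xobject = resources.getXObject(name);
                // Image XObjects carry no content stream; only forms do.
                if (xobject instanceof PDFormXObject) {
                    writeHeader(out, "% page " + pageNo + ", form XObject /" + name.getName());
                    try (InputStream form = ((PDFormXObject) xobject).getContents()) {
                        IOUtils.copy(form, out);
                    }
                }
            }
        }
    }

    // Separator line; '%' starts a comment in PDF content stream syntax.
    private static void writeHeader(OutputStream out, String text) throws IOException {
        out.write(("\n" + text + "\n").getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

With a document opened as in the answer above, ContentStreamDumper.dumpAll(doc, System.out) would print each stream under its own "% page N" header.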