Getting Text Colour with PDFBox

2019-01-26 02:09发布

问题:

I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text itself that I am extracting. However I cannot seem to find any way of getting that information.

Is it possible at all to use PDFBox to get the colour information of a document and if so, how would I go about doing so?

Many thanks.

回答1:

All color informations should be stored in the class PDGraphicsState and the used color (stroking/nonstroking etc.) depends on the used text rendering mode (via pdfbox mailing list).

Here is a small sample I tried:

After creating a pdf with just one line ("Sample" written in RGB=[146,208,80]), the following program will output:

DeviceRGB 146.115 208.08 80.07

Here's the code:

PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    PDGraphicsState graphicState = engine.getGraphicsState();
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float colorSpaceValues[] = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        System.out.println(c * 255);
    }
}
finally {
    if (doc != null) {
        doc.close();
    }

Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.

As I understand it, as PDFStreamEngine processes a page stream, it sets various variable states depending on what operators it is processing at the moment. So when it hits green text, it will change the PDGraphicsState because it will encounter appropriate operators. So for CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace as defined by mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor and so on.

In this case, the PDGraphicsState hasn't changed because the document has just text and the text it has is in just one style. For something more advanced, you would need to extend PDFStreamEngine (just like PageDrawer, PDFTextStripper and other classes do) to do something when color changes. You could also write your own mappings in your own .properties file.



标签: java pdfbox