I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text that I am extracting. However, I cannot seem to find any way of getting that information.
Is it possible at all to use PDFBox to get the colour information of a document and if so, how would I go about doing so?
Many thanks.
All color information should be stored in the class PDGraphicsState, and which color is used (stroking, nonstroking, etc.) depends on the text rendering mode (via the pdfbox mailing list).
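To make the dependence on the text rendering mode concrete, here is a minimal sketch of which color applies for each mode. The mode numbers 0 to 7 come from the PDF specification (the Tr operator); the class TextColorSource and its Paint enum are hypothetical names made up for this illustration and are not part of PDFBox.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper for illustration only; not a PDFBox class.
class TextColorSource {
    enum Paint { NON_STROKING, STROKING }

    // Returns which color(s) paint the glyphs for a PDF text rendering mode (0-7).
    static Set<Paint> paintsFor(int renderingMode) {
        switch (renderingMode) {
            case 0: case 4: // fill (mode 4 also adds to the clipping path)
                return EnumSet.of(Paint.NON_STROKING);
            case 1: case 5: // stroke (mode 5 also adds to the clipping path)
                return EnumSet.of(Paint.STROKING);
            case 2: case 6: // fill, then stroke
                return EnumSet.allOf(Paint.class);
            case 3: case 7: // invisible, or clipping only: nothing is painted
                return EnumSet.noneOf(Paint.class);
            default:
                throw new IllegalArgumentException("Unknown text rendering mode: " + renderingMode);
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 7; mode++) {
            System.out.println(mode + " -> " + paintsFor(mode));
        }
    }
}
```

Most ordinary text uses mode 0, so the nonstroking color is usually the one of interest.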
Here is a small sample I tried:
After creating a PDF with just one line ("Sample", written in RGB=[146,208,80]), the following program will output:
DeviceRGB
146.115
208.08
80.07
Here's the code:
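import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDGraphicsState;
import org.apache.pdfbox.util.PDFStreamEngine;
import org.apache.pdfbox.util.ResourceLoader;

PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    // Use the same operator-to-class mappings that PageDrawer uses.
    PDFStreamEngine engine = new PDFStreamEngine(
            ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
    // Run the engine over the first page's content stream.
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    // The graphics state now reflects the last operators processed.
    PDGraphicsState graphicState = engine.getGraphicsState();
    // Note: for ordinary filled text, getNonStrokingColor() may be the relevant color instead.
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        // Components are in the range 0..1; scale to 0..255.
        System.out.println(c * 255);
    }
} finally {
    if (doc != null) {
        doc.close();
    }
}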
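The printed values (146.115, 208.08, 80.07) are not whole numbers because the PDF stores each component as a decimal between 0 and 1. As a small sketch, the components can be rounded back to 8-bit values like this. The input values 0.573, 0.816 and 0.314 are inferred by dividing the output above by 255, the class ColorComponents is a hypothetical helper rather than a PDFBox class, and the code assumes a three-component DeviceRGB color.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not a PDFBox class.
class ColorComponents {
    // Scales components in [0, 1] to integers in [0, 255], rounding and clamping.
    static int[] toRgb255(float[] components) {
        int[] rgb = new int[components.length];
        for (int i = 0; i < components.length; i++) {
            int v = Math.round(components[i] * 255f);
            rgb[i] = Math.max(0, Math.min(255, v));
        }
        return rgb;
    }

    // Formats a three-component RGB value as a hex string such as #92D050.
    static String toHex(int[] rgb) {
        return String.format("#%02X%02X%02X", rgb[0], rgb[1], rgb[2]);
    }

    public static void main(String[] args) {
        // Assumed component values, inferred from the sample output.
        float[] components = {0.573f, 0.816f, 0.314f};
        int[] rgb = toRgb255(components);
        System.out.println(Arrays.toString(rgb)); // prints [146, 208, 80]
        System.out.println(toHex(rgb));           // prints #92D050
    }
}
```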
Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.
As I understand it, as PDFStreamEngine processes a page stream, it updates its state depending on which operators it encounters. So when it hits green text, it changes the PDGraphicsState because it encounters the corresponding color operators. For CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace, as defined by the mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor, and so on.
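The operator-to-class mapping is an ordinary Java properties lookup. The following sketch uses only java.util.Properties to show the idea with the two mappings quoted above. The class OperatorTable is a hypothetical name, and this is not how PDFBox itself loads or instantiates the handlers; it only demonstrates the lookup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustration only: mimics the operator=class lookup, not PDFBox's own loader.
class OperatorTable {
    // Parses "operator=handler class" lines into a Properties table.
    static Properties parse(String text) {
        Properties table = new Properties();
        try {
            table.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    public static void main(String[] args) {
        String mappings =
                "CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace\n"
              + "RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor\n";
        Properties table = parse(mappings);
        // Operator names are case-sensitive: RG and rg are different operators.
        System.out.println(table.getProperty("RG"));
        System.out.println(table.getProperty("rg", "(no handler mapped)"));
    }
}
```

An operator with no entry in the table simply has no handler here, which is the gap a custom .properties file would fill.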
In this case, the PDGraphicsState doesn't change again once the color is set, because the document contains only text, all in one style. For something more advanced, you would need to extend PDFStreamEngine (just like PageDrawer, PDFTextStripper and other classes do) to do something when the color changes. You could also write your own mappings in your own .properties file.
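To show what "doing something when the color changes" means, here is a toy model that runs without PDFBox. It walks a simplified list of operators, remembers the current nonstroking color whenever it sees rg, and records that color each time text is shown with Tj. The operator names rg and Tj are real PDF operators, but ColorTrackingSketch and Op are hypothetical names. In a real PDFStreamEngine subclass the current color would come from getGraphicsState() rather than from a local variable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model for illustration only; not PDFBox and not a real content stream parser.
class ColorTrackingSketch {
    // One simplified instruction: an operator name plus its operands.
    static final class Op {
        final String name;
        final String[] operands;

        Op(String name, String... operands) {
            this.name = name;
            this.operands = operands;
        }
    }

    // Returns one entry per shown string, tagged with the color in effect at that point.
    static List<String> colouredText(List<Op> ops) {
        float[] nonStroking = {0f, 0f, 0f}; // assumed initial color: black
        List<String> out = new ArrayList<>();
        for (Op op : ops) {
            if (op.name.equals("rg")) {
                // rg sets the nonstroking RGB color from three operands.
                for (int i = 0; i < 3; i++) {
                    nonStroking[i] = Float.parseFloat(op.operands[i]);
                }
            } else if (op.name.equals("Tj")) {
                // Tj shows a text string; capture the color active right now.
                out.add(op.operands[0] + " " + Arrays.toString(nonStroking));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Op> ops = Arrays.asList(
                new Op("rg", "0.573", "0.816", "0.314"),
                new Op("Tj", "Sample"),
                new Op("rg", "1", "0", "0"),
                new Op("Tj", "Warning"));
        for (String line : colouredText(ops)) {
            System.out.println(line);
        }
    }
}
```

Capturing the color at the moment the text is shown, instead of once at the end of the page, is what lets a document with several text colors be handled.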