Traverse whole PDF and change blue color to black

Posted 2020-05-03 12:59

Question:

I want to change the color of text from blue to black and also remove the underline, but only for text that contains "http://" or "https://".

Reference links:

Traverse whole PDF and change blue color to black ( Change color of underlines as well) + iText

Traverse whole PDF and Remove underlines of hyperlinks (annotations) only + iText

Answer 1:

Presenting the complete code of a solution for this task would be beyond the scope of a Stack Overflow answer. Thus, I'll merely outline one approach to implementing a solution here.

Hindrances

The task is more difficult than one might be aware of.

In particular, the text of a link is not necessarily drawn using a few consecutive text showing operations (let alone a single one). In the worst case each letter of the link could be drawn in a separate instruction, with all these instructions spread in random order all over the content stream and operations drawing non-link content in between.

Thus, you cannot look at each content stream instruction by itself and decide immediately what to do with it, as was possible in the previous approaches you referenced in your question. Instead you'll have to collect all text and line drawing instructions with their context, sort them into on-page order, find URL texts and nearby lines therein, manipulate the underlying instructions, and then write out the page content.

Furthermore, the recognition of "blue" in the referenced answers does not catch every shade of blue: only RGB color space blues are considered there, but a blue tint can be produced in other color spaces, too. The text may also be drawn in a different color initially and only appear blue because of some overlay. Moreover, these color spaces do not necessarily contain a black tint. Thus, the manipulation of the underlying instructions for a general solution is more difficult than simply changing the color value before the recognized link text pieces and lines.
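To illustrate why an RGB-only test is too narrow, here is a minimal sketch that normalizes color operands to approximate RGB before testing for blueness. The class name `BlueCheck`, the guess of the color space from the number of components, the naive CMYK conversion and the `margin` threshold are all assumptions made for illustration. ICC-based, Separation, Indexed, Lab or Pattern color spaces are not handled, and neither are overlays.

```java
// Hypothetical helper for illustration; not part of iText.
final class BlueCheck {

    // Converts color operands to approximate RGB, guessing the color space from
    // the component count: 1 = DeviceGray, 3 = DeviceRGB, 4 = DeviceCMYK.
    // The CMYK conversion is the naive one and ignores any ICC profile.
    static float[] toRgb(float[] c) {
        switch (c.length) {
            case 1:
                return new float[] {c[0], c[0], c[0]};
            case 3:
                return c.clone();
            case 4: {
                float k = c[3];
                return new float[] {
                    (1 - c[0]) * (1 - k),
                    (1 - c[1]) * (1 - k),
                    (1 - c[2]) * (1 - k)
                };
            }
            default:
                throw new IllegalArgumentException("Unsupported component count: " + c.length);
        }
    }

    // Assumed heuristic: a color counts as blue if its blue component exceeds
    // both red and green by at least `margin`.
    static boolean isBlue(float[] components, float margin) {
        float[] rgb = toRgb(components);
        return rgb[2] - Math.max(rgb[0], rgb[1]) >= margin;
    }

    public static void main(String[] args) {
        System.out.println(isBlue(new float[] {0f, 0f, 1f}, 0.3f));     // DeviceRGB operands
        System.out.println(isBlue(new float[] {1f, 1f, 0f, 0f}, 0.3f)); // DeviceCMYK operands
        System.out.println(isBlue(new float[] {0f}, 0.3f));             // DeviceGray operand
    }
}
```

In a real implementation the color space should be taken from the graphics state rather than guessed from the operand count, since several color spaces share the same number of components.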

An implementation approach

A solution taking those hindrances into account can still be built on the PdfCanvasEditor used in the two referenced answers, which borrowed it from an earlier answer. In contrast to the solutions there, though, the instructions must be collected in the write method together with relevant information about the state at the time of their execution: in particular the text and text position for text drawing instructions, the line position for line drawing instructions, and the color.
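The idea of recording each instruction together with the state in effect when it executes can be sketched without iText. In the following, instructions are plain strings and only the non-stroking color is tracked. The names `InstructionRecorder` and `Recorded` are hypothetical, and the sketch deliberately ignores the graphics state stack (`q`/`Q`), the `cs`/`sc`/`scn` operators, the `'` and `"` text operators and text positions, all of which a real implementation inside the write method would have to take from the processor's graphics state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; instructions are modeled as strings, not iText objects.
final class InstructionRecorder {

    // A text showing instruction plus the context needed to manipulate it later.
    static final class Recorded {
        final int index;           // position in the content stream
        final String instruction;  // the original instruction
        final String fillColor;    // color instruction in effect at execution time

        Recorded(int index, String instruction, String fillColor) {
            this.index = index;
            this.instruction = instruction;
            this.fillColor = fillColor;
        }
    }

    static List<Recorded> record(List<String> instructions) {
        List<Recorded> out = new ArrayList<>();
        String fillColor = "0 g"; // initial non-stroking color: DeviceGray black
        for (int i = 0; i < instructions.size(); i++) {
            String ins = instructions.get(i);
            if (ins.endsWith(" g") || ins.endsWith(" rg") || ins.endsWith(" k")) {
                fillColor = ins; // gray, RGB or CMYK non-stroking color change
            } else if (ins.endsWith(" Tj") || ins.endsWith(" TJ")) {
                out.add(new Recorded(i, ins, fillColor));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> stream = List.of("0 0 1 rg", "(Hello) Tj", "0 g", "(World) Tj");
        for (Recorded r : record(stream)) {
            System.out.println(r.index + ": " + r.instruction + " drawn with " + r.fillColor);
        }
    }
}
```

Keeping the color instruction in effect alongside each recorded instruction is what later makes it possible to switch back to the previous color after a change to black.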

The iText LocationTextExtractionStrategy already does that, merely without keeping track of the original instructions. Thus, you can borrow code from that strategy or even integrate it (instead of the dummy render listener the PdfCanvasEditor uses by default) and merely have to reference the corresponding instructions from the text chunks processed by the strategy class.

When all the instructions of the page have been collected with that extra information, you have to sort the text. The LocationTextExtractionStrategy also contains code to sort the text chunks accordingly, which you can now use for your task.
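As a rough picture of that sorting step, the following sketch orders chunks top to bottom and then left to right while each chunk keeps a reference to its originating instruction. It is a simplified stand-in, not the algorithm of LocationTextExtractionStrategy (which also accounts for text orientation). The `ChunkSorter` and `Chunk` names, the fields and the rounding of baselines to whole units are assumptions for illustration, and only horizontal text is considered.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical types for illustration; these are not iText classes.
final class ChunkSorter {

    static final class Chunk {
        final String text;
        final float x;              // start of the baseline
        final float y;              // baseline height (the PDF y axis points up)
        final int instructionIndex; // originating instruction in the content stream

        Chunk(String text, float x, float y, int instructionIndex) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.instructionIndex = instructionIndex;
        }
    }

    // Sorts by descending baseline (top of page first), then by ascending x.
    // Baselines are rounded so that chunks on the same line compare equal on y.
    static List<Chunk> sortInReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> -Math.round(c.y))
                .thenComparingDouble(c -> c.x));
        return sorted;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
                new Chunk("ple.com", 150f, 700f, 7),
                new Chunk("Visit ", 50f, 700f, 2),
                new Chunk("Next line", 50f, 680f, 3),
                new Chunk("https://exam", 90f, 700f, 5));
        for (Chunk c : sortInReadingOrder(chunks)) {
            System.out.println(c.instructionIndex + " -> " + c.text);
        }
    }
}
```

Note how the content stream order (the instruction indices) and the on-page order differ; the sorted chunks are only used for finding text, while the instruction indices lead back to what has to be changed.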

In those sorted text chunks you can now look for link texts. Having found them, you can visit all the text drawing instructions associated with those chunks and all the line drawing instructions with positions right under those chunks, check their color for blueness, and (if blue) enclose them in a bracket of "change to black color" and "change back to the previous color" instructions.
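A minimal sketch of these two steps, under stated assumptions: the chunk texts are already in on-page order and already contain any spaces between words, the URL pattern is a deliberately crude one, and instructions are modeled as plain strings. `LinkRecolor` and its methods are hypothetical names. The bracket uses the PDF operator `0 g` (non-stroking DeviceGray black) and then re-emits the previously active color instruction; it covers text showing instructions only, and underlines drawn as stroked or filled paths need separate treatment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of iText.
final class LinkRecolor {

    // Assumed, deliberately simple URL pattern.
    static final Pattern URL = Pattern.compile("https?://\\S+");

    // Returns the indices of all chunks that overlap a URL match in the
    // concatenation of the sorted chunk texts. A URL may span several chunks.
    static List<Integer> chunksInUrls(List<String> sortedChunkTexts) {
        StringBuilder all = new StringBuilder();
        int[] start = new int[sortedChunkTexts.size()];
        for (int i = 0; i < start.length; i++) {
            start[i] = all.length();
            all.append(sortedChunkTexts.get(i));
        }
        List<Integer> hits = new ArrayList<>();
        Matcher m = URL.matcher(all);
        while (m.find()) {
            for (int i = 0; i < start.length; i++) {
                int end = start[i] + sortedChunkTexts.get(i).length();
                if (start[i] < m.end() && end > m.start()) {
                    hits.add(i);
                }
            }
        }
        return hits;
    }

    // Encloses a text showing instruction in "set black" and "restore previous
    // color" instructions. previousFillColor is the color instruction that was
    // in effect for the original instruction, e.g. "0 0 1 rg".
    static List<String> bracketInBlack(String instruction, String previousFillColor) {
        return List.of("0 g", instruction, previousFillColor);
    }

    public static void main(String[] args) {
        List<String> texts = List.of("Visit ", "https://exam", "ple.com", " today");
        System.out.println(chunksInUrls(texts));
        System.out.println(bracketInBlack("(https://exam) Tj", "0 0 1 rg"));
    }
}
```

The previous color is restored explicitly instead of using `q` and `Q`, because the save and restore operators are not allowed inside a `BT` ... `ET` text object, whereas color operators are.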

To also recognize wilder ways of creating blue text, you have to improve your analysis of the instructions even more. For example, if an area including some text is later filled in blue using blend mode Lighten, originally black-on-white text suddenly becomes blue-on-white.

A possible generalization

This approach actually would give rise to a more generic PDF text manipulator if you somehow exposed the sorted text chunks and created a more flexible interface with methods for a number of changes to apply to the underlying instructions.

As a solid implementation of the approach above will take quite a number of weeks anyway, you may want to consider such a more generic architecture for possible later re-use and sharing.



Tags: pdf itext