Traverse whole PDF and change some attribute with

Posted 2020-05-09 12:13

I'm working on a filter program that turns every black text block in a PDF file gray. I have gone through com.itextpdf.text.pdf.parser but couldn't find anything suitable for this.

PS: I'm using iTextSharp 5.5.10, for which I can't find proper documentation. The iText 5 documentation seems to apply most of the time, but there are still differences. Is there any documentation specifically for iTextSharp?

Tags: pdf itext
2 answers
叛逆
#2 · 2020-05-09 12:31

You can change colors with the following approach. I use the code below to change hyperlink colors: it replaces pure blue RGB fill and stroke colors with black.

PdfCanvasEditor editor = new PdfCanvasEditor() {
    @Override
    protected void write(PdfCanvasProcessor processor, PdfLiteral operator, List<PdfObject> operands)
    {
        String operatorString = operator.toString();

        // Fill color: replace pure blue ("0 0 1 rg") with black ("0 g").
        // The operands list carries the operator itself as its last element, hence size 4.
        if (SET_FILL_RGB.equals(operatorString) && operands.size() == 4) {
            if (isApproximatelyEqual(operands.get(0), 0) &&
                    isApproximatelyEqual(operands.get(1), 0) &&
                    isApproximatelyEqual(operands.get(2), 1)) {
                super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
                return;
            }
        }

        // Stroke color: replace pure blue ("0 0 1 RG") with black ("0 G").
        if (SET_STROKE_RGB.equals(operatorString) && operands.size() == 4) {
            if (isApproximatelyEqual(operands.get(0), 0) &&
                    isApproximatelyEqual(operands.get(1), 0) &&
                    isApproximatelyEqual(operands.get(2), 1)) {
                super.write(processor, new PdfLiteral("G"), Arrays.asList(new PdfNumber(0), new PdfLiteral("G")));
                return;
            }
        }

        // Every other instruction is written back unchanged.
        super.write(processor, operator, operands);
    }

    boolean isApproximatelyEqual(PdfObject number, float reference) {
        return number instanceof PdfNumber && Math.abs(reference - ((PdfNumber)number).floatValue()) < 0.01f;
    }

    final String SET_FILL_RGB = "rg";
    final String SET_STROKE_RGB = "RG";
};

// Apply the editor to every page of the document.
for (int i = 1; i <= pdfDocument.getNumberOfPages(); i++) {
    editor.editPage(pdfDocument, i);
}
爷的心禁止访问
#3 · 2020-05-09 12:51

The OP clarified his question in a comment:

I'm wondering how to write a parser like PdfTextExtractor or something else. I was expecting something like BaseParser or so but found nothing. So I lost my way.

If you are looking for something like an editing framework, you can use the PdfContentStreamEditor presented in this answer.

Based on the PdfContentStreamEditor you can edit the content stream of the PDF pages like this:

PdfReader pdfReader = new PdfReader(resource);
PdfStamper pdfStamper = new PdfStamper(pdfReader, result);
PdfContentStreamEditor editor = new PdfContentStreamEditor()
{
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException
    {
        String operatorString = operator.toString();

        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
        {
            if (currentlyReplacedBlack == null)
            {
                BaseColor currentFillColor = gs().getFillColor();
                if (BaseColor.BLACK.equals(currentFillColor))
                {
                    currentlyReplacedBlack = currentFillColor;
                    super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(1), new PdfNumber(0), new PdfLiteral("rg")));
                }
            }
        }
        else if (currentlyReplacedBlack != null)
        {
            if (currentlyReplacedBlack instanceof CMYKColor)
            {
                super.write(processor, new PdfLiteral("k"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfNumber(1), new PdfLiteral("k")));
            }
            else if (currentlyReplacedBlack instanceof GrayColor)
            {
                super.write(processor, new PdfLiteral("g"), Arrays.asList(new PdfNumber(0), new PdfLiteral("g")));
            }
            else
            {
                super.write(processor, new PdfLiteral("rg"), Arrays.asList(new PdfNumber(0), new PdfNumber(0), new PdfNumber(0), new PdfLiteral("rg")));
            }
            currentlyReplacedBlack = null;
        }

        super.write(processor, operator, operands);
    }

    BaseColor currentlyReplacedBlack = null;

    final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
};

for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
    editor.editPage(pdfStamper, i);
}

pdfStamper.close();

(ChangeTextColor.java test testChangeBlackTextToGreenDocument)

In PdfContentStreamEditor, the method write is called for each instruction in the content stream and writes it back. By overriding this method and forwarding changed or additional instructions to the superclass write, one can edit the stream.

This implementation shows how to change the color of text of a given color. In this case, black text is changed to green.
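
The write-per-instruction pattern can be illustrated without iText at all. The following is a minimal sketch, not iText's API: the class names ContentStreamEditSketch, Editor and BlackToGray are hypothetical, the tokenizer is a naive whitespace split that cannot handle strings with spaces, arrays or inline images, and (unlike the editor above) the operands list does not include the operator. It replaces a black DeviceGray fill color with mid gray, which is closer to what the question asks for than green.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of an editor that calls write() once per content stream instruction. */
class ContentStreamEditSketch {

    /** Walks the instructions and hands each one to write(); subclasses override write(). */
    static class Editor {
        private final StringBuilder out = new StringBuilder();

        final String edit(String content) {
            out.setLength(0);
            List<String> operands = new ArrayList<>();
            for (String token : content.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // happens only for empty input
                }
                if (isOperator(token)) {
                    write(token, operands);       // one call per instruction
                    operands = new ArrayList<>();
                } else {
                    operands.add(token);
                }
            }
            return out.toString().trim();
        }

        /** Default behaviour: copy the instruction through unchanged, one per line. */
        protected void write(String operator, List<String> operands) {
            for (String operand : operands) {
                out.append(operand).append(' ');
            }
            out.append(operator).append('\n');
        }

        /** In this toy model, operands are numbers, names (/F1) or simple strings ((Hi)). */
        private static boolean isOperator(String token) {
            char c = token.charAt(0);
            return !(Character.isDigit(c) || c == '-' || c == '.' || c == '/' || c == '(');
        }
    }

    /** Replaces the instruction "0 g" (black fill, DeviceGray) with "0.5 g" (mid gray). */
    static class BlackToGray extends Editor {
        @Override
        protected void write(String operator, List<String> operands) {
            if ("g".equals(operator) && operands.size() == 1
                    && Double.parseDouble(operands.get(0)) == 0.0) {
                super.write("g", Collections.singletonList("0.5"));
                return;
            }
            super.write(operator, operands);
        }
    }

    public static void main(String[] args) {
        String page = "BT /F1 12 Tf 0 g (Hello) Tj ET";
        System.out.println(new BlackToGray().edit(page));
    }
}
```

In the real editor the same idea applies with PdfLiteral and PdfObject operands in place of plain strings, and the current graphics state is consulted rather than only the color-setting operator.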

Beware, this is merely a proof of concept, not a final and complete solution. In particular:

  • Text is considered to be black if for its color the expression BaseColor.BLACK.equals(color) is true; as equality among BaseColor and its descendant classes is not completely well-defined, this might lead to some false positives.
  • PdfContentStreamEditor only inspects and edits the content stream of the page itself, not the content streams of displayed form xobjects or patterns; thus, some text may not be found.

Improving the class to properly detect black color and to recursively traverse and edit the content streams of used patterns and xobjects remains as an exercise for the reader.
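
As a possible starting point for the first part of that exercise, black can be detected from the color-setting operator and its numeric operands instead of relying on BaseColor equality. This is a sketch under assumptions: the class BlackDetectionSketch, the method isBlack and the 0.01 tolerance are made up for illustration, only the device color operators g/G, rg/RG and k/K are covered, and CMYK "rich black" mixes as well as colors set via sc/scn in other color spaces are not recognized.

```java
import java.util.Arrays;
import java.util.List;

/** Decides whether a device color instruction sets (approximately) black. */
class BlackDetectionSketch {

    // Assumed tolerance for comparing color components; tune as needed.
    private static final double TOLERANCE = 0.01;

    static boolean isBlack(String operator, List<Double> operands) {
        switch (operator) {
            case "g":   // DeviceGray fill
            case "G":   // DeviceGray stroke: 0 means black
                return operands.size() == 1 && near(operands.get(0), 0);
            case "rg":  // DeviceRGB fill
            case "RG":  // DeviceRGB stroke: all components 0 means black
                return operands.size() == 3
                        && near(operands.get(0), 0)
                        && near(operands.get(1), 0)
                        && near(operands.get(2), 0);
            case "k":   // DeviceCMYK fill
            case "K":   // DeviceCMYK stroke: only pure K black (0 0 0 1) is matched here
                return operands.size() == 4
                        && near(operands.get(0), 0)
                        && near(operands.get(1), 0)
                        && near(operands.get(2), 0)
                        && near(operands.get(3), 1);
            default:    // not a device color operator
                return false;
        }
    }

    private static boolean near(double value, double reference) {
        return Math.abs(value - reference) < TOLERANCE;
    }

    public static void main(String[] args) {
        System.out.println(isBlack("g", Arrays.asList(0.0)));
        System.out.println(isBlack("rg", Arrays.asList(0.0, 0.0, 1.0)));
        System.out.println(isBlack("k", Arrays.asList(0.0, 0.0, 0.0, 1.0)));
    }
}
```

An editor could remember the result of the most recent fill color instruction and consult it when a text-showing operator arrives; traversing form xobjects and patterns is a separate problem that this sketch does not address.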
