I'm trying to use the iText pdfSweep RegexBasedCleanupStrategy to redact some words from a PDF. However, I only want to redact a word where it stands alone, not where it appears inside another word.
E.g. I want to redact "al" as a single word, but I don't want to redact the "al" in "mineral".
So I added word boundaries ("\b") to the regex passed as a parameter to RegexBasedCleanupStrategy:
new RegexBasedCleanupStrategy("\\bal\\b")
However, pdfAutoSweep.cleanUp does not work if the word is at the end of a line.
In short
The cause of this issue is that the routine that flattens the extracted text chunks into a single String for applying the regular expression does not insert any indicator for a line break. Thus, in that String the last letter of one line is immediately followed by the first letter of the next, which hides the word boundary. One can fix the behavior by adding an appropriate character to the String in case of a line break.
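The effect can be reproduced with plain java.util.regex, without iText. The following is a minimal sketch; the class name WordBoundaryDemo and the sample strings are made up for illustration, with "et al" assumed to end one line and "was found" to start the next:

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Returns true if the regex matches somewhere in the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String regex = "\\bal\\b";

        // Lines glued together without any separator: "al" runs into "was",
        // so there is no word boundary after "al".
        System.out.println(found(regex, "et alwas found"));   // false

        // A newline between the lines restores the word boundary.
        System.out.println(found(regex, "et al\nwas found")); // true

        // "al" inside "mineral" is still not matched, even at a line end.
        System.out.println(found(regex, "mineral\nwater"));   // false
    }
}
```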
The problematic code
The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. In case of a merely horizontal gap this routine inserts a space character, but in case of a vertical offset, i.e. a line break, it adds nothing extra to the StringBuilder in which the String representation is generated:
    if (chunk.sameLine(lastChunk)) {
        // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
        if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
            sb.append(' ');
        }
        indexMap.put(sb.length(), i);
        sb.append(chunk.getText());
    } else {
        indexMap.put(sb.length(), i);
        sb.append(chunk.getText());
    }
A possible fix
One can extend the code above to insert a newline character in case of a line break:
    if (chunk.sameLine(lastChunk)) {
        // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
        if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
            sb.append(' ');
        }
        indexMap.put(sb.length(), i);
        sb.append(chunk.getText());
    } else {
        // line break: insert a newline so that a word boundary can be recognized
        sb.append('\n');
        indexMap.put(sb.length(), i);
        sb.append(chunk.getText());
    }
This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression in question. Thus, enabling it to properly allow recognition of word boundaries should not break anything but indeed should be considered a fix.
One might merely consider adding a different character for a line break, e.g. a plain space ' ', if one does not want to treat vertical gaps any differently from horizontal ones. For a general fix one might, therefore, consider making this character a settable property of the strategy.
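To illustrate the idea of a settable line break character, here is a simplified, iText-independent sketch. The class FlattenSketch, its Chunk type, and the lineBreak parameter are hypothetical and not part of the iText API; the sketch also leaves out the index map and the horizontal-gap handling of the real mapString method and only shows how the separator choice affects the flattened String:

```java
import java.util.Arrays;
import java.util.List;

public class FlattenSketch {
    // Hypothetical stand-in for a text chunk: its text and the index of its line.
    public static final class Chunk {
        final String text;
        final int line;

        public Chunk(String text, int line) {
            this.text = text;
            this.line = line;
        }
    }

    // Concatenates the chunk texts, inserting lineBreak whenever the line changes.
    public static String flatten(List<Chunk> chunks, String lineBreak) {
        StringBuilder sb = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null && chunk.line != last.line) {
                sb.append(lineBreak);
            }
            sb.append(chunk.text);
            last = chunk;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(new Chunk("et al", 0), new Chunk("was found", 1));

        // Empty separator: mimics the original behavior, the lines run together.
        System.out.println(flatten(chunks, "").equals("et alwas found"));     // true

        // Newline or space as separator: the word boundary after "al" is kept.
        System.out.println(flatten(chunks, "\n").equals("et al\nwas found")); // true
        System.out.println(flatten(chunks, " ").equals("et al was found"));   // true
    }
}
```

With a space as separator, a regular expression that spans a horizontal gap, such as "al was", would also match across a line break, while a newline keeps the two cases distinguishable.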
Versions
I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.