I'm trying to replace a piece of text in the content of a PDF, but the letter 'X' is not being replaced.
public static void main(String[] args) {
    String DEST = "/home/diego/Documentos/teste.pdf";
    try {
        PdfReader reader = new PdfReader("termoAdesaoCartao.pdf");
        PdfDictionary dictionary = reader.getPageN(1);
        PdfObject object = dictionary.getDirectObject(PdfName.CONTENTS);
        if (object instanceof PRStream) {
            PRStream stream = (PRStream)object;
            byte[] data = PdfReader.getStreamBytes(stream);
            stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());
        }
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
        stamper.close();
        reader.close();
    } catch (IOException | DocumentException e) {
        e.printStackTrace();
    }
}
In general
Basically, the OP's approach cannot work in general. His code is built upon two major misunderstandings:
He assumes that one can translate a complete content stream from byte[] to String (with all string parameters of text-showing operators being legible) using a single character encoding.
This assumption is wrong: Each font may have its own encoding, so if multiple fonts are used on the same page, the same byte value in string operands of different text-showing operators may represent completely different characters. Actually, fonts do not even need to contain a mapping to characters; they merely need to map numeric values to glyph painting instructions.
Cf. section 9.4.3 Text-Showing Operators in ISO 32000-1:
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap,
Simple PDF generators often merely use standard encodings (which are ASCII'ish and may give rise to assumptions like the OP's one) but there are more and more non-simple PDF generators out there...
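To make the first point concrete, here is a minimal sketch in plain Java (no iText involved). The class name EncodingSketch and its helper methods are made up for illustration. It decodes the same two bytes once as single-byte codes and once as a multi-byte code, loosely analogous to how a simple font and a composite font would read the same string operand differently:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of iText or of the OP's code.
class EncodingSketch {
    // Reads every byte as one character code (ISO-8859-1 maps byte n to U+00nn).
    static String asSingleByteCodes(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Reads the bytes as UTF-8, where several bytes may form one character code.
    static String asUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] operand = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(asSingleByteCodes(operand).length()); // 2
        System.out.println(asUtf8(operand).length());            // 1
    }
}
```

The bytes are identical in both calls; only the assumed encoding differs, and with it the resulting text.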
He assumes he can simply edit the string operands of text-showing operators and the matching glyphs will be shown in the PDF viewer.
This assumption is wrong: Fonts usually only support a fairly limited character set, and a text showing operator uses only a single font, the currently selected one. If one replaces a code in a string argument of such an operator with a different one without a matching glyph in the font, one will at most see a gap!
While complete fonts usually at least contain glyphs for all characters of a kind (e.g. Latin letters with all Western European variations thereof), PDF allows embedding fonts partially, cf. section 9.6.4 Font Subsets in ISO 32000-1:
PDF documents may include subsets of Type 1 and TrueType fonts.
This option is now often used to embed painting instructions only for the glyphs actually used in the existing text. Thus, one cannot count on an embedded font containing all characters of a kind just because it contains some of them. There may be a glyph for A and C but not for B.
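The effect of a font subset can be modelled with a small sketch in plain Java. This is only a toy model: the class SubsetFontSketch and the method missingGlyphs are hypothetical names, and a real font subset is a set of glyph programs, not a set of Java characters. The sketch reports which characters of a replacement text have no glyph in the subset:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical toy model of a partially embedded font; not an iText API.
class SubsetFontSketch {
    // Returns the distinct characters of text that the subset has no glyph for,
    // in order of first appearance.
    static String missingGlyphs(String text, Set<Character> availableGlyphs) {
        Set<Character> missing = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            if (!availableGlyphs.contains(c)) {
                missing.add(c);
            }
        }
        StringBuilder result = new StringBuilder();
        for (char c : missing) {
            result.append(c);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Assumed subset: glyphs for A and C were embedded, B was not.
        Set<Character> subset = new LinkedHashSet<>(Arrays.asList('A', 'C'));
        System.out.println(missingGlyphs("ABC", subset)); // B
    }
}
```

In a viewer, the characters reported here would be the ones that show up as gaps.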
In the case at hand
Unfortunately, the OP has not supplied his sample PDF. The symptoms, though:
- his call replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z") makes a difference, as can be seen in his screenshot and in his comment to Viacheslav Vedenin's answer:
  Before the text was (Nome Completo)Tj and after (A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z)Tj
- but some codes do not show as the expected glyphs, as can also be seen in the screenshot above
point in the direction that the latter of his two major false assumptions described above makes the OP's code fail him: Most likely the font in question uses a standard encoding (probably WinAnsiEncoding) but is only partially embedded, in particular without the capital letters K, W, X, and Y.
How to do it correctly
Instead of blindly editing the content stream, the OP (who already is using iText) can use the following iText concepts:
- text extraction classes can be used to also extract the coordinates of text (cf. multiple answers on Stack Overflow), in particular the bounding rectangle of the text he wants to replace;
- the iText xtra library class PdfCleanUpProcessor can be used to remove all content existing in that bounding rectangle;
- PdfStamper.getOverContent() can then be used to properly add new content at those coordinates.
This may sound complicated but this takes care of a number of additional minor misconceptions visible in the OP's approach.
Instead of
    stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());
try the following code:
    stream.setData(new String(data, "UTF8").replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes("UTF8"));
According to this page in the Oracle manual, using new String(data) and getBytes() without an explicit encoding can lead to errors:
Byte Encodings and Strings
If a byte array contains non-Unicode text, you can convert the text to
Unicode with one of the String constructor methods. Conversely, you
can convert a String object into a byte array of non-Unicode
characters with the String.getBytes method. When invoking either of
these methods, you specify the encoding identifier as one of the
parameters.
The example that follows converts characters between UTF-8 and
Unicode. UTF-8 is a transmission format for Unicode that is safe for
UNIX file systems. The full source code for the example is in the file
StringConverter.java.
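As a hedged illustration of why the encoding argument matters (plain Java, with the hypothetical class name RoundTripSketch): decoding bytes and encoding them again only reproduces the original bytes if every byte sequence is valid in the chosen encoding. ISO-8859-1 maps each byte to exactly one character, so it round-trips arbitrary bytes, whereas UTF-8 decoding replaces malformed sequences. Whether a given content stream is valid UTF-8 depends on the PDF, so this is a sketch of the mechanism, not a recommendation for a specific charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of iText.
class RoundTripSketch {
    // Decodes the bytes with the given charset and encodes them again.
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    // True if the bytes are unchanged after decoding and re-encoding.
    static boolean survives(byte[] data, Charset charset) {
        return Arrays.equals(data, roundTrip(data, charset));
    }

    public static void main(String[] args) {
        // "(", a lone 0xE9 byte, ")": 0xE9 followed by 0x29 is malformed UTF-8.
        byte[] data = {0x28, (byte) 0xE9, 0x29};
        System.out.println(survives(data, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survives(data, StandardCharsets.UTF_8));      // false
    }
}
```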
Update:
If it isn't working, can you replace the code
    byte[] data = PdfReader.getStreamBytes(stream);
    stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());
with the code
    byte[] data = PdfReader.getStreamBytes(stream);
    String str = new String(data);
    System.out.println(str);
    String newStr = str.replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z");
    System.out.println(newStr);
    stream.setData(newStr.getBytes());
and post what is shown in the console?
I modified the code I found a bit, and it worked as follows:
public static final String SRC = "C:/tmp/244558.pdf";
public static final String DEST = "C:/tmp/244558-2.pdf";

public static void main(String[] args) throws IOException, DocumentException {
    File file = new File(DEST);
    file.getParentFile().mkdirs();
    new Main().manipulatePdf(SRC, DEST);
}

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    // The page's /Contents entry is either an array of streams or a single indirect stream.
    PdfArray refs = null;
    if (dict.get(PdfName.CONTENTS).isArray()) {
        refs = dict.getAsArray(PdfName.CONTENTS);
    } else if (dict.get(PdfName.CONTENTS).isIndirect()) {
        refs = new PdfArray(dict.get(PdfName.CONTENTS));
    }
    // Apply the replacement to every content stream of the page.
    for (int i = 0; i < refs.getArrayList().size(); i++) {
        PRStream stream = (PRStream) refs.getDirectObject(i);
        byte[] data = PdfReader.getStreamBytes(stream);
        stream.setData(new String(data).replace("Data replace", "Data").getBytes());
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}