I use Apache PDFBox 2.0.0 in my Java code (Java 1.6). I'm trying to figure out how to read, replace, and save back to my PDF the data stored in
<stream> data here... <endstream>
My PDF file looks like this:
596 0 obj
<<
/Filter /FlateDecode
/Length 3739
>>
stream
xњ[ЫnЬF}џoШ8эІАђhЮ/‰`@С%Hvќd-н“іXPJГ ...
endstream
endobj
I've found a way to decode this stream: I used the "WriteDecodedDoc" command from pdfbox-app-1.8.10.jar. So now I have two variants of the file, but I have no idea how to work with this stream. The stream contains the footer and header, where the images and text are placed.
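For reference, this is my understanding of what decoding a /FlateDecode stream amounts to. It is a minimal sketch using only java.util.zip, not the code that WriteDecodedDoc actually runs. The class name FlateDemo and the sample content are made up for illustration, and the sketch assumes the stream has a single FlateDecode filter with no predictor parameters.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: round-trips bytes through zlib, the format behind /FlateDecode.
final class FlateDemo {

    // Compress raw content-stream bytes (what a writer does before storing them).
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress the bytes found between "stream" and "endstream".
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        byte[] raw = "BT /T1_1 1 Tf (hello) Tj ET".getBytes(latin1);
        byte[] packed = deflate(raw);
        // Print the decoded operators again after the round trip.
        System.out.println(new String(inflate(packed), latin1));
    }
}
```

The leading "x" in the raw stream I pasted above is consistent with a zlib header byte (0x78), which is why the content is unreadable until it is inflated.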
I checked my file with the PDFTextStripper class. It can read the necessary data from the streams, but I can't use this class to replace the data and save it back to the PDF file.
I tried to replace the text by opening the file as text, searching for the text, replacing it only in the stream, and saving. But then I get the error "Cannot extract the embedded font...". The main reason is that I lose the encoding. I tried changing the encoding, but it didn't help.
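To illustrate what I mean by losing the encoding, here is a small JDK-only sketch. The class name TextRoundTrip and the sample bytes are made up. It decodes binary bytes as text and encodes them again, which is roughly what a text-based open/save does to the compressed streams and embedded fonts.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo: what happens to binary PDF bytes when they pass through a String.
final class TextRoundTrip {

    // Decode bytes as text in the given charset, then encode them again.
    static byte[] roundTrip(byte[] original, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(original, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // A few bytes as they might appear in a compressed stream (not valid UTF-8).
        byte[] binary = {0x78, (byte) 0x9C, (byte) 0xFF, (byte) 0x80, 0x41};

        boolean utf8Safe = Arrays.equals(binary, roundTrip(binary, "UTF-8"));
        boolean latin1Safe = Arrays.equals(binary, roundTrip(binary, "ISO-8859-1"));

        System.out.println("UTF-8 preserves bytes: " + utf8Safe);
        System.out.println("ISO-8859-1 preserves bytes: " + latin1Safe);
    }
}
```

Even with a byte-preserving charset such as ISO-8859-1, editing the file as text still changes byte counts, so the /Length entries and the byte offsets in the cross-reference table no longer match unless they are rewritten as well.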
BTW, I can't use iText. I have to use free libraries here.
Thanks for any solution.
Edit:
After decoding, the stream looks like this:
stream
/CS0 CS 0.412 0.416 0.423 SCN
0.25 w
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Span <</Lang (en-US)/MCID 83 >>BDC
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC
( )Tj
endstream
I need to replace one link with a different link inside the stream. This one:
[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ
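The visible text of that TJ operand is the concatenation of its string pieces; the numbers in between are only kerning adjustments. A rough JDK-only sketch of that idea follows. TjText, joinStrings, and toTjOperand are hypothetical names, and the sketch assumes literal strings without escaped or nested parentheses and a simple single-byte font encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reasoning about TJ operands as plain text.
final class TjText {

    // One literal string "( ... )" with no parentheses or backslashes inside.
    private static final Pattern LITERAL = Pattern.compile("\\(([^()\\\\]*)\\)");

    // Join the string pieces of a TJ array, ignoring the kerning numbers.
    static String joinStrings(String tjOperand) {
        StringBuilder sb = new StringBuilder();
        Matcher m = LITERAL.matcher(tjOperand);
        while (m.find()) {
            sb.append(m.group(1));
        }
        return sb.toString();
    }

    // Build a replacement operand with a single string and no kerning.
    static String toTjOperand(String text) {
        StringBuilder sb = new StringBuilder("[(");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '(' || c == ')') {
                sb.append('\\'); // these characters must be escaped in a PDF literal string
            }
            sb.append(c);
        }
        return sb.append(")] TJ").toString();
    }

    public static void main(String[] args) {
        String operand =
            "[(www)11(.li)-14.9(nkto)-10(thesi)-8(tesho)-7.9(ouldbehere)15.1(.com)]TJ";
        System.out.println(joinStrings(operand));
        System.out.println(toTjOperand("test.test.com"));
    }
}
```

This only works on the decoded stream text; it does not check whether the font used by the stream actually contains glyphs for the new characters.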
Edit 2 (code):
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

public static void replaceLinksInPdf(String filePath) {
    PDDocument document = null;
    try {
        document = PDDocument.load(new File(filePath));
        if (document.isEncrypted()) {
            document.setAllSecurityToBeRemoved(true);
            System.out.println(filePath + " Doc was decrypted");
        }
        // COSBase cosb = document.getDocument().getObjects().get(27);
        // E.g. this object holds the <stream> bytes <endstream> in the PDF file.
        // It looks like this:
        // document -> getDocument() -> objectPool #27 -> baseObject -> randomAccess -> bufferList (size 10) holds data that I can't open or work with
        // document -> getDocument() -> objectPool #27 -> baseObject -> items -> all the PDF tags, but no stream section
        int pageNum = 0;
        for (PDPage page : document.getPages()) {
            // Parse only the page's own content stream into tokens.
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            List<Object> newTokens = new ArrayList<Object>();
            for (Object token : tokens) {
                if (token instanceof Operator) {
                    // Debug output: inline image parameters, if any.
                    COSDictionary dictionary = ((Operator) token).getImageParameters();
                    if (dictionary != null) {
                        System.out.println(dictionary.toString());
                    }
                }
                if (token instanceof Operator) {
                    Operator op = (Operator) token;
                    if (op.getName().equals("Tj")) {
                        // Tj takes one COSString operand, which is the previous token.
                        COSString previous = (COSString) newTokens.get(newTokens.size() - 1);
                        String string = previous.getString();
                        // Check whether the string is the link to replace.
                        if (string.equals("www.linkhouldbehere.com")) {
                            COSArray newLink = new COSArray();
                            newLink.add(new COSString("test2.test2.com"));
                            newTokens.set(newTokens.size() - 1, newLink);
                        }
                    } else if (op.getName().equals("TJ")) {
                        // TJ takes a COSArray of COSStrings and numbers (kerning adjustments).
                        COSArray previous = (COSArray) newTokens.get(newTokens.size() - 1);
                        String string = "";
                        for (int k = 0; k < previous.size(); k++) {
                            Object arrElement = previous.getObject(k);
                            if (arrElement instanceof COSString) {
                                COSString cosString = (COSString) arrElement;
                                String content = cosString.getString();
                                string += content;
                            }
                        }
                        // Check whether the joined string is the link to replace.
                        if (string.equals("www.linkhouldbehere.com")) {
                            COSArray newLink = new COSArray();
                            newLink.add(new COSString("test.test.com"));
                            newTokens.set(newTokens.size() - 1, newLink);
                        } else if (string.startsWith("www.linkhouldbehere.com")) {
                            // Some magic here to remove all indents and show the new link from the beginning.
                            // No rules, just a test, and it works here.
                            COSArray newLink = (COSArray) newTokens.get(newTokens.size() - 1);
                            int size = newLink.size();
                            float f = ((COSFloat) newLink.get(size - 4)).floatValue();
                            for (int i = 0; i < size - 4; i++) {
                                newLink.remove(0);
                            }
                            newLink.set(0, new COSString("test.test.com"));
                            // Offset that positions the date on the right. Should be checked.
                            newLink.set(1, new COSFloat(f - 8000));
                            newTokens.set(newTokens.size() - 1, newLink);
                        }
                    }
                }
                newTokens.add(token);
            }
            // Save the replaced content back into the page.
            PDStream newContents = new PDStream(document);
            OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
            ContentStreamWriter writer = new ContentStreamWriter(out);
            writer.writeTokens(newTokens);
            out.close();
            page.setContents(newContents);
            // Replace all links that have a pop-up line (link annotations).
            pageNum++;
            List<PDAnnotation> annotations = page.getAnnotations();
            for (PDAnnotation annotation : annotations) {
                PDAnnotation annot = annotation;
                if (annot instanceof PDAnnotationLink) {
                    PDAnnotationLink link = (PDAnnotationLink) annot;
                    PDAction action = link.getAction();
                    if (action instanceof PDActionURI) {
                        PDActionURI uri = (PDActionURI) action;
                        String newURI = "www.test1.test1.com";
                        uri.setURI(newURI);
                    }
                }
            }
        }
        // Save the file under a new name.
        document.save(filePath.replace("file", "file_result"));
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Edit 3:
The PDF contains object 660 0 obj, which holds the link I need to replace:
660 0 obj
<<
/BBox [0.0 792.0 612.0 0.0]
/Length 792
/Matrix [1.0 0.0 0.0 1.0 0.0 0.0]
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
>>
/Font <<
/T1_0 834 0 R
/T1_1 835 0 R
/T1_2 836 0 R
>>
/ProcSet [/PDF /Text]
>>
/Subtype /Form
>>
stream
/CS0 CS 0.412 0.416 0.423 SCN
0.25 w
/GS0 gs
q 1 0 0 1 72 78.425 cm
0 0 m
468 0 l
S
Q
/Artifact <</O /Layout >>BDC
BT
/CS0 cs 0.412 0.416 0.423 scn
/T1_0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 8 0 0 8 72 64.8 Tm
[(Visit )35(O)7(ur site R)23.1(esear)15.1(ch Manager )20.1(on )20(the )12(web at )]TJ
ET
EMC
/Artifact <</O /Layout >>BDC
BT
/T1_1 1 Tf
8 0 0 8 237.0609 64.8 Tm
[(www)11(.lin)-14.9(kshou)-10(ldbeh)-8(ere)-7.9(ninechars)15.1(.com)]TJ
/Span<</ActualText<FEFF0009>>> BDC
( )Tj
EMC
31.954 0 Td
[(A)15(ugust 7)45.1(,)-5( 2015)]TJ
ET
EMC
/Artifact <</O /Layout >>BDC
BT
/T1_0 1 Tf
8 0 0 8 540 64.8 Tm
( )Tj
ET
EMC
/Artifact <</O /Layout >>BDC
BT
/T1_2 1 Tf
7 0 0 7 72 55.3 Tm
[(\251 2015 )29(CCH Incorporated and its af\037liates. )38.3(All rights r)12(eserv)8.1(ed.)]TJ
ET
EMC
endstream
I found only one place in the PDF file that references it, in 45 0 obj:
/XObject <<
/Fm0 660 0 R
/Fm1 661 0 R
>>
The full text of that object:
45 0 obj
<<
/ArtBox [0.0 0.0 612.0 792.0]
/BleedBox [0.0 0.0 612.0 792.0]
/Contents 658 0 R
/CropBox [0.0 0.0 612.0 792.0]
/Group 659 0 R
/MediaBox [0.0 0.0 612.0 792.0]
/Parent 13 0 R
/Resources <<
/ColorSpace <<
/CS0 [/ICCBased 21 0 R]
>>
/ExtGState <<
/GS0 22 0 R
/GS1 23 0 R
>>
/Font <<
/T1_0 597 0 R
/T1_1 26 0 R
/T1_2 28 0 R
/T1_3 25 0 R
>>
/ProcSet [/PDF /Text]
/XObject <<
/Fm0 660 0 R
/Fm1 661 0 R
>>
>>
/Rotate 0
/StructParents 22
/Tabs /W
/Thumb 662 0 R
/TrimBox [0.0 0.0 612.0 792.0]
/Type /Page
/Annots []
>>
endobj
The question is: can I get this 660 0 obj and process it with PDFBox? It looks like PDFStreamParser doesn't know anything about this 660 0 object. Thank you.
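As far as I understand, a form XObject such as Fm0 is painted from the page content stream with the Do operator (for example "/Fm0 Do"), so the page stream itself would contain only that call and not the text inside 660 0 obj. Here is a small JDK-only sketch for listing such calls in a decoded content stream. DoOperatorScan is a hypothetical name and the sample content is invented, since I have not pasted the decoded 658 0 obj here; the scan is naive and would also match text that happens to look like a Do call inside a string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: find XObject names invoked with the Do operator.
final class DoOperatorScan {

    // A PDF name ("/Fm0") followed by whitespace and the Do operator.
    private static final Pattern DO_CALL =
            Pattern.compile("/([^\\s/\\[\\]()<>{}%]+)\\s+Do(?=\\s|$)");

    static List<String> invokedXObjects(String decodedContent) {
        List<String> names = new ArrayList<String>();
        Matcher m = DO_CALL.matcher(decodedContent);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Invented page content: it only references the forms by resource name.
        String pageContent = "q 1 0 0 1 0 0 cm /Fm0 Do Q\nq /Fm1 Do Q";
        System.out.println(invokedXObjects(pageContent));
    }
}
```

If the page stream really contains only such calls, that would explain why parsing the page alone never reaches the link text, and the names found this way would be the keys to look up under /Resources /XObject.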