I have some code that uses the Apache POI library for Java to open a Microsoft Word document and convert it to HTML. It also gets the byte array data of the images in the document, but I need to convert that image data to HTML as well so that I can write everything out to an HTML file. Any hints or suggestions would be appreciated. Please keep in mind that I am a desktop developer, not a web programmer, when making suggestions. The code below gets the images.
private void parseWordText(File file) throws IOException {
    // The document is read fully into memory, so the stream can be closed right away
    try (FileInputStream fs = new FileInputStream(file)) {
        doc = new HWPFDocument(fs);
    }
    PicturesTable picTable = doc.getPicturesTable();
    if (picTable != null) {
        picList = new ArrayList<Picture>(picTable.getAllPictures());
        if (!picList.isEmpty()) {
            for (Picture pic : picList) {
                byte[] byteArray = pic.getContent(); // raw image bytes
                pic.suggestFileExtension();
                pic.suggestFullFileName();
                pic.suggestPictureType();
                pic.getStartOffset();
            }
        }
    }
}
The code below then converts the document to HTML. Is there a way to add the byteArray to the ByteArrayOutputStream in the code below?
private void convertWordDoctoHTML(File file) throws ParserConfigurationException,
        TransformerConfigurationException, TransformerException, IOException {
    HWPFDocumentCore wordDocument = null;
    try {
        wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
    }
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
            DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    NamedNodeMap node = htmlDocument.getAttributes();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();
    // Decode with the same encoding the serializer was told to write
    String result = out.toString("UTF-8");
    acDocTextArea.setText(newDocText);
    htmlText = result;
}
Looking at the source code for org.apache.poi.hwpf.converter.WordToHtmlConverter at
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740
the Javadoc states:
This implementation doesn't create images or links to them. This can be
changed by overriding {@link #processImage(Element, boolean, Picture)} method
If you take a look at the processImage(...) method in AbstractWordConverter.java at line 790, it looks like it in turn calls another method named processImageWithoutPicturesManager(...).
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740
This method is overridden in WordToHtmlConverter and looks exactly like the place where you want to add your code (line 317):
@Override
protected void processImageWithoutPicturesManager(Element currentBlock,
        boolean inlined, Picture picture)
{
    // no default implementation -- skip
    currentBlock.appendChild(htmlDocumentFacade.document
            .createComment("Image link to '"
                    + picture.suggestFullFileName() + "' can be here"));
}
I think that gives you the point at which to start inserting the images into the flow.
Create a subclass of the converter, e.g.
public class InlineImageWordToHtmlConverter extends WordToHtmlConverter
and then override the method and put whatever code you need into it.
I haven't tested it, but from what I can see it should be the right approach.
@user4887078 It's straightforward, just as @Guga said. All I did was look at org.apache.poi.xwpf.converter.core.FileImageExtractor and voilà!
It works as expected, although it might still need some refactoring and optimization.
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(is);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .newDocument());
wordToHtmlConverter.setPicturesManager(new PicturesManager() {
    @Override
    public String savePicture(byte[] bytes, PictureType pictureType, String suggestedName,
            float widthInches, float heightInches) {
        // Save each picture under pages/imgs using the name POI suggests
        File imageFile = new File("pages/imgs", suggestedName);
        imageFile.getParentFile().mkdirs();
        InputStream in = null;
        FileOutputStream out = null;
        try {
            in = new ByteArrayInputStream(bytes);
            out = new FileOutputStream(imageFile);
            IOUtils.copy(in, out);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (in != null) {
                IOUtils.closeQuietly(in);
            }
            if (out != null) {
                IOUtils.closeQuietly(out);
            }
        }
        // The returned string becomes the src of the generated img element
        return "imgs/" + imageFile.getName();
    }
});
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.transform(domSource, streamResult);
out.close();
// Decode with the same encoding the transformer was told to write
String result = out.toString("UTF-8");
// Write the generated HTML to outFile
try (FileOutputStream fos = new FileOutputStream(outFile)) {
    fos.write(result.getBytes("UTF-8"));
}
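For what it's worth, the file-writing part of savePicture can also be done with java.nio.file, which avoids the manual stream handling. This is only a minimal sketch, assuming Java 7 or later; the ImageSaver class and the saveImage helper are names I made up for illustration and are not part of POI. IOExceptions are wrapped in an unchecked exception because savePicture does not declare any checked exceptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ImageSaver {

    /**
     * Writes the image bytes to dir/fileName and returns a link relative to
     * the parent of dir, e.g. "imgs/picture1.png" for dir "pages/imgs".
     * (Hypothetical helper, not a POI method.)
     */
    static String saveImage(byte[] bytes, Path dir, String fileName) {
        try {
            Files.createDirectories(dir);              // no error if the directory already exists
            Files.write(dir.resolve(fileName), bytes); // creates or overwrites the file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dir.getFileName() + "/" + fileName;
    }

    public static void main(String[] args) {
        byte[] dummyImage = {(byte) 0x89, 'P', 'N', 'G'}; // placeholder bytes, not a real image
        String link = saveImage(dummyImage, Paths.get("pages", "imgs"), "picture1.png");
        System.out.println(link);
    }
}
```

Inside savePicture you would then return the result of saveImage, passing along the byte array and the suggested file name that POI hands to the method.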
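The following subclass should be useful. It embeds each image directly in the HTML as a Base64 data URI instead of linking to a file.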
public class InlineImageWordToHtmlConverter extends WordToHtmlConverter {
    public InlineImageWordToHtmlConverter(Document document) {
        super(document);
    }

    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
        Element img = getDocument().createElement("img");
        // Use the picture's own MIME type rather than assuming every image is a PNG
        img.setAttribute("src", "data:" + picture.getMimeType() + ";base64,"
                + Base64.getEncoder().encodeToString(picture.getContent()));
        currentBlock.appendChild(img);
    }
}
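To see what the generated markup looks like without needing a Word file, here is a small stand-alone sketch of the same data URI idea using only the JDK. DataUriDemo, toDataUri and appendInlineImage are hypothetical names for illustration; inside the converter subclass you would pass the picture's bytes and MIME type instead of the dummy values used here.

```java
import java.io.StringWriter;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DataUriDemo {

    /** Builds a data URI of the form "data:<mimeType>;base64,<encoded bytes>". */
    static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    /** Appends an img element whose src carries the image bytes inline. */
    static Element appendInlineImage(Element parent, String mimeType, byte[] content) {
        Element img = parent.getOwnerDocument().createElement("img");
        img.setAttribute("src", toDataUri(mimeType, content));
        parent.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element body = doc.createElement("body");
        doc.appendChild(body);

        // Dummy bytes standing in for picture.getContent()
        appendInlineImage(body, "image/png", new byte[] {1, 2, 3});

        // Serialize the DOM the same way the conversion code does
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter html = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(html));
        System.out.println(html);
    }
}
```

The trade-off compared with saving the pictures to disk is that the HTML file becomes self-contained but larger, since Base64 adds roughly a third to the size of each image.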