I have some code that uses the Java Apache POI library to open a Microsoft word document and convert it to html, using the the Apache POI and it also gets the byte array data of images on the document. But I need to convert this information to html to write out to an html file. Any hints or suggestions would be appreciated. Keep in mind that I am a desktop dev developer and not a web programmer, so when you make suggestions, please remember that. The code below gets the image.
private void parseWordText(File file) throws IOException {
FileInputStream fs = new FileInputStream(file);
doc = new HWPFDocument(fs);
PicturesTable picTable = doc.getPicturesTable();
if (picTable != null){
picList = new ArrayList<Picture>(picTable.getAllPictures());
if (!picList.isEmpty()) {
for (Picture pic : picList) {
byte[] byteArray = pic.getContent();
pic.suggestFileExtension();
pic.suggestFullFileName();
pic.suggestPictureType();
pic.getStartOffset();
}
}
}
Then the code below this converts the document to html. Is there a way to add the byteArray to the ByteArrayOutputStream in the code below?
private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
HWPFDocumentCore wordDocument = null;
try {
wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(file));
} catch (IOException ex) {
Exceptions.printStackTrace(ex);
}
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
wordToHtmlConverter.processDocument(wordDocument);
org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
NamedNodeMap node = htmlDocument.getAttributes();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();
String result = new String(out.toByteArray());
acDocTextArea.setText(newDocText);
htmlText = result;
}
@user4887078 It's straight forward just as @Guga said, all I did was to look org.apache.poi.xwpf.converter.core.FileImageExtractor and Voila! It sure works as expected, although it might still need some refactoring and optimization.
Use this should be useful.
Looking at the source code for the
org.apache.poi.hwpf.converter.WordToHtmlConverter
athttp://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740
It states in the JavaDoc:
This implementation doesn't create images or links to them. This can be changed by overriding {@link #processImage(Element, boolean, Picture)} method
If you take a look at that
processImage(...)
method in AbstractWordConverter.java at line 790, it looks like the method is calling then another method namedprocessImageWithoutPicturesManager(...)
.http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740
This method is defined in
WordToHtmlConverter
again and looks suspiciously exact like the place you want to grow your code (line 317):I think you have the point where to start inserting the images into the flow.
Create a subclass of the converter, e.g.
and then override the method and place whatever code into it.
I haven't tested it, but it should be the right way from what I see theoretically.