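I have some code that uses the Apache POI library for Java to open a Word document and convert it to HTML, and it also gets the byte array data of the images in the document. But I need to convert this information to HTML and write it out to an HTML file. Any hints or suggestions would be greatly appreciated. Please keep in mind that I am a desktop developer, not a web programmer, when making suggestions. The code below gets the images.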
private void parseWordText(File file) throws IOException {
    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fs = new FileInputStream(file)) {
        doc = new HWPFDocument(fs);
        PicturesTable picTable = doc.getPicturesTable();
        if (picTable != null) {
            picList = new ArrayList<Picture>(picTable.getAllPictures());
            for (Picture pic : picList) {
                byte[] byteArray = pic.getContent(); // raw image bytes
                // The return values below are not used yet
                pic.suggestFileExtension();
                pic.suggestFullFileName();
                pic.suggestPictureType();
                pic.getStartOffset();
            }
        }
    }
}
Then the code below converts the document to HTML. Is there a way to add the byteArray to the ByteArrayOutputStream in the code below?
private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    HWPFDocumentCore wordDocument = null;
    try {
        wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
    }
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    NamedNodeMap node = htmlDocument.getAttributes();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();
    // Decode with the same charset the serializer wrote, not the platform default
    String result = out.toString("UTF-8");
    acDocTextArea.setText(newDocText);
    htmlText = result;
}
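On the ByteArrayOutputStream part of the question: that stream only holds the serialized HTML markup, so an image has to be referenced from the DOM before `transform(...)` runs rather than appended to the stream afterwards. The following is a minimal JDK-only sketch of that idea. The class name, method names, and the hand-built sample document are hypothetical stand-ins; in the real code POI builds the document.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomSerializationSketch {

    /** Serializes a DOM tree to HTML text, decoding with the charset it was encoded in. */
    static String toHtml(Document document) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(new DOMSource(document), new StreamResult(out));
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize the document", e);
        }
    }

    /** Builds a tiny stand-in document containing one img element (hypothetical). */
    static Document sampleDocument(String imageSrc) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = document.createElement("html");
            Element body = document.createElement("body");
            Element img = document.createElement("img");
            img.setAttribute("src", imageSrc); // the image is referenced from the DOM
            body.appendChild(img);
            html.appendChild(body);
            document.appendChild(html);
            return document;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM parser available", e);
        }
    }
}
```

Whatever element is in the DOM at the time of the transform ends up in the output text, which is why the answers below hook into the converter instead of the stream.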
Looking at the source code for org.apache.poi.hwpf.converter.WordToHtmlConverter
at
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740
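it states in the JavaDoc:
This implementation doesn't create images or links to them. This can be changed by overriding the {@link #processImage(Element, boolean, Picture)} method.
If you look at that processImage(...) method in AbstractWordConverter.java at line 790, it looks like the method then calls another method named processImageWithoutPicturesManager(...):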
http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740
That method is overridden in WordToHtmlConverter, and it looks like exactly the place where you want to add your code (line 317):
@Override
protected void processImageWithoutPicturesManager(Element currentBlock,
        boolean inlined, Picture picture)
{
    // no default implementation -- skip
    currentBlock.appendChild(htmlDocumentFacade.document
            .createComment("Image link to '"
                    + picture.suggestFullFileName() + "' can be here"));
}
I think this is the point where you can start inserting the image.
Create a subclass of the converter, for example
public class InlineImageWordToHtmlConverter extends WordToHtmlConverter
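Then override the method and put whatever code you need into it.
I haven't tested it, but from what I can see it should be the right approach in theory.
@user4887078 It is straightforward, just as @Guga said. All I did was look at org.apache.poi.xwpf.converter.core.FileImageExtractor, and voilà! It definitely works as expected, but it may still need some refactoring and optimization.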
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(is);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .newDocument());
wordToHtmlConverter.setPicturesManager(new PicturesManager() {
    @Override
    public String savePicture(byte[] bytes, PictureType pictureType, String s, float v, float v1) {
        // s is the file name POI suggests for the picture
        File imageFile = new File("pages/imgs", s);
        imageFile.getParentFile().mkdirs();
        InputStream in = null;
        FileOutputStream out = null;
        try {
            in = new ByteArrayInputStream(bytes);
            out = new FileOutputStream(imageFile);
            IOUtils.copy(in, out);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (in != null) {
                IOUtils.closeQuietly(in);
            }
            if (out != null) {
                IOUtils.closeQuietly(out);
            }
        }
        // The returned string is the link the converter uses for the image
        return "imgs/" + imageFile.getName();
    }
});
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.transform(domSource, streamResult);
out.close();
// Decode with the same charset the transformer wrote, not the platform default
String result = out.toString("UTF-8");
FileOutputStream fos = new FileOutputStream(outFile);
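For the file handling in the snippet above, a JDK-only variant using java.nio.file might look like the sketch below. It assumes the same layout as the answer (images in an imgs folder next to the HTML file). The class name, method names, and parameters are hypothetical, and this is not part of the POI API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ImageFileSketch {

    /**
     * Writes the image bytes to baseDir/imgs/fileName and returns the relative
     * link ("imgs/fileName") that the generated HTML can use as the img src.
     */
    static String saveImage(Path baseDir, String fileName, byte[] imageBytes) {
        try {
            Path imageDir = baseDir.resolve("imgs");
            Files.createDirectories(imageDir); // no-op if it already exists
            Files.write(imageDir.resolve(fileName), imageBytes);
            return "imgs/" + fileName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the HTML text into baseDir as UTF-8 so the relative links resolve. */
    static Path writeHtml(Path baseDir, String htmlFileName, String html) {
        try {
            Files.createDirectories(baseDir);
            return Files.write(baseDir.resolve(htmlFileName),
                    html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside a PicturesManager, the body of savePicture could then delegate to something like saveImage(baseDir, s, bytes). Returning a path relative to the HTML file keeps the output folder movable as a unit.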
This should be useful.
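import java.util.Base64;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class InlineImageWordToHtmlConverter extends WordToHtmlConverter {
    public InlineImageWordToHtmlConverter(Document document) {
        super(document);
    }

    @Override
    protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
        // Embed the picture directly in the HTML as a Base64 data URI.
        // Note: the MIME type is hardcoded as PNG here.
        Element img = super.getDocument().createElement("img");
        img.setAttribute("src", "data:image/png;base64," + Base64.getEncoder().encodeToString(picture.getContent()));
        currentBlock.appendChild(img);
    }
}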
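The converter above labels every picture as image/png. If the document contains other formats, the MIME type could instead be derived from the extension that Picture.suggestFileExtension() returns (a method the question already calls). Here is a minimal JDK-only sketch of that idea. The class, the method names, and the extension list are assumptions, and the fallback type is just a placeholder.

```java
import java.util.Base64;
import java.util.Locale;

final class ImageMimeSketch {

    /** Maps a file extension such as "jpg" to a MIME type (illustrative list only). */
    static String mimeTypeFor(String extension) {
        switch (extension.toLowerCase(Locale.ROOT)) {
            case "png":
                return "image/png";
            case "jpg":
            case "jpeg":
                return "image/jpeg";
            case "gif":
                return "image/gif";
            case "bmp":
                return "image/bmp";
            default:
                // Unknown or empty extension: generic binary type as a placeholder
                return "application/octet-stream";
        }
    }

    /** Builds the value for an img src attribute from raw bytes and an extension. */
    static String toDataUri(byte[] imageBytes, String extension) {
        return "data:" + mimeTypeFor(extension) + ";base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
    }
}
```

In the override, the src attribute could then be set with something like toDataUri(picture.getContent(), picture.suggestFileExtension()).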