docx4j conversion html->docx->html

Posted 2019-06-09 19:22

I'm working on my first project using docx4j... My goal is to export xhtml from a webapp (ckeditor created html) into a docx, edit it in Word, then import it back into the ckeditor wysiwyg.

(*crosspost from http://www.docx4java.org/forums/xhtml-import-f28/html-docx-html-inserts-a-lot-of-space-t1966.html#p6791?sid=78b64a02482926c4dbdbafbf50d0a914 will update when answered)

I have created an html test document with the following contents:

<html><ul><li>TEST LINE 1</li><li>TEST LINE 2</li></ul></html>

My code then creates a docx from this html like so:

    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

    NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
    wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
    ndp.unmarshalDefaultNumbering();

    XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
    xHTMLImporter.setHyperlinkStyle("Hyperlink");

    wordMLPackage.getMainDocumentPart().getContent()
            .addAll(xHTMLImporter.convert(new File("test.html"), null));

    System.out.println(XmlUtils.marshaltoString(wordMLPackage
            .getMainDocumentPart().getJaxbElement(), true, true));

    wordMLPackage.save(new java.io.File("test.docx"));

My code then attempts to convert the docx BACK to html like so:

    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();

    NumberingDefinitionsPart ndp = new NumberingDefinitionsPart();
    wordMLPackage.getMainDocumentPart().addTargetPart(ndp);
    ndp.unmarshalDefaultNumbering();

    XHTMLImporterImpl xHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
    xHTMLImporter.setHyperlinkStyle("Hyperlink");

    WordprocessingMLPackage docx = WordprocessingMLPackage.load(new File("test.docx"));
    AbstractHtmlExporter exporter = new HtmlExporterNG2();
    OutputStream os = new java.io.FileOutputStream("test.html");
    HTMLSettings htmlSettings = new HTMLSettings();
    javax.xml.transform.stream.StreamResult result = new javax.xml.transform.stream.StreamResult(
            os);
    exporter.html(docx, result, htmlSettings);

The html returned is:

<?xml version="1.0" encoding="UTF-8"?><html xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<style>
<!--/*paged media */ div.header {display: none }div.footer {display: none } /*@media print { */@page { size: A4; margin: 10%; @top-center {content: element(header) } @bottom-center {content: element(footer) } }/*element styles*/ .del  {text-decoration:line-through;color:red;} .ins {text-decoration:none;background:#c0ffc0;padding:1px;}
 /* TABLE STYLES */ 

 /* PARAGRAPH STYLES */ 
.DocDefaults {display:block;margin-bottom: 4mm;line-height: 115%;font-size: 11.0pt;}
.Normal {display:block;}

 /* CHARACTER STYLES */ span.DefaultParagraphFont {display:inline;}
-->
</style>
<script type="text/javascript">
<!--function toggleDiv(divid){if(document.getElementById(divid).style.display == 'none'){document.getElementById(divid).style.display = 'block';}else{document.getElementById(divid).style.display = 'none';}}
--></script>
</head>
<body>

  <!-- userBodyTop goes here -->




<div class="document">


<p class="Normal DocDefaults " style="text-align: left;position: relative; margin-left: 17mm;text-indent: -0.25in;margin-bottom: 0in;">&bull;  <span class="DefaultParagraphFont " style="font-weight: normal;color: #000000;font-style: normal;font-size: 11.0pt;">TEST LINE 1</span>
</p>


<p class="Normal DocDefaults " style="text-align: left;position: relative; margin-left: 17mm;text-indent: -0.25in;margin-bottom: 0in;">&bull;  <span class="DefaultParagraphFont " style="font-weight: normal;color: #000000;font-style: normal;font-size: 11.0pt;">TEST LINE 2</span>
</p>
</div>







  <!-- userBodyTail goes here -->


</body>
</html>

There is now a lot of extra space after each line. I'm not sure why this is happening; the conversion appears to add a lot of extra white space/carriage returns.

2 Answers
唯我独甜
#2 · 2019-06-09 20:06

It's not clear from your question whether you are worried about whitespace in the (X)HTML source document, or in your page as rendered (presumably in CKEditor). If the latter, then the browser and CKEditor version may be relevant.

Whitespace may or may not be significant; try Googling 'xhtml significant whitespace' for more.
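
One way to tell the two cases apart is to remove the whitespace-only lines from the exported markup and see whether the rendered page changes. Below is a minimal sketch using only the JDK; the class and method names (`BlankLineStripper`, `stripBlankLines`) are made up for illustration and are not part of docx4j. If the page looks the same afterwards, the blank lines in the source were never the cause, and the visible gaps more likely come from CSS such as the `margin-bottom` and `line-height` rules in the exported stylesheet.

```java
import java.util.regex.Pattern;

public class BlankLineStripper {
    // Matches a line made up only of spaces/tabs, together with its line terminator.
    private static final Pattern BLANK_LINE = Pattern.compile("(?m)^[ \\t]*\\r?\\n");

    /**
     * Removes whitespace-only lines from the markup.
     * Caveat: this is a plain text operation, so blank lines inside <pre> would be removed too.
     */
    public static String stripBlankLines(String html) {
        return BLANK_LINE.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String exported = "<div class=\"document\">\n\n\n<p>TEST LINE 1</p>\n\n\n<p>TEST LINE 2</p>\n</div>\n";
        System.out.print(stripBlankLines(exported));
    }
}
```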

By way of background, depending on docx4j property docx4j.Convert.Out.HTML.OutputMethodXML, docx4j will use

    <xsl:output method="html" encoding="utf-8" omit-xml-declaration="no" indent="no"
        doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
        doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>

or

    <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="no" indent="no"
        doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
        doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>

Note the difference in the value of @method. If you want something different, you can modify docx2html.xsl or docx2xhtml.xsl respectively.
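
To see what the `method` value changes in practice, here is a small standalone sketch that serializes the same markup twice with the JDK's identity transformer, once as `html` and once as `xml`. It does not use docx4j or its stylesheets; the class name `OutputMethodDemo` and the sample markup are assumptions made for illustration. With the JDK's built-in serializer, the `html` method writes empty elements like `<br>` without a closing slash and omits the XML declaration, while the `xml` method keeps `<br/>` and emits the declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class OutputMethodDemo {
    /** Runs an identity transform over the markup using the given output method ("html" or "xml"). */
    public static String serialize(String xhtml, String method) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, method);
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>t</title></head><body><p>a<br/>b</p></body></html>";
        System.out.println(serialize(xhtml, "html"));
        System.out.println(serialize(xhtml, "xml"));
    }
}
```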

我只想做你的唯一
#3 · 2019-06-09 20:11

Is there a way to convert wordMLPackage to html without all the extra stuff like:

<?xml version="1.0" encoding="UTF-8"?>

and the css?

Could it just be something as simple as the original html with inline css, like <html><body><div style="...."></div></body></html> ?
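
Failing an exporter setting for this, one possible workaround is to post-process the exported string. The sketch below is only that: a regex-based cleanup using the JDK, with a hypothetical class name (`HtmlExportCleaner`), not a docx4j feature. It drops a leading XML declaration and any `<style>` or `<script>` elements, and leaves inline `style="..."` attributes alone. Regexes are fragile on arbitrary HTML, so treat it as a starting point for output shaped like the sample above.

```java
import java.util.regex.Pattern;

public class HtmlExportCleaner {
    // A leading <?xml ... ?> declaration, optionally preceded by whitespace.
    private static final Pattern XML_DECL = Pattern.compile("^\\s*<\\?xml[^>]*\\?>");

    // Whole <style>...</style> and <script>...</script> elements (case-insensitive, spanning lines).
    private static final Pattern STYLE_OR_SCRIPT =
            Pattern.compile("(?is)<(style|script)\\b[^>]*>.*?</\\1\\s*>");

    /** Returns the markup without the XML declaration, stylesheet blocks, or script blocks. */
    public static String clean(String html) {
        String withoutDecl = XML_DECL.matcher(html).replaceFirst("");
        return STYLE_OR_SCRIPT.matcher(withoutDecl).replaceAll("");
    }

    public static void main(String[] args) {
        String exported = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><html><head>"
                + "<style>\n<!-- .Normal {display:block;} -->\n</style></head>"
                + "<body><p style=\"margin-bottom: 0in;\">TEST LINE 1</p></body></html>";
        System.out.println(clean(exported));
    }
}
```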
