Add HTML String to OpenXML (*.docx) Document

Posted 2019-01-21 20:11

Question:

I am trying to use Microsoft's OpenXML 2.5 library to create an OpenXML document. Everything works great until I try to insert an HTML string into my document. I have scoured the web, and here is what I have come up with so far (snipped to just the portion I am having trouble with):

Paragraph paragraph = new Paragraph();
Run run = new Run();

string altChunkId = "id1";
AlternativeFormatImportPart chunk =
       document.MainDocumentPart.AddAlternativeFormatImportPart(
           AlternativeFormatImportPartType.Html, altChunkId);
chunk.FeedData(new MemoryStream(Encoding.UTF8.GetBytes(ioi.Text)));
AltChunk altChunk = new AltChunk { Id = altChunkId };

run.AppendChild(new Break());

paragraph.AppendChild(run);
body.AppendChild(paragraph);

Obviously, I haven't actually added the altChunk in this example, but I have tried appending it everywhere: to the run, the paragraph, the body, etc. In every case, I am unable to open the docx file in Word 2010.

This is making me a little nutty because it seems like it should be straightforward (I will admit that I'm not fully understanding the AltChunk "thing"). Would appreciate any help.

Side Note: One thing I did find that was interesting, and I don't know if it's actually a problem or not, is this response which says AltChunk corrupts the file when working from a MemoryStream. Can anybody confirm that this is/isn't true?

Answer 1:

I can reproduce the error "... there is a problem with the content" by using an incomplete HTML document as the content of the alternative format import part. For example, if you use the HTML snippet <h1>HELLO</h1>, MS Word is unable to open the document.

The code below shows how to add an AlternativeFormatImportPart to a Word document. (I've tested the code with MS Word 2013.)

using (WordprocessingDocument doc = WordprocessingDocument.Open(@"test.docx", true))
{
  string altChunkId = "myId";
  MainDocumentPart mainDocPart = doc.MainDocumentPart;

  var run = new Run(new Text("test"));
  var p = new Paragraph(new ParagraphProperties(
       new Justification() { Val = JustificationValues.Center }),
                     run);

  var body = mainDocPart.Document.Body;
  body.Append(p);        

  MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes("<html><head></head><body><h1>HELLO</h1></body></html>"));

  // Uncomment the following line to create an invalid word document.
  // MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes("<h1>HELLO</h1>"));

  // Create alternative format import part.
  AlternativeFormatImportPart formatImportPart =
     mainDocPart.AddAlternativeFormatImportPart(
        AlternativeFormatImportPartType.Html, altChunkId);
  //ms.Seek(0, SeekOrigin.Begin);

  // Feed HTML data into format import part (chunk).
  formatImportPart.FeedData(ms);
  AltChunk altChunk = new AltChunk();
  altChunk.Id = altChunkId;

  mainDocPart.Document.Body.Append(altChunk);
}

According to the Office OpenXML specification, the valid parent elements for the w:altChunk element are body, comment, docPartBody, endnote, footnote, ftr, hdr and tc. That is why I've added the w:altChunk to the body element rather than to a paragraph or run.

For more information on the w:altChunk element see this MSDN link.
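
Since the failure described at the start of this answer comes from feeding a bare fragment such as <h1>HELLO</h1>, one way to guard against it is to wrap fragments before calling FeedData. The following is only a minimal sketch using the .NET base class library: `HtmlChunk` and `EnsureFullDocument` are hypothetical names (not part of the Open XML SDK), and the check for an existing `<html` tag is deliberately naive.

```csharp
using System;

public static class HtmlChunk
{
    // Hypothetical helper, not an Open XML SDK API: wraps a bare HTML fragment
    // in a minimal <html>/<body> skeleton. Input that already contains an
    // <html tag is returned unchanged.
    public static string EnsureFullDocument(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));

        // Naive, case-insensitive check for an existing <html ...> element.
        bool hasHtmlTag = html.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0;

        return hasHtmlTag
            ? html
            : "<html><head></head><body>" + html + "</body></html>";
    }
}
```

With this in place, the stream in the code above could be built from HtmlChunk.EnsureFullDocument(someHtml) instead of a hard-coded string.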

EDIT

As pointed out by @user2945722, to make sure that the OpenXml library correctly interprets the byte array as UTF-8, you should add the UTF-8 preamble (byte order mark). This can be done this way:

MemoryStream ms = new MemoryStream(new UTF8Encoding(true).GetPreamble().Concat(Encoding.UTF8.GetBytes(htmlEncodedString)).ToArray());

This will prevent your é's from being rendered as Ã©'s, your ä's as Ã¤'s, etc.
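
To make the preamble fix concrete, here is a small sketch that uses only the .NET base class library. `Utf8Bom`, `WithPreamble` and `MisreadAsLatin1` are hypothetical names introduced for illustration. The second method only imitates a consumer that guesses a single-byte encoding; that Word falls back to exactly this encoding when the preamble is missing is an assumption, not something confirmed above.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8Bom
{
    // Hypothetical helper: returns the UTF-8 bytes of `html`, prefixed with
    // the three-byte UTF-8 preamble (EF BB BF).
    public static byte[] WithPreamble(string html)
    {
        byte[] preamble = new UTF8Encoding(true).GetPreamble();
        return preamble.Concat(Encoding.UTF8.GetBytes(html)).ToArray();
    }

    // Illustration only: decodes the UTF-8 bytes of `text` as ISO-8859-1,
    // which is how multi-byte characters turn into two-character garbage.
    public static string MisreadAsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }
}
```

Utf8Bom.WithPreamble("é") yields the five bytes EF BB BF C3 A9, and Utf8Bom.MisreadAsLatin1("é") returns "Ã©", which matches the garbling described above.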



Answer 2:

I had the same problem here, but with a totally different cause. This is worth a try if the accepted solution doesn't help: close the document after saving. In my case, that was the difference between a corrupt and a clean docx file. Oddly, most other operations work with only a Save() and program exit.

String cid = "chunkid";
WordprocessingDocument document = WordprocessingDocument.Open("somefile.docx", true);
Body body = document.MainDocumentPart.Document.Body;
MemoryStream ms = new MemoryStream(System.Text.Encoding.UTF8.GetBytes("<html><head></head><body>hi</body></html>"));
AlternativeFormatImportPart formatImportPart = document.MainDocumentPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Html, cid);
formatImportPart.FeedData(ms);
AltChunk altChunk = new AltChunk();
altChunk.Id = cid;
document.MainDocumentPart.Document.Body.Append(altChunk);
document.MainDocumentPart.Document.Save();
// here's the magic!
document.Close();
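
A variation on the same idea is to let a using statement dispose of the document, so the package is closed even if an exception is thrown partway through. This is a sketch, not a tested fix for the corruption: `AltChunkExample` and `AppendHtml` are hypothetical names, the code needs the DocumentFormat.OpenXml package, and it assumes the caller passes a complete HTML document and a chunk id that is not already used in the file.

```csharp
using System.IO;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class AltChunkExample
{
    // Hypothetical helper: appends `fullHtml` to the body of an existing
    // .docx file as an altChunk. `chunkId` is assumed to be unused.
    public static void AppendHtml(string docxPath, string fullHtml, string chunkId)
    {
        // Leaving the using block disposes the document, which closes the
        // underlying package; no explicit Close() call is needed.
        using (WordprocessingDocument document = WordprocessingDocument.Open(docxPath, true))
        using (MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(fullHtml)))
        {
            MainDocumentPart mainPart = document.MainDocumentPart;

            // Create the import part and feed it the HTML bytes.
            AlternativeFormatImportPart part = mainPart.AddAlternativeFormatImportPart(
                AlternativeFormatImportPartType.Html, chunkId);
            part.FeedData(ms);

            // Reference the import part from the body, then save the main part.
            mainPart.Document.Body.Append(new AltChunk { Id = chunkId });
            mainPart.Document.Save();
        }
    }
}
```

A call such as AltChunkExample.AppendHtml("somefile.docx", "<html><head></head><body>hi</body></html>", "chunkid") would mirror the snippet above.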