From HTML <figure> and <figcaption> to Microsoft Word

Posted 2019-02-17 08:28

Question:

I have an HTML file with the figure, img and figcaption tags, and I would like to convert it to a Microsoft Word document.

The image referenced by img should be inserted into the Word document, and the figcaption should be converted to its caption (keeping the figure number as well).

I have tried to open the HTML file with Word 2013, but the figcaption is not converted to a figure caption; it is just plain text below the image.

Is there a minimal working sample that does this? I had a look at https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats#Word_XML_Format_example but it is too verbose to extract a simple Hello World sample from.

figure .image {
    width: 100%;
}

figure {
    text-align: center;
    display: table;
    max-width: 30%; /* demo; set some amount (px or %) if you can */
    margin: 10px auto; /* not needed unless you want centered */
}
article {
  counter-reset: figures;
}

figure {
  counter-increment: figures;
}

figcaption:before {
  content: "Fig. " counter(figures) " - "; /* For I18n support; use data-counter-string. */
}
<figure>
<p><img class="image" src="https://upload.wikimedia.org/wikipedia/commons/c/ca/Matterhorn002.jpg"></p>
<figcaption>Il monte Cervino.</figcaption>
</figure>

<figure>
<p><img class="image" src="https://upload.wikimedia.org/wikipedia/commons/2/26/Banner_clouds.jpg"></p>
<figcaption>La nuvola che spesso è vicino alla vetta.</figcaption>
</figure>

I tried pandoc on Windows:

pandoc -f html -t docx -o hello.docx hello.html

but with no luck: the "Fig. 1" and "Fig. 2" prefixes are missing from the generated document.

My pandoc version is:

c:\temp>.\pandoc.exe -v
pandoc.exe 1.19.2.1
Compiled with pandoc-types 1.17.0.4, texmath 0.9, skylighting 0.1.1.4
Default user data directory: C:\Users\ale\AppData\Roaming\pandoc
Copyright (C) 2006-2016 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

Edit 1

It is also fine to use some C# to get it done. Maybe I can transform the HTML into one of the Word XML formats with a C# program.

Answer 1:

This may be more roundabout than you would like, but you can save the file as a PDF and then export that PDF to Word. (I went into Adobe and created a PDF from an HTML file containing figure/figcaption, but you could obviously do that programmatically.) It is perhaps one step too many, but it does work!

Hope this is of some assistance (perhaps a PDF would do?)

EDIT 1: I just found a jQuery plugin by Mark Windsoll which converts HTML to Word. I made a CodePen that includes figure/figcaption here. When you press the button, it exports the content as a Word document. (I suppose you could save it too, but his original CodePen didn't actually do anything when you clicked the link that said "export to doc"... sigh.)

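// Wait for the DOM, then wire the export button to the plugin.
jQuery(document).ready(function ($) {
    $(".word-export").click(function (event) {
        // wordExport() is provided by jquery.wordexport.js (loaded below).
        $("#page-content").wordExport();
    });
});
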
img { width: 300px; height: auto; }
figcaption { width: 350px; text-align: center; }
h1 { margin-top: 10px; }
h1, h2 { margin-left: 35px; }
p { width: 95%; padding-top: 20px; margin: 0px auto; }
button { margin: 15px 30px; padding: 5px; }

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<script src="https://www.jqueryscript.net/demo/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin/FileSaver.js"></script>
<script src="https://www.jqueryscript.net/demo/Export-Html-To-Word-Document-With-Images-Using-jQuery-Word-Export-Plugin/jquery.wordexport.js"></script>

<link href="https://www.jqueryscript.net/css/jquerysctipttop.css" rel="stylesheet"/>

<h1>jQuery Word Export Plugin Demo</h1>
<div id="page-content">
<h2>Lovely Trees</h2>
<figure>
  <img src="http://www.rachelgallen.com/images/autumntrees.jpg">
  <figcaption>Autumn Trees</figcaption>
</figure>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec vehicula bibendum lacinia. Pellentesque placerat interdum nisl non semper. Integer ornare, nunc non varius mattis, nulla neque venenatis nibh, vitae cursus risus quam ut nulla. Aliquam erat volutpat. Aliquam erat volutpat. </p>
  <p>And some more text here, but that's quite enough lorem ipsum rubbish!</p>
</div>
<button class="word-export"> Export as .doc </button>
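
The plugin's source is not shown in this answer, so the following is not its implementation. It is only a rough sketch of the general idea behind this kind of export: Word will open an HTML document that has been saved with a .doc extension. The helper name wrapForWord and the file name export.doc are made up for illustration, and the sketch assumes the title is plain text. Images referenced by URL are not embedded by this sketch, and a figcaption stays ordinary text in the result.

```javascript
// wrapForWord is a hypothetical helper (not part of jquery.wordexport.js).
// It wraps an HTML fragment in a complete document that declares the
// Office namespaces Word writes in its own "Web Page" output.
function wrapForWord(fragment, title) {
  return [
    '<html xmlns:o="urn:schemas-microsoft-com:office:office"',
    '      xmlns:w="urn:schemas-microsoft-com:office:word"',
    '      xmlns="http://www.w3.org/TR/REC-html40">',
    "<head>",
    '<meta charset="utf-8">',
    `<title>${title}</title>`, // assumption: title contains no markup
    "</head>",
    `<body>${fragment}</body>`,
    "</html>",
  ].join("\n");
}

const fragment =
  '<figure><img src="autumntrees.jpg"><figcaption>Autumn Trees</figcaption></figure>';
const wordHtml = wrapForWord(fragment, "Lovely Trees");
console.log(wordHtml);
// In Node, the string could then be saved with:
// require("fs").writeFileSync("export.doc", wordHtml);
```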

EDIT 2: To convert HTML to Word using C# you can use GemBox.Document, which is free unless you buy the professional version (you can use it for free for a while to evaluate it).

The C# code is

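using GemBox.Document;
ComponentInfo.SetLicense("FREE-LIMITED-KEY"); // must be set before use; this key selects the free mode
// Convert HTML to Word (DOCX) document.
DocumentModel.Load("Document.html").Save("Document.docx");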

Rachel



Answer 2:

I have never used pandoc; I guess it does not currently support many advanced CSS3 features, such as the CSS counter that generates your figure numbers.

1. Using Aspose.Words

I copied your CSS and HTML into a file named figure.htm and used Aspose.Words to convert it; it works just as you hoped.

I used C# code like the following:

using Aspose.Words;

        Document doc = new Document();
        DocumentBuilder builder = new DocumentBuilder(doc);
        // Read the whole HTML file and insert it into the (initially empty) document.
        using (System.IO.StreamReader sr = new System.IO.StreamReader("./figure.htm"))
        {
            string html = sr.ReadToEnd();
            builder.InsertHtml(html);
        }

        // The output format is inferred from the file extension (.doc here).
        doc.Save("d:\\DocumentBuilder.InsertTableFromHtml Out.doc");

My Aspose.Words version is 16.7.0.0.

2. Format the figcaption tag

There is another way that lets you keep using pandoc: preprocess the HTML file to fix its format before you convert it. The root problem in your question is that pandoc does not handle many advanced CSS3 features, so if you write the figure numbers into the HTML yourself, the conversion works well too.

Here is some test code that uses regular expressions (System.Text.RegularExpressions). After you run the code below, figure1.htm is a new HTML file in which the inner HTML of every figcaption has been rewritten in a fixed format.

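using System.Text;
using System.Text.RegularExpressions;

        // Match each <figcaption>...</figcaption> directly, so that attributes,
        // multi-line content and captions nested on the same line as <figure> all work.
        Regex regex = new Regex("<figcaption(?<attrs>[^>]*)>(?<html>.*?)</figcaption>",
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
        using (System.IO.StreamReader sr = new System.IO.StreamReader("./figure.htm", Encoding.UTF8))
        {
            string html = sr.ReadToEnd();
            int i = 1; // running figure number, replaces the CSS counter

            string newHtml = regex.Replace(html, new MatchEvaluator((m) =>
            {
                string attrs = m.Groups["attrs"].Value;
                string text = m.Groups["html"].Value;
                // Write the prefix that the CSS would have generated into the caption itself.
                return $"<figcaption{attrs}>Fig. {i++} - {text}</figcaption>";
            }));

            using (System.IO.StreamWriter sw = new System.IO.StreamWriter("./figure1.htm", false, Encoding.UTF8))
            {
                sw.Write(newHtml); // disposing the writer flushes it
            }
        }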
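
For anyone who would rather run this preprocessing step in Node before calling pandoc, the same idea can be sketched in JavaScript. This is only an illustration under a few assumptions: numberFigcaptions is a made-up helper name, each figure is assumed to have exactly one figcaption, and the counter restarts at every article start tag to mimic the counter-reset rule in the question's CSS.

```javascript
// Sketch: write the "Fig. N - " prefix into each <figcaption> so that it
// survives conversion by tools that ignore CSS counters.
// numberFigcaptions is a hypothetical helper, not part of pandoc or any library.
function numberFigcaptions(html) {
  let n = 0; // mirrors the CSS counter "figures"
  const pattern =
    /<article\b[^>]*>|<figcaption\b([^>]*)>([\s\S]*?)<\/figcaption>/gi;
  return html.replace(pattern, (match, attrs, inner) => {
    if (inner === undefined) {
      // An <article> start tag matched: mirror "counter-reset: figures".
      n = 0;
      return match;
    }
    n += 1; // assumption: one figcaption per figure
    return `<figcaption${attrs}>Fig. ${n} - ${inner}</figcaption>`;
  });
}

const sample =
  '<figure><p><img src="a.jpg"></p><figcaption>Il monte Cervino.</figcaption></figure>\n' +
  '<figure><p><img src="b.jpg"></p><figcaption>La nuvola.</figcaption></figure>';
console.log(numberFigcaptions(sample));
```

The numbered HTML can then be written to a file and passed to the same pandoc command as before.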

I hope my answer helps!



Answer 3:

Pandoc already downloads the images and embeds them in the docx file with the command you posted.

I've just implemented and submitted a pull request that parses the figure and figcaption HTML elements properly. It has now been merged into master, so it will be in the nightly builds shortly, or later in pandoc 2.0. With that code, your example generates a docx file in which the caption text has the paragraph style "Image Caption".



Answer 4:

To expand on Rachel Gallen's excellent find: the following is code that I think might be used to run the converter on a string that contains a full HTML page generated by the Loop.

Would this work to convert the output of a process that creates a page (the Loop)? (JavaScript and CSS are loaded with wp_enqueue.. commands before this code is called.)

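    <?php
    $x = $post_output;  // $post_output contains an HTML page with doctype/head/body/etc. that was generated by the Loop
    $dom = new DOMDocument;
    libxml_use_internal_errors(true); // suppress errors (true collects them internally instead of printing them)
    $dom->loadHTML($x, LIBXML_NOERROR); // suppress errors
    $body = $dom->getElementsByTagName('body')->item(0);
?>
<!-- wordExport() is a jQuery plugin method, so it needs rendered markup; a PHP object is not visible to JavaScript. -->
<div id="page-content"><?php foreach ($body->childNodes as $node) { echo $dom->saveHTML($node); } ?></div>
<script type="text/javascript">
    // Assumes jQuery and jquery.wordexport.js are already loaded at this point.
    jQuery("#page-content").wordExport();
</script>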

...Rick...