Is there a program or workflow to convert .doc or .docx files to Markdown or similar text?
PS: Ideally, I would welcome the option that a specific font (e.g. Consolas) in the MS Word document will be rendered to text-code: ```....```.
Pandoc supports conversion from docx to markdown directly:
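A minimal sketch of such a conversion, wrapped in a small shell function. The helper name `docx_to_md` and the default of `gfm` are my own choices for illustration; only the `pandoc` options `-f`, `-t` and `-o` come from pandoc itself.

```shell
# Sketch: convert a .docx file to Markdown with pandoc.
# docx_to_md is a hypothetical helper name, not part of pandoc.
docx_to_md() {
  input="$1"
  format="${2:-gfm}"            # Markdown variant passed to pandoc's -t option
  output="${input%.docx}.md"    # report.docx -> report.md
  pandoc -f docx -t "$format" "$input" -o "$output"
}
```

For example, `docx_to_md report.docx` should write `report.md` as GitHub-Flavored Markdown, and `docx_to_md report.docx markdown_strict` picks a different variant.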
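Several markdown formats are supported as output, including gfm (GitHub-Flavored Markdown), markdown_mmd (MultiMarkdown), markdown (pandoc's extended Markdown), markdown_strict (original unextended Markdown), markdown_phpextra (PHP Markdown Extra) and commonmark (CommonMark).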
Options
Which Conversion Tools?
I've tested these three:
(1)-Pandoc / (2)-Mammoth / (3)-w2m
Pandoc
By far the superior tool for conversions, with support for a multitude of file types (see Pandoc's man page for supported file types):
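Besides the man page, the installed pandoc can report its own formats. Here is a small sketch that checks whether a given output format is available; the helper name `pandoc_can_write` is hypothetical, while `--list-output-formats` is a pandoc option (older releases may not have it).

```shell
# Sketch: check whether the installed pandoc can write a given format.
# pandoc_can_write is a hypothetical helper name, not part of pandoc.
pandoc_can_write() {
  # --list-output-formats prints one format name per line;
  # grep -x matches the whole line, -F treats the name literally
  pandoc --list-output-formats | grep -qxF -- "$1"
}
```

Typical use would be something like `pandoc_can_write gfm && pandoc -f docx -t gfm somedoc.docx -o somedoc.md`, where `somedoc.docx` is just a placeholder file name.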
NB
To get pandoc to export markdown tables ('pipe_tables' in pandoc), use the MultiMarkdown (markdown_mmd) or gfm output formats.
If formatting to PDF, pandoc uses LaTeX templates for this, so you may need to install the LaTeX package for your OS if that command does not work out of the box. Instructions at LaTeX Installation
Which WYSIWYG Editors?
Writeage
In answer to this specific question (docx --> markdown), use the Writeage plugin for Microsoft Word. It also works the other way round (markdown --> docx).
If you wish to preserve Unicode characters and emojis and keep good fonts, you'll get some mileage from the editors below when using copy-and-paste operations between file formats. Note that these do not read or write natively to docx.
Update: A4 vs US Letter
For outside the US, set the geometry variable:
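A minimal sketch of a PDF export on A4 paper, assuming a working LaTeX installation. The helper name `md_to_pdf_a4` is made up for illustration; `-V geometry:a4paper` sets the `geometry` variable that pandoc's LaTeX template hands to the geometry package.

```shell
# Sketch: render Markdown to PDF on A4 paper (needs a LaTeX install).
# md_to_pdf_a4 is a hypothetical helper name, not part of pandoc.
md_to_pdf_a4() {
  input="$1"
  output="${input%.md}.pdf"     # notes.md -> notes.pdf
  # -V geometry:a4paper sets the template variable used for the page size
  pandoc -s -V geometry:a4paper "$input" -o "$output"
}
```

Calling `md_to_pdf_a4 notes.md` should then produce `notes.pdf`; dropping the `-V` option falls back to the template's default page size.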
Footnote
It's worth mentioning here, since it is not obvious when first discovering Markdown, that MultiMarkdown is by far the most feature-rich Markdown format, supporting, amongst other things, metadata, tables of contents, footnotes, maths, tables and YAML.
But GitHub's default format is gfm, which also supports tables. I use gfm for GitHub/GitLab and MultiMarkdown for everything else.
For bulleted lists you can paste a list into Sublime Text and use multiselect (tested) or find and replace (not tested) to replace e.g. the proprietary MS Word characters with -, -- etc.
This doesn't work with headings, but it may be possible to use a similar technique with other elements.
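If you prefer the command line to an editor, the same find-and-replace idea can be sketched with `sed`. This is only an illustration: the helper name `word_bullets_to_md` and the set of bullet characters (•, ·, ▪) are assumptions, so adjust them to whatever your pasted text actually contains.

```shell
# Sketch: turn pasted Word bullet characters into Markdown list markers.
# word_bullets_to_md is a hypothetical helper; the bullet set (•, ·, ▪)
# is an assumption, extend it for your own documents.
word_bullets_to_md() {
  # Keep any leading indentation, replace the bullet with "- "
  sed -E 's/^([[:space:]]*)(•|·|▪)[[:space:]]*/\1- /'
}
```

It reads from standard input, so something like `word_bullets_to_md < pasted.txt > list.md` would do the replacement in one pass (both file names are placeholders).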
Word to Markdown might be worth a shot, or the procedure described here using Calibre and Pandoc via HTMLZ. Here's a bash script they use:
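The Calibre route can be sketched roughly as follows. Treat this as an approximation under stated assumptions rather than the referenced script itself: it assumes Calibre's `ebook-convert`, `unzip` and `pandoc` are on your PATH, that the HTMLZ archive contains an `index.html`, and the function name `docx_to_md_via_htmlz` is hypothetical.

```shell
# Sketch of the Calibre + pandoc route: docx -> HTMLZ -> HTML -> Markdown.
# Assumes ebook-convert (Calibre), unzip and pandoc are on PATH;
# docx_to_md_via_htmlz is a hypothetical helper name.
docx_to_md_via_htmlz() {
  input="$1"
  output="${2:-output.md}"
  tmp=$(mktemp -d) || return 1
  # Calibre writes an HTMLZ archive: a zip file holding index.html
  ebook-convert "$input" "$tmp/output.htmlz" &&
    unzip -q "$tmp/output.htmlz" -d "$tmp" &&
    pandoc -f html -t markdown -o "$output" "$tmp/index.html"
  status=$?
  rm -rf "$tmp"                 # clean up the temporary directory
  return "$status"
}
```

Usage would be along the lines of `docx_to_md_via_htmlz mydoc.docx mydoc.md`; the temporary directory is removed whether or not the conversion succeeds.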
Mammoth is best known as a Word to HTML converter but it now supports a Markdown writer module. When I last checked, Mammoth Markdown support was still in its early stages, so you may find some features are unsupported. As usual ... check the website for the latest details.
Install
To use the JavaScript version, install Node.js and then install Mammoth:
Command line
Command line to convert a Word document to Markdown ...
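As a rough sketch, installation and conversion could look like this. The wrapper name `docx_to_md_mammoth` is hypothetical; `npm install -g mammoth` and the `--output-format=markdown` flag are taken from Mammoth's own command line, which prints to standard output when no output path is given.

```shell
# Sketch: convert a Word document to Markdown with the Mammoth CLI.
# Install first with: npm install -g mammoth
# docx_to_md_mammoth is a hypothetical helper name, not part of Mammoth.
docx_to_md_mammoth() {
  input="$1"
  output="${input%.docx}.md"    # document.docx -> document.md
  if ! command -v mammoth >/dev/null 2>&1; then
    echo "mammoth not found; install it with: npm install -g mammoth" >&2
    return 127
  fi
  # With no output path, mammoth writes the result to stdout
  mammoth "$input" --output-format=markdown > "$output"
}
```

Running `docx_to_md_mammoth document.docx` should leave a `document.md` next to the original file.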
API
NodeJS API to convert to Markdown ...
Features:
Mammoth Markdown writer currently supports:
The Mammoth command line tools and API have been ported to several languages:
With NO Markdown (May 2016):
With Markdown:
You can convert Word documents from within MS Word to Markdown using this Visual Basic Script:
https://gist.github.com/hawkrives/2305254
Follow the instructions under "To use the code" to create a new Macro in Word.
Note: This converts the currently open Word document to Markdown, which removes all the Word formatting (headings, lists, etc.). First save the Word document you plan to convert, and then save the document again as a new document before running the macro. This way you can always go back to the original Word document to make changes.
There are more examples of Word to markdown VB scripts here:
https://www.mediawiki.org/wiki/Microsoft_Word_Macros