What are the options for parsing a Markdown document and processing its elements to output another Markdown document?
Let's say this document
```
# unaffected #
```
# H1 #
H1
==
## H2 ##
H2
--
### H3 ###
should be converted to
```
# unaffected #
```
## H1 ##
H1
--
### H2 ###
### H2 ###
#### H3 ####
in a Node environment. The target element may vary (e.g. #### may be converted to **).
The document may contain other markup elements that should remain unaffected.
How can this be achieved? Obviously not with regexps (using a regexp instead of a full-blown lexer will affect # unaffected #). I was hoping to use marked, but it seems that it is only capable of HTML output, not Markdown.
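To make the regexp concern concrete, here is a minimal sketch of what a naive line-based replacement does. The function name naiveShift and the sample string are only illustrations, not part of any library:

```javascript
// Naive approach: add one "#" to every line that starts with 1-5 hashes
// followed by a space. It knows nothing about code blocks.
function naiveShift(markdown) {
  return markdown.replace(/^(#{1,5}) /gm, '#$1 ');
}

const sample = ['```', '# unaffected #', '```', '# H1 #'].join('\n');
const shifted = naiveShift(sample);
console.log(shifted);
```

The line inside the fenced block is rewritten along with the real heading, which is exactly the side effect I want to avoid.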
Have you considered using HTML as an intermediate format? Once in HTML, the different header styles become indistinguishable, so the Markdown -> HTML conversion will effectively normalize them for you. There are plenty of Markdown -> HTML converters, and also a number of HTML -> Markdown converters.
I put together an example using these two packages:
- https://www.npmjs.com/package/markdown-it for Markdown -> HTML
- https://www.npmjs.com/package/h2m for HTML -> Markdown
I don't know if you have any performance requirements here (read: this is slow...) but this is a very low investment solution. Take a look:
// Markdown -> HTML with markdown-it, then HTML -> Markdown with h2m
var md = require('markdown-it')();
var h2m = require('h2m');
var mdContent = `
\`\`\`
# unaffected #
\`\`\`
# H1 #
H1
==
## H2 ##
H2
--
### H3 ###
`;
var htmlContent = md.render(mdContent);
var newMdContent = h2m(htmlContent, {converter: 'MarkdownExtra'});
console.log(newMdContent);
You may have to play with a mix of components to get the correct dialect support and whatnot. I tried a bunch and couldn't quite match your output. I think perhaps the -- is being interpreted differently? Here's the output; I'll let you decide if it is good enough:
```
# unaffected #
```
# H1 #
# H1 #
## H2 ##
## H2 ##
### H3 ###
Here is a solution using an external Markdown parser, pandoc. It allows custom filters written in Haskell or Python to modify the input (there is also a Node.js port). Here is a Python filter that increases every header by one level. Let's save it as header_increase.py.
from pandocfilters import toJSONFilter, Header

def header_increase(key, value, format, meta):
    # value is [level, attributes, inline content]; levels only go up to 6
    if key == 'Header' and value[0] < 6:
        value[0] = value[0] + 1
        return Header(value[0], value[1], value[2])

if __name__ == "__main__":
    toJSONFilter(header_increase)
It will not affect the code block. However, it might transform setext-style headers for h1 and h2 elements (using === or ---) into atx-style headers (using #), and vice versa.
To use the script, one could call pandoc from the command line:
pandoc input.md --filter header_increase.py -o output.md -t markdown
With Node.js, you could use pdc to call pandoc.
var pdc = require('pdc');

// input_md holds the Markdown source as a string
pdc(input_md, 'markdown', 'markdown', [ '--filter', './header_increase.py' ], function(err, result) {
    if (err)
        throw err;
    console.log(result);
});
Despite its apparent simplicity, Markdown is actually somewhat complicated to parse. Each part builds upon the next, such that to cover all edge cases you need a complete parser even if you only want to process a portion of a document.
For example, various types of block level elements can be nested inside other block level elements (lists, blockquotes, etc). Most implementations rely on a very specific order of events within the parser to ensure that the entire document is parsed correctly. If you remove one of the earlier pieces, many of the later pieces will break. For example, Markdown markup inside code blocks is not parsed as Markdown because one of the first steps is to find and identify the code blocks so that later steps in the parsing never see the code blocks.
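As a rough illustration of that ordering, the sketch below identifies fenced code blocks first and only then looks for headings. The function name shiftHeadings is made up for this example, and the code assumes fences are ``` or ~~~ at the very start of a line:

```javascript
// Shift ATX headings one level down, skipping fenced code blocks.
// NOT handled (by design of this sketch): indented code blocks, setext
// headings, and anything nested inside lists or blockquotes.
function shiftHeadings(markdown) {
  let fence = null; // character that opened the current fenced block, if any
  return markdown.split('\n').map((line) => {
    // Step 1: track fences before anything else looks at the line.
    const m = /^(`{3,}|~{3,})/.exec(line);
    if (m) {
      if (fence === null) fence = m[1][0];       // opening fence
      else if (m[1][0] === fence) fence = null;  // matching closing fence
      return line;
    }
    if (fence !== null) return line; // inside code: copy verbatim
    // Step 2: ATX heading = 1-5 hashes, space, text, optional closing hashes.
    const h = /^(#{1,5}) +(.*?)( +#+)? *$/.exec(line);
    if (!h) return line;
    const hashes = h[1] + '#';
    return h[3] ? `${hashes} ${h[2]} ${hashes}` : `${hashes} ${h[2]}`;
  }).join('\n');
}

console.log(shiftHeadings('```\n# unaffected #\n```\n# H1 #\n### H3 ###'));
```

Even this small version already depends on doing the fence check before the heading check, and it still leaves setext headings and nested blocks untouched.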
Therefore, to accomplish your goal and cover all possible edge cases, you need a complete Markdown parser. However, as you do not want to output HTML, your options are somewhat limited and you will need to do some work to get a working solution.
There are basically three styles of Markdown parsers (I'm generalizing here):
- Use regex string substitution to swap out the Markdown markup for HTML markup within the source document.
- Use a renderer which gets called by the parser (in each step) as it parses the document, outputting a new document.
- Generate a tree object or list of tokens (specifics vary by implementation) which is rendered (converted to a string) to a new document in a later step.
The original reference implementation (markdown.pl) is of the first type and probably useless to you. I simply mention it for completeness.
Marked is of the second variety and, while it could be used, you would need to write your own renderer and have the renderer modify the document at the same time as you render it. While generally a performant solution, it is not always the best method when you need to modify the document, especially if you need context from elsewhere within the document. However, you should be able to make it work.
For example, to adapt an example in the docs, you might do something like this (multiplyString borrowed from here):
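var marked = require('marked');
var renderer = new marked.Renderer();

// Repeat str num times, e.g. multiplyString("#", 3) gives "###"
function multiplyString (str, num) {
    return num ? Array(num + 1).join(str) : "";
}

// Emit an ATX heading one level deeper instead of an HTML heading tag
renderer.heading = function (text, level) {
    return multiplyString("#", level + 1) + " " + text + "\n\n";
};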
Of course, you will also need to create renderers for all of the other block level renderer methods and inline level renderer methods which output Markdown syntax. See my comments below regarding renderers in general.
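To give a feel for what those other methods might look like, here is a minimal sketch of a plain object whose method names mirror a few of the renderer hooks (heading, paragraph, blockquote, code, strong, em, codespan). It is not wired into any parser; it assumes each hook receives the already-rendered text of its children, like the heading(text, level) example above, and it ignores escaping entirely:

```javascript
// Hypothetical Markdown-emitting renderer; a real one needs every hook
// (lists, links, images, tables, ...) and must undo any HTML escaping
// the parser applied to the text it passes in.
const markdownRenderer = {
  heading(text, level) {
    // one level deeper, capped at the maximum of six hashes
    return '#'.repeat(Math.min(level + 1, 6)) + ' ' + text + '\n\n';
  },
  paragraph(text) { return text + '\n\n'; },
  blockquote(quote) {
    // prefix every line of the already-rendered children with ">"
    return quote.trimEnd().split('\n')
      .map((l) => (l ? '> ' + l : '>')).join('\n') + '\n\n';
  },
  code(code) { return '```\n' + code + '\n```\n\n'; },
  strong(text) { return '**' + text + '**'; },
  em(text) { return '*' + text + '*'; },
  codespan(text) { return '`' + text + '`'; },
};

console.log(markdownRenderer.heading('H1', 1) +
  markdownRenderer.blockquote(markdownRenderer.paragraph('quoted text')));
```

Note that the choice of * for emphasis or ``` for code is made by the renderer, not taken from the source document, so the output may not match the original markup character for character.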
Markdown-JS is of the third variety (as it turns out Marked also provides a lower level API with access to the tokens so it could be used this way as well). As stated in its README:
Intermediate Representation
Internally the process to convert a chunk of Markdown into a chunk of
HTML has three steps:
- Parse the Markdown into a JsonML tree. Any references found in the parsing are stored in the attribute hash of the root node under the key references.
- Convert the Markdown tree into an HTML tree. Rename any nodes that need it (bulletlist to ul for example) and lookup any references used by links or images. Remove the references attribute once done.
- Stringify the HTML tree being careful not to wreck whitespace where whitespace is important (surrounding inline elements for example).
Each step of this process can be called individually if you need to do
some processing or modification of the data at an intermediate stage.
You could take the tree object in either step 1 or step 2 and make your modifications. However, I would recommend step 1, because the JsonML tree will more closely match the actual Markdown document, whereas the HTML tree in step 2 is a representation of the HTML to be output. Note that the HTML will lose some information regarding the original Markdown in any implementation. For example, were asterisks or underscores used for emphasis (*foo* vs. _foo_), or was an asterisk, dash (hyphen) or plus sign used as a list bullet? I'm not sure how much detail the JsonML tree holds (haven't used it personally), but it should certainly be more than the HTML tree in step 2.
Once you have made your modifications to the JsonML tree (perhaps using one of the tools listed here), you will probably want to skip step 2 and implement your own step 3, which renders (stringifies) the JsonML tree back to a Markdown document.
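A minimal sketch of that idea follows. The tree is written by hand, and the node names (markdown, header, para, em, strong, code_block) and the level attribute are my assumptions about what a JsonML tree of this kind looks like; check the actual output of whichever library you use. The helpers bumpHeaders and toMarkdown are hypothetical names:

```javascript
// Hand-written JsonML-style tree: [tag, optional attributes object, ...children]
const tree = ['markdown',
  ['header', { level: 1 }, 'H1'],
  ['para', 'Some ', ['em', 'text'], '.'],
  ['code_block', '# unaffected #'],
];

const isAttrs = (x) => typeof x === 'object' && x !== null && !Array.isArray(x);

// Modification after "step 1": bump every header level (capped at 6), in place.
function bumpHeaders(node) {
  if (!Array.isArray(node)) return;
  if (node[0] === 'header' && isAttrs(node[1])) {
    node[1].level = Math.min(node[1].level + 1, 6);
  }
  node.slice(1).forEach((child) => bumpHeaders(child));
}

// Custom "step 3": stringify the tree back to Markdown instead of HTML.
function toMarkdown(node) {
  if (typeof node === 'string') return node;
  const [tag, ...rest] = node;
  const attrs = isAttrs(rest[0]) ? rest.shift() : {};
  const inner = rest.map(toMarkdown).join(tag === 'markdown' ? '\n\n' : '');
  switch (tag) {
    case 'markdown': return inner + '\n';
    case 'header': return '#'.repeat(attrs.level) + ' ' + inner;
    case 'para': return inner;
    case 'em': return '*' + inner + '*';
    case 'strong': return '**' + inner + '**';
    case 'code_block': return inner.split('\n').map((l) => '    ' + l).join('\n');
    default: throw new Error('No Markdown rule for node type: ' + tag);
  }
}

bumpHeaders(tree);
console.log(toMarkdown(tree));
```

The sketch covers only a handful of node types and does no escaping; lists, links, references and nested blocks each need their own rule before this could handle real documents.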
And therein lies the hard part. It is very rare for Markdown parsers to output Markdown. In fact, it is very rare for Markdown parsers to output anything except HTML. The most popular exception is Pandoc, which is a document converter for many formats of input and output. But if you want to stay with a JavaScript solution, any library you choose will require you to write your own renderer which will output Markdown (unless a search turns up a renderer built by some other third party). Of course, once you do, if you make it available, others could benefit from it in the future. Unfortunately, building a Markdown renderer is beyond the scope of this answer.
One possible shortcut when building a renderer is that if the Markdown lib you use happens to store the position information in its list of tokens (or in some other way gives you access to the original raw Markdown on a per element basis), you could use that info in the renderer to simply copy and output the original Markdown text, except when you need to alter it. For example, the markdown-it lib offers that data on the Token.map and/or Token.markup properties. You still need to create your own renderer, but it should be easier to get the Markdown to look more like the original.
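The copy-unless-altered idea could be sketched roughly like this. To keep the example self-contained, the heading records below are written by hand rather than produced by a parser; I am assuming map holds [firstLine, lineAfterLast] for each heading, and depth and text are simplified fields I made up (with a real token list you would have to derive them from the tokens yourself). The function name shiftUsingPositions is hypothetical:

```javascript
const src = ['```', '# unaffected #', '```',
  '# H1 #', 'H1', '==', '### H3 ###'].join('\n');

// Hand-written stand-ins for the position info a parser would report.
const headings = [
  { map: [3, 4], markup: '#', depth: 1, text: 'H1' },   // atx heading
  { map: [4, 6], markup: '=', depth: 1, text: 'H1' },   // setext heading (2 lines)
  { map: [6, 7], markup: '###', depth: 3, text: 'H3' }, // atx heading
];

// Copy every source line verbatim, except the line ranges covered by
// a heading record, which are replaced by a deeper atx heading.
function shiftUsingPositions(source, records) {
  const lines = source.split('\n');
  const out = [];
  let cursor = 0;
  for (const r of records) {
    out.push(...lines.slice(cursor, r.map[0])); // untouched lines before it
    const hashes = '#'.repeat(Math.min(r.depth + 1, 6));
    out.push(`${hashes} ${r.text} ${hashes}`);
    cursor = r.map[1]; // skip the original heading lines
  }
  out.push(...lines.slice(cursor)); // untouched tail
  return out.join('\n');
}

console.log(shiftUsingPositions(src, headings));
```

Because the fenced block is never covered by a heading record, it is copied through unchanged. This version rewrites setext headings as atx; keeping the original style would mean branching on the markup field.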
Finally, I have not personally used, nor am I recommending any of the specific Markdown parsers mentioned above. They are simply popular examples of the various types of parsers to demonstrate how you could create a solution. You may find a different implementation which fits your needs better. A lengthy, albeit incomplete, list is here.
You must use regexps. marked itself uses regexps for parsing the document. Why don't you?
These are some of the regexps you need, from the marked.js source code on GitHub:
var block = {
  newline: /^\n+/,
  code: /^( {4}[^\n]+\n*)+/,
  fences: noop,
  hr: /^( *[-*_]){3,} *(?:\n+|$)/,
  heading: /^ *(#{1,6}) *([^\n]+?) *#* *(?:\n+|$)/,
  nptable: noop,
  lheading: /^([^\n]+)\n *(=|-){2,} *(?:\n+|$)/,
  blockquote: /^( *>[^\n]+(\n(?!def)[^\n]+)*\n*)+/,
  list: /^( *)(bull) [\s\S]+?(?:hr|def|\n{2,}(?! )(?!\1bull )\n*|\s*$)/,
  html: /^ *(?:comment *(?:\n|\s*$)|closed *(?:\n{2,}|\s*$)|closing *(?:\n{2,}|\s*$))/,
  def: /^ *\[([^\]]+)\]: *<?([^\s>]+)>?(?: +["(]([^\n]+)[")])? *(?:\n+|$)/,
  table: noop,
  paragraph: /^((?:[^\n]+\n?(?!hr|heading|lheading|blockquote|tag|def))+)\n*/,
  text: /^[^\n]+/
};
If you really, really don't want to use regexps, you can fork marked and override the Renderer object.
Marked on GitHub is split into two components: one for parsing and one for rendering. You can easily swap the renderer (compiler) for your own.
Example of one function in Render.js:
Renderer.prototype.blockquote = function(quote) {
  return '<blockquote>\n' + quote + '</blockquote>\n';
};
Maybe this is an incomplete answer.
Copy the unaffected parts into another file.
Then replace every "# " (hash followed by a space) with "## ", and every " #" (space followed by a hash) with " ##".