Is there a PHP implementation of markdown suitable for using in public comments?
Basically it should only allow a subset of the markdown syntax (bold, italic, links, block-quotes, code-blocks and lists), and strip out all inline HTML (or possibly escape it?)
I guess one option is to use the normal markdown parser, and run the output through an HTML sanitiser, but is there a better way of doing this..?
We're using PHP markdown Extra for the rest of the site, so we'd already have to use a secondary parser (the non-"Extra" version, since things like footnote support is unnecessary).. It also seems nicer parsing only the *bold*
text and having everything escaped to <a href="etc">
, than generating <b>bold</b>
text and trying to strip the bits we don't want..
Also, on a related note, we're using the WMD control for the "main" site, but for comments, what other options are there? WMD's javascript preview is nice, but it would need the same "neutering" as the PHP markdown processor (it can't display images and so on, otherwise someone will submit and their working markdown will "break")
Currently my plan is to use the PHP-markdown -> HTML santiser method, and edit WMD to remove the image/heading syntax from showdown.js
- but it seems like this has been done countless times before..
Basically:
- Is there a "safe" markdown implementation in PHP?
- Is there a HTML/javascript markdown editor which could have the same options easily disabled?
Update: I ended up simply running the markdown()
output through HTML Purifier.
This way the Markdown rendering was separate from output sanitisation, which is much simpler (two mostly-unmodified code bases) more secure (you're not trying to do both rendering and sanitisation at once), and more flexible (you can have multiple sanitisation levels, say a more lax configuration for trusted content, and a much more stringent version for public comments)
PHP Markdown has a sanitizer option, but it doesn't appear to be advertised anywhere. Take a look at the top of the Markdown_Parser
class in markdown.php
(starts on line 191 in version 1.0.1m). We're interested in lines 209-211:
# Change to `true` to disallow markup or entities.
var $no_markup = false;
var $no_entities = false;
If you change those to true
, markup and entities, respectively, should be escaped rather than inserted verbatim. There doesn't appear to be any built-in way to change those (e.g., via the constructor), but you can always add one:
function do_markdown($text, $safe=false) {
$parser = new Markdown_Parser;
if ($safe) {
$parser->no_markup = true;
$parser->no_entities = true;
}
return $parser->transform($text);
}
Note that the above function creates a new parser on every run rather than caching it like the provided Markdown
function (lines 43-56) does, so it might be a bit on the slow side.
JavaScript Markdown Editor Hypothesis:
- Use a JavaScript-driven Markdown Editor, e.g., based on showdown
- Remove all icons and visual clues from the Toolbar for unwanted items
- Set up a JavaScript filter to clean-up unwanted markup on submission
- Test and harden all JavaScript changes and filters locally on your computer
- Mirror those filters in the PHP submission script, to catch same on the server-side.
- Remove all references to unwanted items from Help/Tutorials
I've created a Markdown editor in JavaScript, but it has enhanced features. That took a big chunk of time and SVN revisions. But I don't think it would be that tough to alter a Markdown editor to limit the HTML allowed.
If you're looking to write your own parser, why not use the BBCode architecture.
When submitting your/(user) comments you need to sanitize the text with mysql_escape_real_string(), yes there other functions but this will stop any JS Injections.
How about running htmlspecialchars on the user entered input, before processing it through markdown? It should escape anything dangerous, but leave everything that markdown understands.
I'm trying to think of a case where this wouldn't work but can't think of anything off hand.