可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

Ok, so I have been reading about markdown here on SO and elsewhere and the steps between user-input and the db are usually given as

convert markdown to html
sanitize html (w/whitelist)
insert into database

but to me it makes more sense to do the following:

sanitize markdown (remove all tags - no exceptions)
convert to html
insert into database

Am I missing something? This seems to me to be pretty nearly xss-proof

回答1:

Please see this link:

http://michelf.com/weblog/2010/markdown-and-xss/

> hello <a name="n"
> href="javascript:alert('xss')">*you*</a>

Becomes

<blockquote>
 <p>hello <a name="n"
 href="javascript:alert('xss')"><em>you</em></a></p>
</blockquote>

∴ you must sanitize after converting to HTML.

回答2:

There are two issues with what you've proposed:

I don't see a way for your users to be able to format posts. You took advantage of Markdown to provide nice numbered lists, for example. In the proposed no-tags-no-exceptions world, I'm not seeing how the end user would be able to do such a thing.
Considerably more important: When using Markdown as the "native" formatting language, and whitelisting the other available tags,you are limiting not just the input side of the world, but the output as well. In other words, if your display engine expects Markdown and only allows whitelisted content out, even if (God forbid) somebody gets to the database and injects some nasty malware-laden code into a bunch of posts, the actual site and its users are protected because you are sanitizing it upon display, as well.

There are some good resources on the web about output sanitization:

Sanitizing user data: Where and how to do it
Output sanitization (One of my clients, who shall remain nameless and whose affected system was not developed by me, was hit with this exact worm. We have since secured those systems, of course.)
BizTech: Best Practices: Never heard of XSS?

回答3:

Well certainly removing/escaping all tags would make a markup language more secure. However the whole point of Markdown is that it allows users to include arbitrary HTML tags as well as its own forms of markup(*). When you are allowing HTML, you have to clean/whitelist the output anyway, so you might as well do it after the markdown conversion to catch everything.

*: It's a design decision I don't agree with at all, and one that I think has not proven useful at SO, but it is a design decision and not a bug.

Incidentally, step 3 should be ‘output to page’; this normally takes place at the output stage, with the database containing the raw submitted text.

回答4:

insert into database
convert markdown to html
sanitize html (w/whitelist)

perl

use Text::Markdown ();
use HTML::StripScripts::Parser ();

my $hss = HTML::StripScripts::Parser->new(
   {
       Context         => 'Document',
       AllowSrc        => 0,
       AllowHref       => 1,
       AllowRelURL     => 1,
       AllowMailto     => 1,
       EscapeFiltered  => 1,
   },
   strict_comment => 1,
   strict_names   => 1,
);

$hss->filter_html(Text::Markdown::markdown(shift))

回答5:

convert markdown to html

sanitize html (w/whitelist)

insert into database

Here, the assumptions are

Given dangerous HTML, the sanitizer can produce safe HTML.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.

sanitize markdown (remove all tags - no exceptions)

convert to html

insert into database

Here the assumptions are

Given dangerous markdown, the sanitizer can produce markdown that when converted to HTML by a different program will be safe.
The definition of safe HTML will not change, so if it is safe when I insert it into the DB, it is safe when I extract it.

The markdown sanitizer has to know not just about dangerous HTML and dangerous markdown, but how the markdown->HTML converter does its job. That makes it more complex, and more likely to be wrong than the simpler unsafeHTML->safeHTML function above.

As a concrete example, "remove all tags" assumes you can identify tags, and would not work against UTF-7 attacks. There might be other encoding attacks out there that render this assumption moot, or there might be a bug that causes the markdown->HTML program to convert (full-width '<', exotic white-space characters stripped by markdown, SCRIPT) into a <script> tag.

The most secure would be:

sanitize markdown (remove all tags - no exceptions)
convert markdown to HTML
sanitize HTML
insert into a DB column marked risky
re-sanitize HTML every time you fetch that column from the DB

That way, when you update your HTML sanitizer you get protection against any newly discovered attacks. This is often inefficient, but you can get pretty good security by storing a timestamp with HTML inserted so that you can tell which might have been inserted during the time when someone knew about an attack that gets past your sanitizer.