Question:

When you are developing a web-based application and you want to allow richly formatted text from the user you have to make a choice about how to allow that input. Many different markup languages have been created because it is arguably more difficult to sanitize HTML.

What are the advantages and disadvantages of the various different markup languages like:

HTML
Markdown
BBCode
Textile
MediaWiki markup
other

Or to put it differently, what factors do you consider when choosing to use a particular markup language?

Answer 1:

Markdown, BBCode, Textile, and MediaWiki markup are all basically the same general concept, so I would really just lump these into two categories: HTML, and plain text markup.

HTML

The deal with HTML is that the content is already in a "presentable" form for web content. That's great: it saves processing time, and it's a readily parseable language. There are dozens of libraries in pretty much any language to handle HTML content, convert between HTML and other formats, etc. The main downside is that because of the loose standards of the early web days, HTML can be incredibly variable, and you can't always depend on sane input when accepting HTML from users. As pointed out, tidying or sanitizing HTML is often very difficult, especially because it doesn't follow strict well-formedness rules the way XML does (e.g. improperly closed tags are common).

Plain Text Markup

This category is frequently used for the following reasons:

Easy to parse into multiple forms from one source - PDF, HTML, RTF
Content is stored in readable plain text (usually much easier to read than raw HTML) if needed at some later date, rather than needing to extract from the HTML
Follows specific defined rules where HTML can be annoyingly variable and unstructured
Allows you to force a subset of content formatting that's more appropriate in many cases than simply allowing full HTML
In addition, forcing a subset of formatting makes it easy to sanitize input and prevent cross-site scripting problems etc.
Keeping the "raw" data in an abstracted format means that at a later date, if you for instance wanted to convert your site from HTML 4 to XHTML, you only need to change the parsing code. With HTML formatted user input, you're stuck now having to convert all the HTML to XHTML individually, which as HTML Tidy shows, is not always a simple task. Similarly if a new markup language comes along at some point or you need to move to an alternative format (RTF, PDF, TeX) an abstracted restricted subset of text formatting options makes that a much simpler task.
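To make the "restricted subset" point concrete, here is a minimal sketch in Python. The two-tag BBCode-like syntax, the TAG_MAP table, and the function names are all made up for illustration; they are not taken from any real markup library. The idea is that the stored text stays in the abstract format, everything is escaped first, and only the known markers are turned into output tags.

```python
import html
import re

# Hypothetical BBCode-like subset: only these two markers are recognized.
TAG_MAP = {"b": "strong", "i": "em"}

def render_html(source: str) -> str:
    """Render the restricted markup as HTML."""
    # Escape first, so any raw HTML typed by the user becomes inert text.
    out = html.escape(source)
    for marker, tag in TAG_MAP.items():
        # Only complete [x]...[/x] pairs are converted; stray markers stay as text.
        pattern = re.compile(r"\[%s\](.*?)\[/%s\]" % (marker, marker), re.DOTALL)
        out = pattern.sub(r"<%s>\1</%s>" % (tag, tag), out)
    return out

def render_plain(source: str) -> str:
    """Render the same stored text as plain text by dropping the markers."""
    return re.sub(r"\[/?(?:b|i)\]", "", source)

stored = "[b]hi[/b] <script>alert(1)</script>"
print(render_html(stored))
print(render_plain("[b]hi[/b] there"))
```

Switching the output from HTML 4 to XHTML, or adding another target format, would only mean changing or adding a render function. This is a toy: mis-nested input such as [b][i]x[/b][/i] would still produce mis-nested HTML, so a real implementation would want a proper parser rather than regular expressions.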

The bottom line is what the user input is being used for. If you're planning to keep the data around and may need to shuffle formats etc., then it makes sense to use a careful abstract format to store the information. If you need to work with the raw data manually for any reason, then bonus points if that format is easily human-readable. If you're only displaying the content in a web page (or HTML doc for a report etc.) and you have no concerns about converting it or future-proofing it, then it's a reasonable practice to store it in HTML.

Answer 2:

Jeff discussed some pros and cons on codinghorror.com while they were in the initial stages of putting together SO. I thought it was a worthwhile read.

Answer 3:

@netrox the database is not the issue, the browser output is.

The only concern is the final rendering, which can be broken by the HTML inserted by the user. For example, the user could open a <li> tag but never close it, which, depending on how the page is structured, could break the entire layout that follows. Or the user could open a <strong> tag without closing it, making all the remaining content bold.
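As a rough illustration of spotting that kind of input, here is a small Python sketch built on the standard library's html.parser. The class and function names are my own, and the check is deliberately strict: HTML allows some end tags (for <li> and <p>, for instance) to be omitted, and this sketch does not model those rules.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class OpenTagTracker(HTMLParser):
    """Keeps a stack of tags that have been opened but not yet closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching open tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

def unclosed_tags(fragment: str) -> list:
    """Return the tags still open at the end of the fragment."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return tracker.open_tags

print(unclosed_tags("<strong>bold forever"))
print(unclosed_tags("<em>fine</em>"))
```

A non-empty result means the fragment would leak formatting into whatever follows it on the page, so it should be rejected or repaired before display.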

So not only must the allowed tags be validated, but how exactly do you allow some tags and not others? It is very easy to prevent the parsing of all HTML tags using PHP's htmlspecialchars() function, for example, but when it comes to allowing only some of the tags you have to look for other ways. There is PHP's strip_tags() function, which completely deletes non-allowed tags, but that means altering the user's content in a bad way, for example preventing the user from posting simple code (code to share/show, not code to process).
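The difference between escaping and stripping is easy to see side by side. This is a sketch in Python rather than PHP: html.escape() is only roughly analogous to htmlspecialchars(), and strip_all_tags() is a naive regex stand-in I wrote for illustration, not a reimplementation of strip_tags().

```python
import html
import re

def escape_all(text: str) -> str:
    """Keep every character the user typed, but neutralize the markup."""
    return html.escape(text, quote=True)

def strip_all_tags(text: str) -> str:
    """Naive stand-in for tag stripping: delete anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", text)

post = 'Use <b>bold</b> like this: <script>alert("hi")</script>'

# Escaping: the browser shows the post exactly as typed, tags included.
print(escape_all(post))

# Stripping: the tags the user wanted to show are silently gone.
print(strip_all_tags(post))
```

Escaping is lossless (unescaping gives back the original post), while stripping throws away part of what the user wrote. Neither one lets <b> through as real bold text, which is the hard part.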

Besides breaking the layout, you must consider XSS attacks, like inserting JavaScript into the href attribute of a link, which could, for example, redirect users to another site. See this long list of possible XSS attacks: https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
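For the href case specifically, one common approach is to allow only a short list of URL schemes rather than trying to block the bad ones. Below is a minimal Python sketch of that idea; the ALLOWED_SCHEMES set and the is_safe_href() name are assumptions of mine, and this covers only one of the many vectors on that cheat sheet.

```python
from urllib.parse import urlsplit

# Hypothetical policy: the only schemes a user-supplied link may use.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

def is_safe_href(href: str) -> bool:
    """Return True if the link uses an allowed scheme or is a relative URL."""
    # Drop spaces and control characters before inspecting the scheme.
    # Browsers ignore some of them, so this errs on the side of rejecting.
    cleaned = "".join(ch for ch in href if ch > " ")
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        # Malformed URLs are rejected outright.
        return False
    # An empty scheme means a relative URL, which carries no script payload.
    return scheme == "" or scheme in ALLOWED_SCHEMES

print(is_safe_href("https://example.com/page"))
print(is_safe_href("  JaVaScRiPt:alert(1)"))
print(is_safe_href("java\tscript:alert(1)"))
```

The mixed-case and embedded-tab variants are there because filters that only look for the literal string "javascript:" miss them.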

As you can see, preventing all HTML tags from being interpreted is very easy, but preventing only some of the tags is much more complicated. To understand that, you could take a look at the enormous "HTML Purifier" library, whose only purpose is to allow some HTML tags and make sure that the outputted HTML is valid (i.e. won't break the page) and free of XSS attacks.
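To give a feel for what "allow some tags" involves, here is a deliberately tiny allowlist sanitizer sketched in Python with the standard html.parser module. It is not how HTML Purifier works internally and is no substitute for it. The ALLOWED_TAGS policy and all names are hypothetical, and it sidesteps the hard parts (attributes, URLs, nesting rules) by dropping every attribute.

```python
import html
from html.parser import HTMLParser

# Hypothetical policy: only these formatting tags survive, with no attributes.
ALLOWED_TAGS = {"b", "i", "em", "strong", "code"}

class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            # Re-emit a clean tag; every attribute (onclick, style, ...) is dropped.
            self.out.append("<%s>" % tag)
            self.open_tags.append(tag)
        else:
            # Show disallowed tags as text instead of deleting them.
            self.out.append(html.escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing forms such as <br/> are shown as text (no allowed tag is void).
        self.out.append(html.escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close inner tags first so the output stays properly nested.
            while self.open_tags:
                inner = self.open_tags.pop()
                self.out.append("</%s>" % inner)
                if inner == tag:
                    break
        elif tag not in ALLOWED_TAGS:
            self.out.append(html.escape("</%s>" % tag))

    def handle_data(self, data):
        self.out.append(html.escape(data))

def sanitize(fragment: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(fragment)
    parser.close()
    # Close whatever the user left open so later page content is unaffected.
    while parser.open_tags:
        parser.out.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.out)

print(sanitize('<strong onclick="evil()">bold forever'))
print(sanitize("<script>alert(1)</script>"))
```

Even this toy has to track open tags, repair nesting, and decide what to do with everything it does not recognize. Supporting links, images, or styles safely adds a great deal more, which is why the real libraries are so large.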

Answer 4:

"Many different markup languages have been created because it is arguably more difficult to sanitize HTML."

Really? How is it difficult? There are functions to remove potentially dangerous attributes or tags and validate the HTML before you store it in a database or file. Can you give me examples of how it is difficult to sanitize HTML?