I want to allow embedding of HTML but avoid DoS due to deeply nested HTML documents that crash some browsers. I'd like to be able to accommodate 99.9% of documents, but reject those that nest too deeply.
Two closely related questions:
- What document depth limits are built into browsers? E.g. browser X fails to parse or does not build documents with depth > some limit.
- Are document depth statistics available for documents on the web? Is there a site with web statistics showing that some percentage of real documents on the web have document depths less than some value?
Document depth is defined as 1 + the maximum number of parent traversals needed to reach the document root from any node in a document. For example, in
<html> <!-- 1 -->
<body> <!-- 2 -->
<div> <!-- 3 -->
<table> <!-- 4 -->
<tbody> <!-- 5 -->
<tr> <!-- 6 -->
<td> <!-- 7 -->
Foo <!-- 8 -->
the maximum depth is 8 since the text node "Foo" has 8 ancestors. Ancestor here is interpreted non-strictly, i.e. every node is its own ancestor and its own descendant.
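To make the definition concrete, here is a minimal sketch that computes document depth over a tree. It assumes nodes expose an array-like `childNodes` (as real DOM nodes do); `documentDepth` and `makeNode` are names made up for this illustration, and plain objects stand in for DOM nodes so it runs outside a browser.

```javascript
// Computes document depth as defined above: the root counts as depth 1,
// and each step down to a child adds 1. Works on any object exposing an
// array-like `childNodes` (plain objects are used here as stand-ins).
function documentDepth(root) {
  let maxDepth = 0;
  // Explicit stack instead of recursion, so a very deep tree cannot
  // overflow the JavaScript call stack.
  const stack = [{ node: root, depth: 1 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop();
    if (depth > maxDepth) maxDepth = depth;
    const children = node.childNodes || [];
    for (let i = 0; i < children.length; i++) {
      stack.push({ node: children[i], depth: depth + 1 });
    }
  }
  return maxDepth;
}

// Hypothetical helper that builds a plain-object stand-in for a DOM node.
function makeNode(name, ...childNodes) {
  return { name, childNodes };
}

// The example tree: html > body > div > table > tbody > tr > td > "Foo"
const exampleTree =
  makeNode('html',
    makeNode('body',
      makeNode('div',
        makeNode('table',
          makeNode('tbody',
            makeNode('tr',
              makeNode('td',
                makeNode('#text'))))))));

console.log(documentDepth(exampleTree)); // 8
```

In a browser the same function could be called as `documentDepth(document.documentElement)`.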
Opera has some table nesting stats, which suggest that 99.99% of documents have a table nesting depth of less than 22, but that data does not contain whole document depth.
EDIT:
If people would like to criticize the HTML sanitization library instead of answering this question, please do. http://code.google.com/p/owasp-java-html-sanitizer/wiki/AttackReviewGroundRules explains how to find the code, where to find a testbed that lets you try out attacks, and how to report issues.
EDIT:
I asked Adam Barth, and he very kindly pointed me to the WebKit code that handles this.
WebKit, at least, enforces a depth limit. When a tree builder is created, it receives a maximum tree depth that is configurable:
m_treeBuilder(HTMLTreeBuilder::create(this, document, reportErrors, usePreHTML5ParserQuirks(document), maximumDOMTreeDepth(document)))
and it is tested by the block-nesting-cap test.
For WebKit, the maximum document depth is configurable, but by default it is 512:
http://trac.webkit.org/browser/trunk/Source/WebCore/page/Settings.h#L408
It may be worth asking coderesearch@google.com. Their study from 2005 (http://code.google.com/webstats/) doesn't cover your particular question. They sampled more than a billion documents though, and are interested in hearing about anything you feel is worth examining.
--[Update]--
Here's a crude script I wrote to test the browsers I have (putting the number of elements to nest into the query string):
And here are my findings (may be specific to my machine: Win XP, 3 GB RAM):
More on Chrome:
- Changing the DIV to a SPAN resulted in Chrome being able to nest 9202 elements before crashing. So the size of the HTML is not the reason (although SPAN elements may be more lightweight).
- Nesting 2077 table cells (<table><tr><td>) worked (6231 elements) until you scrolled down to cell 445, at which point it crashed, so in practice you can't nest 445 table cells (1335 elements).
- Testing with files generated from the script (as opposed to writing to new windows) gave slightly higher tolerances, but Chrome still crashed.
- You can nest 1409 list items (<ul><li>) before it crashes, which is interesting because:
- Setting a DOCTYPE is effective in IE8 (putting it into standards mode, i.e. var outboundHtml = '<!DOCTYPE html>';): it will not nest 792 list items (the tab crashes/closes) or 1593 DIVs.
- It made no difference in IE8 whether the test was generated from the script or loaded from a file.
So the nesting limit of a browser apparently depends on the type of HTML elements the attacker is injecting, and on the layout engine. There could be some HTML payload considerably smaller than these that also triggers a crash. And we have a plain-HTML DoS for IE8, Chrome and Safari users with a fairly small payload.
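For anyone wanting to repeat this kind of test, a nested test document can be generated along these lines. This is only a sketch and not the script used for the numbers above: `buildNestedHtml` and its parameters are invented for illustration, and only the `outboundHtml` variable name and the DOCTYPE string come from the description.

```javascript
// Builds a test document with `count` copies of `openTags` nested inside
// one another. All names here are illustrative assumptions.
function buildNestedHtml(count, openTags, withDoctype) {
  // Derive the matching close tags in reverse order,
  // e.g. '<ul><li>' -> '</li></ul>'.
  const names = (openTags.match(/<([a-z0-9]+)>/gi) || [])
    .map((tag) => tag.slice(1, -1));
  const closeTags = names.slice().reverse().map((n) => `</${n}>`).join('');

  // Optionally start in standards mode, as in the IE8 test.
  let outboundHtml = withDoctype ? '<!DOCTYPE html>' : '';
  outboundHtml += openTags.repeat(count);
  outboundHtml += 'x'; // a single text node at the deepest level
  outboundHtml += closeTags.repeat(count);
  return outboundHtml;
}

console.log(buildNestedHtml(2, '<ul><li>', true));
// <!DOCTYPE html><ul><li><ul><li>x</li></ul></li></ul>
```

In a browser, the count could be read from the query string with something like `new URLSearchParams(location.search).get('n')` (the parameter name `n` is an assumption), and the result either written to a new window or saved to a file.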
It seems that if you are going to allow users to post HTML that gets rendered on one of your pages, it is worth considering a limit on nested elements, particularly if the size limit is generous.
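A rough sketch of such a limit is below. It assumes the markup is already well balanced (for example, the output of a sanitizer) and uses a crude tag scan rather than a real HTML parser, so it ignores implied tags, comments, and attribute values containing `>`. The name `withinNestingLimit` is hypothetical, and it counts element nesting only, so a text node inside the deepest element adds one more level under the depth definition in the question.

```javascript
// Elements that never have children, so they do not add to nesting depth.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Returns true if the element nesting in `html` never exceeds `maxDepth`.
// Assumption: `html` is balanced markup; this is not a full HTML parser.
function withinNestingLimit(html, maxDepth) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let depth = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClose = match[1] === '/';
    const name = match[2].toLowerCase();
    if (VOID_ELEMENTS.has(name)) continue;
    if (isClose) {
      depth = Math.max(0, depth - 1);
    } else {
      // A trailing slash as in '<div/>' is deliberately NOT treated as
      // self-closing: HTML parsers open the element anyway.
      depth += 1;
      if (depth > maxDepth) return false;
    }
  }
  return true;
}

console.log(withinNestingLimit('<div><p>hi</p></div>', 2)); // true
console.log(withinNestingLimit('<div><p>hi</p></div>', 1)); // false
```

Rejecting as soon as the limit is exceeded keeps the check cheap even for hostile input, since the scan never builds a tree.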