I'm using HtmlAgilityPack to parse roughly 200,000 HTML documents.
I cannot predict the contents of these documents; however, one such document causes my application to fail with a StackOverflowException. The document contains this HTML:
<ol>
<li><li><li><li><li><li>...
</ol>
There are roughly 10,000 <li> elements nested like that. Due to the way HtmlAgilityPack parses HTML, this causes a StackOverflowException.
Unfortunately a StackOverflowException is not catchable in .NET 2.0 and later.
I did wonder about setting a larger size for the thread's stack, but setting a larger stack size is a hack: it would cause my program to use a lot more memory (my program starts about 50 threads for processing HTML, so all of these threads would have the increased stack size) and would need manually adjusting if it ever came across a similar situation again.
Are there any other workarounds I could employ?
I just patched an error that I believe is the same as the one you're describing. I uploaded the patch to the HAP project site...
http://www.codeplex.com/site/users/view/sjdirect (see the patch on 3/8/2012)
Or see more documentation of the issue and result here....
https://code.google.com/p/abot/issues/detail?id=77
The actual fix was...
Added HtmlDocument.OptionMaxNestedChildNodes that can be set to prevent StackOverflowExceptions that are caused by tons of nested tags. It will throw an ApplicationException with message "Document has more than X nested tags. This is likely due to the page not closing tags properly."
How I'm Using Hap After Patch...
HtmlDocument hapDoc = new HtmlDocument();
hapDoc.OptionMaxNestedChildNodes = 5000; // This is the option the patch added
string rawContent = GETTHECONTENTHERE; // placeholder: supply the raw HTML here
try
{
    hapDoc.LoadHtml(rawContent);
}
catch (Exception e)
{
    // Instead of a StackOverflowException you should end up here now
    hapDoc.LoadHtml("");
    _logger.Error(e);
}
Ideally, the long-term solution is to patch HtmlAgilityPack to use a heap-stack instead of the call-stack, but that would be an undertaking too big for me. I've temporarily lost my CodePlex account details, but when I get them back I'll submit an Issue report on the problem. I also note that this issue could present a Denial-of-Service attack vulnerability to any site that uses HtmlAgilityPack to sanitize user-submitted HTML - a crafted overly-nested HTML document would cause the w3wp.exe process to die.
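To illustrate what "a heap-stack instead of the call-stack" means, here is a minimal sketch of an iterative tree walk. It is not HtmlAgilityPack code: the Node class and the IterativeWalk helper are hypothetical names made up for this example, and the real parser would need a much larger change. The pending work lives in a Stack<T> on the heap, so the nesting depth is limited by available memory rather than by the thread's stack size.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal tree node; this is NOT HtmlAgilityPack's HtmlNode.
public sealed class Node
{
    public List<Node> Children { get; } = new List<Node>();
}

public static class IterativeWalk
{
    // Returns the deepest nesting level of the tree. Pending nodes are kept
    // in a heap-allocated Stack<T> instead of in recursive call frames.
    public static int MaxDepth(Node root)
    {
        int maxDepth = 0;
        var pending = new Stack<(Node Node, int Depth)>();
        pending.Push((root, 1));

        while (pending.Count > 0)
        {
            var (node, depth) = pending.Pop();
            if (depth > maxDepth) maxDepth = depth;
            foreach (var child in node.Children)
                pending.Push((child, depth + 1));
        }
        return maxDepth;
    }

    // Builds a chain of `depth` nodes, each the only child of the previous
    // one, mimicking the <li><li><li>... document from the question.
    public static Node BuildChain(int depth)
    {
        var root = new Node();
        var current = root;
        for (int i = 1; i < depth; i++)
        {
            var child = new Node();
            current.Children.Add(child);
            current = child;
        }
        return root;
    }
}
```

For example, IterativeWalk.MaxDepth(IterativeWalk.BuildChain(100000)) walks a chain ten times deeper than the problem document without using any extra call-stack frames per level.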
In the meantime, I figured the best way forward is to manually override the maximum thread stack size. I was wrong in my earlier statement that a bigger stack-size means that all threads automatically consume that memory (it seems memory pages are allocated for a thread stack as it grows, not all-at-once).
I made a copy of the <ol><li> page and ran some experiments. I found that my program failed when the stack size was less than 2^21 bytes in size, but a maximum size of 2^22 succeeded - that's 4MB and in my book passes as an "acceptable" hack... for now.
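For reference, this is roughly how the stack size can be overridden, using the Thread constructor overload that takes a maxStackSize argument. Treat it as a sketch: BigStackRunner and its members are names invented for this example, and the 4MB figure is just the value from the experiment above, not a recommendation. A StackOverflowException still cannot be caught, so this only raises the nesting depth at which the process dies.

```csharp
using System;
using System.Threading;

// Hypothetical helper; the names are made up for this example.
public static class BigStackRunner
{
    // 2^22 bytes = 4 MB, the size that worked in the experiment above.
    public const int MaxStackSizeBytes = 1 << 22;

    // Runs `work` on a dedicated thread created with the larger maximum
    // stack size, and blocks until that thread finishes.
    public static void Run(Action work)
    {
        Exception failure = null;
        var worker = new Thread(() =>
        {
            try { work(); }
            catch (Exception e) { failure = e; } // ordinary exceptions only
        }, MaxStackSizeBytes);

        worker.Start();
        worker.Join();

        // Surface any ordinary exception on the calling thread.
        if (failure != null)
            throw new InvalidOperationException("Worker thread failed.", failure);
    }
}
```

With this in place, a parse call could be wrapped as BigStackRunner.Run(() => hapDoc.LoadHtml(rawContent)); where hapDoc and rawContent are whatever document and input you already have. In a program that already starts its own worker threads, passing the stack size directly to those Thread constructors would do the same job.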