Here's something I've been having a little bit of difficulty with. I have a local client-side script that needs to allow a user to fetch a remote web page and search that resulting page for forms. In order to do this (without regex), I need to parse the document into a fully traversable DOM object.
Some limitations I'd like to stress:
- I don't want to use libraries (like jQuery). There's too much bloat for what I need to do here.
- Under no circumstances should scripts from the remote page be executed (for security reasons).
- DOM APIs, such as getElementsByTagName, need to be available.
- It only needs to work in Internet Explorer, but in 7 at the very least.
- Let's pretend I don't have access to a server. I do, but I can't use it for this.
What I've tried
Assuming I have a complete HTML document string (including DOCTYPE declaration) in the variable html, here's what I've tried so far:
var frag = document.createDocumentFragment(),
div = frag.appendChild(document.createElement("div"));
div.outerHTML = html;
//-> results in an empty fragment
div.insertAdjacentHTML("afterEnd", html);
//-> HTML is not added to the fragment
div.innerHTML = html;
//-> Error (expected, but I tried it anyway)
var doc = new ActiveXObject("htmlfile");
doc.write(html);
doc.close();
//-> JavaScript executes
I've also tried extracting the <head> and <body> nodes from the HTML and adding them to an <HTML> element inside the fragment, still no luck.
Does anyone have any ideas?
Not sure why you're messing with documentFragments. You can just set the HTML text as the innerHTML of a new div element, then use that div element for getElementsByTagName etc. without adding the div to the DOM.
If you're really married to the idea of a documentFragment, you can use one, but you'll still have to wrap the HTML in a div to get the DOM functions you're after.
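A minimal sketch of the detached-div idea follows. The function name findFormsInDetachedDiv and the hostDocument parameter (standing in for the page's document) are made up for illustration. Note the caveats: the <html>, <head> and <body> tags are dropped by the parser, and although <script> elements inserted through innerHTML do not run, inline handlers such as an image's onerror can still fire and resources may still be requested, so this alone does not meet the "no scripts" requirement.

```javascript
// Sketch only: the function name and the hostDocument parameter are hypothetical.
// hostDocument stands in for the page's `document`.
function findFormsInDetachedDiv(html, hostDocument) {
  // The div is created but never appended to the page.
  var div = hostDocument.createElement("div");
  // Parsing happens here; <html>, <head> and <body> tags do not survive.
  div.innerHTML = html;
  // Element nodes support getElementsByTagName even when detached.
  return div.getElementsByTagName("form");
}
```

In a browser it would be called as findFormsInDetachedDiv(html, document).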
Just wandered across this page, am a bit late to be of any use :) but the following should help anyone with a similar problem in future... however IE7/8 should really be ignored by now and there are much better methods supported by the more modern browsers.
The following works across nearly everything I've tested. The only two downsides are:
- I've added bespoke getElementById and getElementsByName functions to the root div element, so these won't appear as expected further down the tree (unless the code is modified to cater for this).
- The doctype will be ignored. However, I don't think this will make much difference, as my experience is that the doctype won't affect how the DOM is structured, just how it is rendered (which obviously won't happen with this method).
Basically the system relies on the fact that <tag> and <namespace:tag> are treated differently by the user agents. As has been found, certain special tags cannot exist within a div element, and so they are removed. Namespaced elements can be placed anywhere (unless there is a DTD stating otherwise). Whilst these namespaced tags won't actually behave as the real tags in question, we are only really using them for their structural position in the document, so it doesn't cause a problem. Markup and code are as follows:
Fiddle: http://jsfiddle.net/JFSKe/6/
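The tag-renaming step can be illustrated with a short string-only sketch. This is not the code from the fiddle: the function name namespaceStructuralTags, the "x" prefix and the choice of tags (html, head, body) are assumptions. A plain regular expression like this will also rewrite matching text inside attribute values or script bodies, so treat it as an illustration of the idea only.

```javascript
// Sketch only: name, prefix and tag list are hypothetical.
function namespaceStructuralTags(html, prefix) {
  prefix = prefix || "x";
  // The doctype is ignored by this approach, so drop it up front.
  var withoutDoctype = html.replace(/<!DOCTYPE[^>]*>/i, "");
  // Matches "<html", "</head", "<body ..." and so on, case-insensitively.
  // The lookahead keeps "<header" from being mistaken for "<head".
  var structural = /<(\/?)(html|head|body)(?=[\s>\/])/gi;
  return withoutDoctype.replace(structural, function (match, slash, tag) {
    return "<" + slash + prefix + ":" + tag.toLowerCase();
  });
}
```

The returned string would then be assigned to the innerHTML of a detached div, where the renamed elements keep their structural position.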
DocumentFragment doesn't implement DOM methods. Using document.createElement in conjunction with innerHTML removes the <head> and <body> tags (even when the created element is a root element, <html>). Therefore, the solution should be sought elsewhere. I have created a cross-browser string-to-DOM function, which makes use of an invisible inline frame.
All external resources and scripts will be disabled. See Explanation of the code for more information.
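As a rough illustration of the invisible-frame idea (not the author's string2dom), a sketch could look like this. The name iframeStringToDom, the hostDocument parameter and the use of display: none are assumptions. The sketch does no sanitising at all, so html must already have had its scripts and event handlers neutralised.

```javascript
// Sketch only: function name, hostDocument parameter and the hiding
// technique (display: none) are hypothetical choices.
// hostDocument stands in for the page's `document`.
function iframeStringToDom(html, hostDocument) {
  var iframe = hostDocument.createElement("iframe");
  iframe.style.display = "none";          // keep the frame invisible
  hostDocument.body.appendChild(iframe);  // attach it so it gets a document

  // contentDocument is missing in IE 7, so fall back to contentWindow.document.
  var doc = iframe.contentDocument || iframe.contentWindow.document;
  doc.open();       // discard whatever the frame held before
  doc.write(html);  // html is assumed to be sanitised already
  doc.close();

  return {
    doc: doc,
    // Call destroy() once the tree is no longer needed.
    destroy: function () {
      if (iframe.parentNode) {
        iframe.parentNode.removeChild(iframe);
      }
    }
  };
}
```

A caller would use result.doc.getElementsByTagName("form") and then result.destroy().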
Code
Explanation of the code
The sanitiseHTML function is based on my replace_all_rel_by_abs function (see this answer). The sanitiseHTML function is completely rewritten though, in order to achieve maximum efficiency and reliability.
Additionally, a new set of RegExps is added to remove all scripts and event handlers (including CSS expression(), IE7-). To make sure that all tags are parsed as expected, the adjusted tags are prefixed by <!--'"-->. This prefix is necessary to correctly parse nested "event handlers" in conjunction with unterminated quotes: <a id="><input onclick="<div onmousemove=evil()>">.
These RegExps are dynamically created using an internal function cr/cri (Create Replace [Inline]). These functions accept a list of arguments, and create and execute an advanced RE replacement. To make sure that HTML entities don't break a RegExp (refresh in <meta http-equiv=refresh> could be written in various ways), the dynamically created RegExps are partially constructed by function ae (Any Entity).
The actual replacements are done by function by (replace by). In this implementation, by adds data- before all matched attributes.
- <script>//<![CDATA[ .. //]]></script> occurrences are stripped. This step is necessary, because CDATA sections allow </script> strings inside the code. After this replacement has been executed, it's safe to go to the next replacement.
- <script>...</script> tags are removed.
- The <meta http-equiv=refresh .. > tag is removed.
- All event listeners and external pointers/attributes (href, src, url()) are prefixed by data-, as described previously.
- An IFrame object is created. IFrames are less likely to leak memory (contrary to the htmlfile ActiveXObject). The IFrame becomes invisible, and is appended to the document, so that the DOM can be accessed. document.write() is used to write HTML to the IFrame. document.open() and document.close() are used to empty the previous contents of the document, so that the generated document is an exact copy of the given html string.
- If a callback function is specified, it is called with two arguments. The first argument is the generated document object. The second argument is a function, which destroys the generated DOM tree when called. This function should be called when you don't need the tree any more.
If the callback function isn't specified, the function returns an object consisting of two properties (doc and destroy), which behave the same as the previously mentioned arguments.
Additional notes
- Setting the designMode property to "On" will stop a frame from executing scripts (not supported in Chrome). If you have to preserve the <script> tags for a specific reason, you can use iframe.designMode = "On" instead of the script stripping feature.
- htmlfile ActiveXObject: according to this source, htmlfile is slower than IFrames, and more susceptible to memory leaks.
- The affected attributes (href, src, ...) are prefixed by data-. An example of getting/changing these attributes is shown for data-href: elem.getAttribute("data-href") and elem.setAttribute("data-href", "..."), or elem.dataset.href and elem.dataset.href = "...".
- Because external resources and scripts are disabled, the page may look different:
  - <link rel="stylesheet" href="main.css" />: no external styles.
  - <script>document.body.bgColor="red";</script>: no scripted styles.
  - <img src="128x128.png" />: no images, so the size of the element may be completely different.
Examples
sanitiseHTML(html)
Paste this bookmarklet in the location bar. It will offer an option to inject a textarea, showing the sanitised HTML string.
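To give a feel for the kind of rewriting sanitiseHTML performs, here is a much-simplified sketch. The name roughSanitise is hypothetical and this is not the author's function: it ignores HTML entities, unterminated quotes, the <!--'"--> prefix, CSS expression() and url(), and it will also touch matching text outside of tags. It should not be relied on as a security boundary.

```javascript
// Sketch only: a hypothetical, heavily simplified stand-in for sanitiseHTML.
function roughSanitise(html) {
  return html
    // 1. Remove <script> elements whose body is wrapped in a CDATA section
    //    first, because such a body may itself contain "</script>".
    .replace(/<script\b[^>]*>\s*(?:\/\/\s*)?<!\[CDATA\[[\s\S]*?\]\]>\s*<\/script\s*>/gi, "")
    // 2. Remove the remaining <script>...</script> elements.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    // 3. Remove <meta http-equiv=refresh ...>.
    .replace(/<meta\b[^>]*http-equiv\s*=\s*["']?refresh\b[^>]*>/gi, "")
    // 4. Prefix event handlers, href and src with "data-" so they become inert.
    .replace(/(\s)(on[a-z]+|href|src)(\s*=)/gi, "$1data-$2$3");
}
```

After this step, a link's original target would be read with elem.getAttribute("data-href") rather than elem.href.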
Code examples - string2dom(html):
Notable references
sanitiseHTML(html) is based on my previously created replace_all_rel_by_abs(html) function.
<applet>)
I'm not sure if IE supports document.implementation.createHTMLDocument, but if it does, use this algorithm (adapted from my DOMParser HTML extension). Note that the DOCTYPE will not be preserved.
DocumentFragment doesn't support getElementsByTagName; that's only supported by Document.
You may need to use a library like jsdom, which provides an implementation of the DOM and through which you can search using getElementsByTagName and other DOM APIs. And you can set it to not execute scripts. Yes, it's 'heavy' and I don't know if it works in IE 7.
To use full HTML DOM abilities without triggering requests, without having to deal with incompatibilities:
All set! doc is an HTML document, but it is not online.
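One way such an offline document can be obtained is document.implementation.createHTMLDocument. The sketch below is an assumption about the approach, not the answer's original code. The name parseOffline and the hostDocument parameter are made up, IE 8 and older lack createHTMLDocument entirely, and some older browsers may refuse to set innerHTML on the root element. A document created this way has no browsing context, so its scripts do not run, and the DOCTYPE in the input is not preserved.

```javascript
// Sketch only: function name and hostDocument parameter are hypothetical.
// hostDocument stands in for the page's `document`.
function parseOffline(html, hostDocument) {
  var impl = hostDocument.implementation;
  if (!impl || !impl.createHTMLDocument) {
    return null; // not supported here; another approach is needed
  }
  // A detached document with an empty title.
  var doc = impl.createHTMLDocument("");
  // The parser rebuilds <head> and <body>; the DOCTYPE in html is dropped.
  doc.documentElement.innerHTML = html;
  return doc;
}
```

In a supporting browser, parseOffline(html, document).getElementsByTagName("form") would then list the forms.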