We have a special requirement in a project where we have to parse a string of HTML (from an AJAX response) client side via JavaScript only. That's right, no parsing in PHP or Java! I've been going through Stack Overflow this entire week and have not yet found an acceptable solution.
Some more details on the requirements:
We can use any library (preferably dojo and / or jQuery) or go native!
We need to parse an entire HTML document that we receive as a string, including the <head> and <body>.
We also need to serialise the parsed DOM structures back out to strings at times.
Finally, we don't want to append the parsed DOM to the current document. Rather, we'll send it back to the server for permanent storage.
E.g., we need something like:
var dom = HTMLtoDOM('<html><head><title> This is the old title. </title></head></html>');
dom.getElementsByTagName('title')[0].innerHTML = "This is a new Title";
From my research, these are our options:
The TinyMCE parser. Problem? I think we would necessarily have to include the editor. What about parsing HTML where we don't need an editor?
John Resig's parser. This should be our best bet. Unfortunately, the parser crashes when it is given the entire contents of a page!
The jQuery $(htmlString) or the dojo.toDom(htmlString). Both rely on DocumentFragment and hence gobble up <head> and <body>!
EDIT: We want to serialize the HTML so that we can catch certain custom HTML comments via RegExp. We also need to give users the opportunity to edit meta tags, title tags, etc., hence the HTML parser.
Oh, and I feel I will be murdered on Stack Overflow even if I just hint at parsing HTML via RegExp!!!
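To make the comment-catching step from the edit concrete, here is a rough sketch that runs on the serialized string only. The question does not show what the custom comments look like, so the `cms:` prefix, the pattern `CUSTOM_COMMENT`, and the function name `findCustomComments` are all made up for illustration.

```javascript
// Hypothetical marker format (an assumption, not from the question):
//   <!-- cms:NAME optional payload -->
const CUSTOM_COMMENT = /<!--\s*cms:([\w-]+)\s*([\s\S]*?)\s*-->/g;

// Collect every custom comment found in a serialized HTML string.
function findCustomComments(html) {
  const found = [];
  for (const match of html.matchAll(CUSTOM_COMMENT)) {
    found.push({ name: match[1], payload: match[2], index: match.index });
  }
  return found;
}
```

Comments are one of the few places where a regular expression is reasonably safe, because comment syntax does not nest. This only holds when it runs on the serialized output, not as a replacement for the parser.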
If you want a full parser that doesn't rely on something already in the browser to bootstrap your interpreter, the HTML parser in dom.js is top notch. Its entire purpose is to parse HTML for use in a JavaScript-hosted DOM, so it has to cater both to the DOM specifications and to the need to parse and use the results in JS, all without assuming any existing tools besides base JS. It even works in node.js, SpiderMonkey's jsshell, or web workers. https://github.com/andreasgal/dom.js
It also has the serialization part, but to use that you'll need to commit to using more than just the parser. You can, however, find standalone serializers that work with any DOM-like structure.
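To show what "works with any DOM-like structure" means in practice, here is a minimal serializer sketch. It is not dom.js code and not any particular library's API; the names `serialize`, `escapeText`, and `escapeAttr` are invented for this example. It assumes nodes expose the standard `nodeType`, `nodeName`, `attributes`, `childNodes`, and `nodeValue` properties, so plain objects shaped like that work as well.

```javascript
// Elements that never take a closing tag in HTML.
const VOID_ELEMENTS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr',
  'img', 'input', 'link', 'meta', 'source', 'track', 'wbr']);

function escapeText(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function escapeAttr(s) {
  return s.replace(/&/g, '&amp;').replace(/"/g, '&quot;');
}

// Walk a DOM-like tree and build an HTML string.
// Limitation of this sketch: text inside <script> and <style> is escaped
// like ordinary text, which a real serializer must not do.
function serialize(node) {
  const children = () => Array.from(node.childNodes || []).map(n => serialize(n)).join('');
  switch (node.nodeType) {
    case 1: { // element
      const tag = node.nodeName.toLowerCase();
      const attrs = Array.from(node.attributes || [])
        .map(a => ` ${a.name}="${escapeAttr(a.value)}"`).join('');
      if (VOID_ELEMENTS.has(tag)) return `<${tag}${attrs}>`;
      return `<${tag}${attrs}>${children()}</${tag}>`;
    }
    case 3: return escapeText(node.nodeValue);   // text
    case 8: return `<!--${node.nodeValue}-->`;   // comment
    case 10: return `<!DOCTYPE ${node.name}>`;   // doctype
    case 9:                                      // document
    case 11: return children();                  // document fragment
    default: return '';
  }
}
```

Because it only reads those few properties, the same function can be pointed at a browser `Document`, at the output of a JavaScript-hosted DOM, or at hand-built objects.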
I would suggest a two-part solution: read off the tags that jQuery will not parse for you, then pass the remainder to jQuery. If you're looking for a pure-JavaScript way to parse an HTML data structure, jQuery is probably your best bet, as it has many built-in functions to manipulate the data. You could build your parser as a jQuery plugin, callable as $.parser or something of that nature. If you extend jQuery with your own parsing function, you can also return an extended jQuery object with functions for reading specific data elements, even from the header, since you can manually parse the ... information and store it in the same object.
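A rough sketch of the two steps described above. The function name `splitDocument` and the two patterns are assumptions for illustration, not part of jQuery. The string-splitting step is deliberately naive: it assumes one explicitly closed <head> and one explicitly closed <body>, and it would be fooled by those tags appearing inside comments or scripts.

```javascript
// Step 1 (plain string work): pull out the inner markup of <head> and <body>.
// The (?:\s[^>]*)? part allows attributes but avoids matching <header>.
const HEAD_RE = /<head(?:\s[^>]*)?>([\s\S]*?)<\/head\s*>/i;
const BODY_RE = /<body(?:\s[^>]*)?>([\s\S]*?)<\/body\s*>/i;

function splitDocument(html) {
  const head = HEAD_RE.exec(html);
  const body = BODY_RE.exec(html);
  return {
    headHtml: head ? head[1] : '',
    bodyHtml: body ? body[1] : '',
  };
}

// Step 2 (browser only, with jQuery 1.8+ loaded): hand each part to jQuery
// inside a detached container, so nothing is appended to the current page.
// const parts = splitDocument(responseText);
// const $head = $('<div>').append($.parseHTML(parts.headHtml));
// const $body = $('<div>').append($.parseHTML(parts.bodyHtml));
// $head.find('title').text('This is a new Title');
```

Keeping the head and body in separate containers also makes it straightforward to reassemble the document string before sending it back to the server.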
You can leverage the current document without appending any nodes to it.
Try something like this:
http://jsfiddle.net/6SvqA/3/
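The fiddle's code is not reproduced here. One way to get this effect in current browsers, and not necessarily the approach the fiddle takes, is `DOMParser` with the `text/html` type, which returns a detached `Document` that keeps <head> and <body>. The function names below mirror the `HTMLtoDOM` call from the question; `DOMtoHTML` and the `ParserCtor` parameter are additions for this sketch.

```javascript
// Parse a full HTML document string into a detached Document.
// ParserCtor defaults to the browser's DOMParser; the parameter is an
// assumption of this sketch so it can also be exercised outside a browser.
function HTMLtoDOM(html, ParserCtor = DOMParser) {
  return new ParserCtor().parseFromString(html, 'text/html');
}

// Serialize the document back to a string.
// outerHTML does not include the doctype, so re-add one if you need it.
function DOMtoHTML(doc) {
  return doc.documentElement.outerHTML;
}

// Browser usage, following the example in the question:
// const dom = HTMLtoDOM('<html><head><title> This is the old title. </title></head></html>');
// dom.getElementsByTagName('title')[0].innerHTML = 'This is a new Title';
// const html = DOMtoHTML(dom); // string to send back to the server
```

Nothing is appended to the current page, and scripts in a document parsed this way are not executed.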
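Since HTML is essentially XML (at least when it is well formed, as in XHTML), you can use jQuery's $.parseXML. Note that it will reject markup that is not well-formed XML.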
Edit:
If you want to get it back into a string you will need to use the xml plugin, but I cannot find its original source so here it is:
I do not know why anybody should need this, but I suggest you simply dump your source into an iframe. The browser can do the parsing for you. You can even run DOM queries on the result.