I need to convert HTML entities to their Unicode characters. For example, when I have &amp;, I would like just &. Is there a built-in function for this, or do I have to call replace() for each HTML entity <--> Unicode character pair?
Thanks in advance.
Even though there's no DOM in Apps Script, you can parse out HTML and get the plain text this way:
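function getTextFromHtml(html) {
  // Passing true makes Xml.parse treat the input as an HTML page.
  return getTextFromNode(Xml.parse(html, true).getElement());
}

function getTextFromNode(x) {
  switch (x.toString()) {
    // Text node: emit its content.
    case 'XmlText': return x.toXmlString();
    // Element: recurse into the children and concatenate their text.
    case 'XmlElement': return x.getNodes().map(getTextFromNode).join('');
    // Anything else contributes nothing.
    default: return '';
  }
}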
Calling
getTextFromHtml("hello <div>foo</div>&amp; world <br /><div>bar</div>!");
returns
"hello foo& world bar!"
To explain: passing true as the second argument makes Xml.parse parse the input as an HTML page. The parser patches the document up with any missing HTML and BODY elements and turns it into a valid XHTML page. We then walk that document, emitting the content of each text node and recursing into every element; any other kind of node contributes nothing.
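If you only need to decode entities and there is no markup to strip, a single replace() call with a callback and a lookup table avoids chaining one replace() per entity. The following is a minimal sketch rather than a complete decoder: decodeEntities and NAMED_ENTITIES are hypothetical names, the table covers only a handful of named entities, and String.fromCodePoint assumes a runtime that provides it.

```javascript
// Hypothetical lookup table; extend it with the named entities your input uses.
var NAMED_ENTITIES = {
  amp: '&',
  lt: '<',
  gt: '>',
  quot: '"',
  apos: "'",
  nbsp: '\u00A0'
};

// Decodes named entities from the table plus decimal (&#65;) and
// hexadecimal (&#x41;) character references. Unknown entities are left as-is.
function decodeEntities(text) {
  var pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;
  return text.replace(pattern, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1) === 'x' || body.charAt(1) === 'X';
      var code = parseInt(body.substring(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Own-property check so names like "constructor" are not looked up
    // on Object.prototype.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}
```

Because the whole string is processed in one pass, text that has been escaped twice is decoded only one level: decodeEntities('&amp;lt;') gives '&lt;', not '<'.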
In JavaScript (I assume that's what you're using), there's no built-in function, but you can assign the content to an HTML element and then read the text back out. Here's an example with jQuery:
function htmlDecode(value) {
  return $('<div/>').html(value).text();
}
Note that the element does not need to be attached to the DOM. This just creates a new element, reads out its text content, and then throws it away. You can accomplish something very similar in vanilla JavaScript with just a few extra lines.
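A vanilla version might look like the sketch below. It assumes a browser environment; htmlDecodeVanilla is a hypothetical name, and the optional doc parameter is an illustrative addition that lets the function be exercised with a stand-in document object where no global document exists.

```javascript
// Vanilla counterpart of the jQuery one-liner.
// `doc` is a hypothetical optional parameter; when omitted, the
// browser's global `document` is used.
function htmlDecodeVanilla(value, doc) {
  doc = doc || document;
  var el = doc.createElement('div'); // detached element, never added to the page
  el.innerHTML = value;              // the browser parses the markup and decodes entities
  return el.textContent;             // text only, with the tags dropped
}
```

In a browser, htmlDecodeVanilla('a &amp; b') is called with a single argument, just like the jQuery version.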