Unescape HTML entities containing newline in JavaScript

Posted 2019-05-07 00:08

Question:

If you have a string containing HTML entities and want to unescape it, this solution (or variants thereof) is suggested multiple times:

function htmlDecode(input){
  var e = document.createElement('div');
  e.innerHTML = input;
  return e.childNodes.length === 0 ? "" : e.childNodes[0].nodeValue;
}

htmlDecode("&lt;img src='myimage.jpg'&gt;");
// returns "<img src='myimage.jpg'>"

(See, for example, this answer: https://stackoverflow.com/a/1912522/1199564)

This works fine as long as the string does not contain a newline and we are not running on an Internet Explorer version before 10 (tested on versions 9 and 8).

If the string contains a newline, IE 8 and 9 will replace it with a space character instead of leaving it unchanged (as it is on Chrome, Safari, Firefox and IE 10).

htmlDecode("Hello\nWorld"); 
// returns "Hello World" on IE 8 and 9

Any suggestions for a solution that works with IE before version 10?

Answer 1:

The simplest, though probably not the most efficient, solution is to have htmlDecode() act only on character and entity references:

var s = "foo\n&amp;\nbar";
s = s.replace(/(&[^;]+;)+/g, htmlDecode);

A more efficient option is an optimized rewrite of htmlDecode() that is called only once per input, acts only on character and entity references, and reuses the DOM element object:

function htmlDecode (input)
{
  var e = document.createElement("span");

  var result = input.replace(/(&[^;]+;)+/g, function (match) {
    e.innerHTML = match;
    return e.firstChild.nodeValue;
  });

  return result;
}

/* returns "foo\n&\nbar" */
htmlDecode("foo\n&amp;\nbar");
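To see why this preserves newlines, it helps to separate the regular expression from the DOM call: only runs of references are ever handed to the decoder, so everything else, including "\n", is copied through untouched. The following is a minimal sketch with no DOM involved. The names replaceEntityRuns and tinyDecode are hypothetical and introduced here only for illustration, and tinyDecode is a deliberately tiny stand-in that knows just three entities (in a browser you would pass the DOM-based decoder instead).

```javascript
// Hypothetical helper: applies decodeRun only to runs of &...; references.
// Text outside those runs (including newlines) is never touched.
function replaceEntityRuns(input, decodeRun) {
  return input.replace(/(&[^;]+;)+/g, function (match) {
    return decodeRun(match);
  });
}

// Stand-in decoder for illustration only; NOT a full HTML entity decoder.
var known = { "&amp;": "&", "&lt;": "<", "&gt;": ">" };

function tinyDecode(run) {
  return run.replace(/&[^;]+;/g, function (ref) {
    // Unknown references are left as they are.
    return Object.prototype.hasOwnProperty.call(known, ref) ? known[ref] : ref;
  });
}

var decoded = replaceEntityRuns("foo\n&amp;\nbar", tinyDecode);
// decoded is "foo\n&\nbar": both newlines survive
```

Because the newline never reaches innerHTML in this arrangement, the whitespace normalization seen in IE 8 and 9 should not come into play.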

Wladimir Palant has pointed out an XSS issue with this function: the value of some (HTML5) event listener attributes, such as onerror, is executed if you assign HTML containing elements with those attributes to the innerHTML property. So you should not use this function on arbitrary input containing actual HTML, only on HTML that is already escaped. Otherwise, adapt the regular expression accordingly; for example, use /(&[^;<>]+;)+/ instead to prevent &…; from being matched where … contains tags.
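The difference between the two patterns can be checked without touching the DOM. The sketch below only compares what each regular expression matches; the variable names and the sample string are made up for illustration, and nothing is assigned to innerHTML.

```javascript
var loose = /(&[^;]+;)+/g;     // pattern used in htmlDecode() above
var strict = /(&[^;<>]+;)+/g;  // adapted pattern that rejects < and >

// Sample input: one real entity, plus markup smuggled between "&" and ";".
var sample = "a &amp; b &<img src=x onerror=alert(1)>; c";

var looseMatches = sample.match(loose);
var strictMatches = sample.match(strict);
// looseMatches:  ["&amp;", "&<img src=x onerror=alert(1)>;"]
// strictMatches: ["&amp;"]
```

With the loose pattern, the second match would be handed to innerHTML as markup. With the strict pattern it is never matched, so it stays in the output as plain text.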

For arbitrary HTML, please see his alternative approach, but note that it is not as compatible as this one.