How to convert HTML to valid XHTML?

I have a string of HTML. In this example it looks like this:

<img src="somepic.jpg" someAtrib="1" >

I am trying to work out a regex that will match the 'img' node and add a slash to the end of it, so that it looks like this:

<img src="somepic.jpg" someAtrib="1" />

The end goal is to ensure that the node is closed: unclosed nodes are valid in HTML but not in XML. Are there any regex buffs out there who can help?

Tags: javascript html xml parsing xhtml

5 Answers

闹够了就滚

Post #2 · 2019-02-10 13:21

This will do a pretty good job:

result = text.replace(/(<img\b[^<>]*[^<>\/])>/ig, "$1 />");

Addendum: In the (unlikely) event that your markup has attribute values containing angle brackets (a literal < in an attribute value is not well-formed XML/XHTML, by the way), this one will do a slightly better job:

result = text.replace(/(<img\b(?:[^<>"'\/]+|'[^']*'|"[^"]*")*)>/ig, "$1 />");
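To make the difference between the two patterns concrete, here is a small sketch that runs both over a few inputs. The regexes are copied unchanged from this answer; the helper names closeImgSimple and closeImgQuoteAware and the sample strings other than the question's own are made up for illustration.

```javascript
// The two patterns are copied from the answer above; the helper names are hypothetical.
function closeImgSimple(text) {
  return text.replace(/(<img\b[^<>]*[^<>\/])>/ig, "$1 />");
}

function closeImgQuoteAware(text) {
  return text.replace(/(<img\b(?:[^<>"'\/]+|'[^']*'|"[^"]*")*)>/ig, "$1 />");
}

var samples = [
  '<img src="somepic.jpg" someAtrib="1" >',  // the question's input
  '<img src="somepic.jpg" someAtrib="1" />', // already closed
  '<img>',                                   // no attributes at all
  '<img alt="a>b" src="x.jpg">'              // ">" inside a quoted value
];

// Print each input followed by what each pattern turns it into.
samples.forEach(function (html) {
  console.log(html);
  console.log('  simple:      ' + closeImgSimple(html));
  console.log('  quote-aware: ' + closeImgQuoteAware(html));
});
```

A few things show up when this runs. Both patterns close the question's input but keep its trailing space, so the result has two spaces before the slash. Both leave an already closed tag alone. A bare <img> is only closed by the second pattern, because the first one requires at least one character after the tag name. With a > inside a quoted value, the first pattern inserts the slash in the middle of the attribute, while the second one closes the tag at the right place.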


虎瘦雄心在

Post #3 · 2019-02-10 13:21

Why would you want to fix, in the browser DOM, an HTML document that is invalid as XHTML?

The document has already been served and parsed, and the DOM is already available. Any parsing error that an invalid or malformed document would cause has already happened, and running a regex against the DOM won't fix it.

Also, remember that almost all documents are parsed as HTML tag soup. If you can't fix the document on the server side, just ignore its validity and well-formedness on the client side.


Root（大扎）

Post #4 · 2019-02-10 13:31

Don't use a regular expression; use a dedicated parser. In JavaScript, parse the string into a document with DOMParser, then serialize it with XMLSerializer:

var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
var result = new XMLSerializer().serializeToString(doc);
// result:
// <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body> (no line break)
// <img src="foo" /></body></html>


淡お忘

Post #5 · 2019-02-10 13:34

In addition to Rob W's answer (the DOMParser approach above), you can extract the body content using a regex:

var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
var result = new XMLSerializer().serializeToString(doc);

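// Capture everything between the body tags; [\s\S] also matches line breaks, which "." does not.
var match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(result);
if (match) result = match[1];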

// result:
// <img src="foo" />

Note: parseFromString(htmlString, 'text/html') throws an error in IE9 because IE9 does not support the text/html MIME type. It works in IE10 and IE11, though.
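As an alternative to pulling the body out with a regex, the body's children can be serialized one by one. The sketch below is only an illustration: htmlToXhtmlFragment is a made-up helper name, and returning null when 'text/html' parsing is unavailable is an assumption about how a caller might want to handle that case.

```javascript
// Hypothetical helper: returns the XHTML serialization of the body's children,
// or null when DOMParser cannot parse 'text/html' in this environment.
function htmlToXhtmlFragment(html) {
  if (typeof DOMParser === 'undefined' || typeof XMLSerializer === 'undefined') {
    return null; // not a browser-like environment
  }
  var doc;
  try {
    doc = new DOMParser().parseFromString(html, 'text/html');
  } catch (e) {
    return null; // e.g. IE9, which throws for 'text/html' as noted above
  }
  if (!doc || !doc.body) {
    return null; // some older engines may return null instead of throwing
  }
  var serializer = new XMLSerializer();
  var parts = [];
  // Serialize each child of <body> separately and concatenate the results.
  for (var node = doc.body.firstChild; node; node = node.nextSibling) {
    parts.push(serializer.serializeToString(node));
  }
  return parts.join('');
}

console.log(htmlToXhtmlFragment('<img src="foo">'));
```

Because each node is serialized on its own, the output for an element will typically carry its own xmlns="http://www.w3.org/1999/xhtml" attribute, which the regex approach above does not produce. Whether that matters depends on where the fragment ends up.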


Animai°情兽

Post #6 · 2019-02-10 13:38

You can create an XHTML document and import or adopt HTML elements into it. HTML strings can be parsed with an element's innerHTML property, of course. The key point is to use Document.importNode() or Document.adoptNode() to convert HTML nodes to XHTML nodes:

var di = document.implementation;
var hd = di.createHTMLDocument();
var xd = di.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
hd.body.innerHTML = '<img>';
var img = hd.body.firstElementChild;
var xb = xd.createElement('body');
xd.documentElement.appendChild(xb);
console.log('html doc:\n' + hd.documentElement.outerHTML + '\n');
console.log('xhtml doc:\n' + xd.documentElement.outerHTML + '\n');
img = xd.importNode(img); // or xd.adoptNode(img); img is now an XHTML element
xb.appendChild(img);
console.log('xhtml doc after import/adopt img from html:\n' + xd.documentElement.outerHTML + '\n');

The output should be:

html doc:
<html><head></head><body><img></body></html>

xhtml doc:
<html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>

xhtml doc after import/adopt img from html:
<html xmlns="http://www.w3.org/1999/xhtml"><body><img /></body></html>

Rob W's answer does not work in Chrome (at least version 29 and below) because DOMParser does not support the 'text/html' type there, and XMLSerializer generates HTML syntax (not XHTML) for an HTML document in Chrome.

