How to prevent the doctype from being added to the

Posted 2020-04-12 07:36

Question:

I have been working on tidying up messy HTML tags with DOM, but I have now run into a bigger problem,

$content = '<p><a href="#">this is a link</a></p>';

function tidy_html($content,$allowable_tags = null, $span_regex = null)
{      
    $dom = new DOMDocument();
    $dom->loadHTML($content);

    // other code
    return $dom->saveHTML();
}

echo tidy_html($content);

It will output the entire DOM,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 
<html><body><p><a href="#">this is a link</a></p></body></html> 

but I only want something like this in the return,

<p><a href="#">this is a link</a></p>

I don't want,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> 
    <html><body>...</body></html>

Is this possible??

EDIT:

The innerHTML simulation leaves some strange codes in my database, such as &#13;, Â and ’:

<p>Monday July 5th 10am - 3.30pm £20</p>&#13;
<p>Be one of the first visitors to the ...at this special event.Â</p>&#13;
<p>All participants will receive a free copy of the ‘Contemporary Art Kit’ produced exclusively for Art on....</p>&#13;

the innerHTML simulation,

$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach($nodeBody->childNodes as $child) {
  $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}

I found out that the strange codes appear wherever there is a line break, and that they are caused by saveXML($child).

So when I have something like this,

$content = '<p><br/><a href="#">xx</a></p>
<p><br/><a href="#">xx</a></p>';

It will return something like this,

<p><a href="#">xx</a></p>&#13;
<p><a href="#">xx</a></p>

But what I actually want is this,

<p><a href="#">xx</a></p>
<p><a href="#">xx</a></p>

Answer 1:

If you're working on a fragment, you normally need only the body contents.
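
For a fragment with a single top-level element, the wrapper may be avoidable at parse time. The sketch below is not part of the original answer; it assumes PHP 5.4 or later with a libxml build that supports the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD constants. Fragments with several top-level elements can be restructured by the parser under these flags, so treat it as a starting point only.

```php
<?php
$content = '<p><a href="#">this is a link</a></p>';

$dom = new DOMDocument();
// LIBXML_HTML_NOIMPLIED: do not add the implied <html>/<body> elements.
// LIBXML_HTML_NODEFDTD: do not add a default doctype.
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Serialize the whole document, which now holds only the fragment.
$result = $dom->saveHTML();
echo $result;
```

With the single-paragraph input from the question, this should print only the paragraph markup followed by a newline.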

DOMDocument in PHP does not offer something like innerHTML. You can simulate it however:

$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach($nodeBody->childNodes as $child) {
  $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}
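
The loop above serializes with saveXML(), and the XML serializer writes a carriage return inside a text node as &#13;. A minimal sketch of a variant that serializes each child with saveHTML() instead, assuming PHP 7 or later and UTF-8 input. The function name body_inner_html is hypothetical, and the line-ending normalization and the meta charset prefix are additions that are not part of the original answer:

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup
// that ends up inside <body> after DOMDocument has parsed a fragment.
function body_inner_html(string $fragment): string
{
    // Normalize line endings so that no carriage return reaches the DOM.
    $fragment = str_replace(array("\r\n", "\r"), "\n", $fragment);

    $dom = new DOMDocument();
    // Collect parse warnings quietly instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8. The meta tag tells the parser so;
    // it is placed in <head>, so it does not appear in the body contents.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $fragment
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing was parsed into a body
    }

    $innerHTML = '';
    foreach ($body->childNodes as $child) {
        // Passing a node to saveHTML() requires PHP 5.3.6 or later.
        $innerHTML .= $dom->saveHTML($child);
    }
    return $innerHTML;
}

echo body_inner_html("<p><a href=\"#\">xx</a></p>\r\n<p><a href=\"#\">xx</a></p>");
```

Declaring the charset matters because loadHTML() otherwise does not treat the input as UTF-8, which is a likely source of stray characters such as Â in the stored result.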

If you just want to repair a fragment, you can make use of the tidy library as well:

$html = tidy_repair_string($html, array('output-xhtml'=>1,'show-body-only'=>1));
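
A slightly fuller sketch of the same call, assuming the tidy extension is installed and the input is UTF-8. The third argument ('utf8') and the function_exists() guard are additions for illustration, not part of the original answer:

```php
<?php
// Sample input with an unclosed <p>, to give tidy something to repair.
$html = '<p><a href="#">this is a link</a>';
$repaired = null;

if (function_exists('tidy_repair_string')) {
    $config = array(
        'output-xhtml'   => true, // emit XHTML-style markup
        'show-body-only' => true, // return only the contents of <body>
    );
    // Assumption: the input is UTF-8, so the encoding is passed explicitly.
    $repaired = tidy_repair_string($html, $config, 'utf8');
    echo $repaired;
} else {
    echo "The tidy extension is not loaded.\n";
}
```

Because show-body-only is set, the returned string should contain neither a doctype nor the html and body wrappers.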


Answer 2:

Hakre already mentioned the show-body-only option to HTML Tidy, which is probably what you want.

Ps. Here's the Tidy config file used by MediaWiki for pretty much just this purpose.