How to combine PHP's DOMDocument with a JavaSc

Posted 2019-08-27 08:11

Question:

I've got a bit of a strange question here, but it's stumped me completely. As much as anything, this is because I can't think of the correct terms to search for, so this question may well be answered on StackOverflow somewhere but I can't find it.

We have a proofing system that allows us to take a page and annotate it. We can send the page to our clients and they can make notes on it before sending it back. For the most part, this works fine. The problem comes when we try to use a JavaScript template system, similar to Handlebars. We tend to have script templates on our page that look something like this:

<script type="client/template" id="foo-div">
<div>#foo#</div>
</script>

We can use that in our scripts to generate the markup within the template, replacing #foo# with the correct data.

The problem comes when we try to put that into our proofing system. Because we need to scrape the page so we can render it on our domain, we use PHP's DOMDocument to parse the HTML so we can modify it easily (adding things like target="_blank" to external links, etc.). When we run our templates through DOMDocument, it parses them strangely (probably seeing them as invalid XML), and that causes issues on the page. To better illustrate that, here's an example in PHP:

<?php

error_reporting(E_ALL);
ini_set('display_errors', 1);

$html = '<!DOCTYPE html>'.
    '<html>'.
    '<head></head>'.
    '<body>'.
    '<script type="client/template" id="foo-div"><div>#foo#</div></script>'.
    '</body>'.
    '</html>';

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // must be set before loading to have any effect
$dom->formatOutput = false;

libxml_use_internal_errors(true);

// loadHTML() signals a parse failure by returning false, not by throwing an Exception.
$loaded = $dom->loadHTML($html);

if ($loaded === false) {
    throw new Exception('Unable to properly parse page');
}

echo $dom->saveHTML();

This script produces code similar to the HTML below and doesn't seem to throw any exceptions.

<!DOCTYPE html>
<html>
<head></head>
<body><script type="client/template" id="foo-div"><div>#foo#</script></body>
</html>

My question is: does anybody know of a way that I can get PHP's DOMDocument to leave the templating script tag alone? Is there a setting or plugin that I can use to make DOMDocument see the contents of a script tag with a type attribute as plain text, much like browsers do?

Edit

I ended up going with Alf Eaton's solution of parsing the string as XML. However, not all the HTML tags were self-closed, and that caused issues. I'm posting the complete solution here in case anyone comes across the same issue:

/**
 * Inserts a new string into an old string at the specified position.
 * 
 * @param string $old_string Old string to modify.
 * @param string $new_string New string to insert.
 * @param int $position Position at which the new string should be inserted.
 * @return string Old string with new string inserted.
 * @see http://stackoverflow.com/questions/8251426/insert-string-at-specified-position
 */
function str_insert($old_string, $new_string, $position) {

    return substr($old_string, 0, $position) . $new_string .
        substr($old_string, $position);

}

/**
 * Inspects a string of HTML and closes any tags that need self-closing in order
 * to make the HTML valid XML.
 * 
 * @param string $html Raw HTML (potentially invalid XML)
 * @return string Original HTML with self-closing slashes added.
 */
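function self_close($html) {

    $fixed = $html;
    $tags  = array('area', 'base', 'basefont', 'br', 'col', 'frame',
        'hr', 'img', 'input', 'link', 'meta', 'param');

    foreach ($tags as $tag) {

        $offset = 0;

        while (($offset = strpos($fixed, '<' . $tag, $offset)) !== false) {

            // Require a delimiter after the tag name so that "<col" does not
            // also match "<colgroup" (or "<frame" match "<frameset").
            $next     = (string) substr($fixed, $offset + strlen($tag) + 1, 1);
            $is_match = $next !== '' && strpos(" \t\r\n/>", $next) !== false;

            if ($is_match && ($close = strpos($fixed, '>', $offset)) !== false &&
                    $fixed[$close - 1] !== '/') {
                $fixed = str_insert($fixed, '/', $close);
            }

            $offset += 1; // Prevent infinite loops

        }

    }

    return $fixed;

}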

// When parsing the original string:
$html = $dom->loadXML(self_close($html));
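
For comparison, the same self-closing step can be sketched with a single regular expression instead of the loop above. This is an alternative technique, not the code from the original edit; the function name self_close_regex is hypothetical, and the sketch assumes that attribute values never contain a literal ">" character.

```php
<?php

/**
 * Hypothetical regex-based variant of self_close(): rewrites void elements
 * such as <br> or <img ...> as <br/> or <img .../> so the markup can be
 * parsed as XML. Assumes attribute values contain no literal ">".
 */
function self_close_regex(string $html): string
{
    $tags = 'area|base|basefont|br|col|frame|hr|img|input|link|meta|param';

    // \b keeps "col" from matching "colgroup"; the optional "/" is consumed
    // so tags that are already self-closed are normalised, not doubled.
    $pattern = '~<(' . $tags . ')\b([^>]*?)\s*/?>~i';

    return preg_replace($pattern, '<$1$2/>', $html) ?? $html;
}

$sample = '<p>a<br>b<img src="x.png"><colgroup></colgroup><hr /></p>';
echo self_close_regex($sample), "\n";
// <p>a<br/>b<img src="x.png"/><colgroup></colgroup><hr/></p>
```

Like the loop-based version, this only adds the closing slashes; other XML requirements (quoted attributes, defined entities, properly nested tags) still have to hold before loadXML() will accept the page.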

Answer 1:

If the input document is valid XML, parsing it as XML rather than HTML will preserve the contents of the <script> tags:

<?php

$html = <<<END
<!DOCTYPE html>
<html><body>
<script type="client/template" id="foo-div"><div>#foo#</div></script>
</body></html>
END;

$doc = new DOMDocument();
$doc->preserveWhiteSpace = true; // needs to be before loading, to have any effect
$doc->loadXML($html);
$doc->formatOutput = false;
print $doc->saveHTML();

// <!DOCTYPE html>
// <html><body>
// <script type="client/template" id="foo-div"><div>#foo#</div></script>
// </body></html>
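
Since this approach only works when the page is well-formed XML, it can help to check that before relying on it. The following is a minimal sketch of such a check; the helper name load_as_xml and the sample strings are assumptions made for illustration, not part of the answer.

```php
<?php

/**
 * Hypothetical helper: try to parse $markup as XML. Returns the document,
 * or null if the markup is not well-formed; the libxml messages are
 * written to $errors either way.
 */
function load_as_xml(string $markup, array &$errors = []): ?DOMDocument
{
    // Collect parse errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->preserveWhiteSpace = true; // set before loading
    $ok = $doc->loadXML($markup);

    $errors = array_map(function (LibXMLError $e): string {
        return trim($e->message);
    }, libxml_get_errors());

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc : null;
}

$good = '<html><body><script type="client/template" id="foo-div">'
      . '<div>#foo#</div></script></body></html>';
$bad  = '<html><body><p>line one<br>line two</p></body></html>'; // unclosed <br>

var_dump(load_as_xml($good) !== null);        // bool(true)
var_dump(load_as_xml($bad, $errors) !== null); // bool(false)
print_r($errors); // libxml's explanation of why $bad was rejected
```

When load_as_xml() returns null, the caller can either repair the markup first (for example by self-closing void tags) or fall back to loadHTML() and accept that template contents may be altered.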


Answer 2:

When PHP's DOMDocument parses HTML, it uses some fail-safe techniques.
In the case of the script tag, two of them apply.

The first is special handling of script-tag content: because a <script> tag can't contain any other tags, everything inside it is treated as text.

The second is the general auto-close hack for HTML tags. When the parser finds a misplaced closing tag, it looks for the nearest matching opening tag among the ancestors and automatically closes every tag between that opening tag and the misplaced closing tag. If the parser can't find a matching opening tag, it simply ignores the closing tag.

You can see this if you try to parse code like <body><div><script type="client/template" id="foo-div"><div>#foo#</div>dfdf</script></div></body> - you'll get <body><div><script type="client/template" id="foo-div"><div>#foo#</script></div>dfdf</body> in the output.

There is no normal way to make DOMDocument parse HTML5 the way you want.
But you can use a simple hack: with a regular expression, replace every opening angle bracket < inside your script tags with &lt; (or with any other character sequence that is otherwise unused). After processing, reverse the substitution in the same way to get the original content back.
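
The substitution hack described above could be sketched roughly as follows. The function names, the %%LT%% placeholder, and the choice to match only type="client/template" are all assumptions made for this illustration; the placeholder must be a string that never occurs in the real page.

```php
<?php

// Hypothetical placeholder; assumed never to appear in the page itself.
const LT_TOKEN = '%%LT%%';

/**
 * Replace every "<" inside client/template script blocks with LT_TOKEN,
 * so the HTML parser sees no tags inside the template.
 */
function protect_templates(string $html): string
{
    // Groups: 1 = opening tag, 2 = quote character, 3 = body, 4 = closing tag.
    $pattern = '~(<script\b[^>]*\btype=(["\'])client/template\2[^>]*>)(.*?)(</script>)~is';

    $result = preg_replace_callback($pattern, function (array $m): string {
        return $m[1] . str_replace('<', LT_TOKEN, $m[3]) . $m[4];
    }, $html);

    return $result ?? $html; // preg_* functions return null on a regex error
}

/** Reverse protect_templates() on the serialised output. */
function restore_templates(string $html): string
{
    return str_replace(LT_TOKEN, '<', $html);
}

$page = '<body><script type="client/template" id="foo-div">'
      . '<div>#foo#</div></script></body>';

$protected = protect_templates($page);
echo $protected, "\n";
// <body><script type="client/template" id="foo-div">%%LT%%div>#foo#%%LT%%/div></script></body>
echo restore_templates($protected), "\n"; // identical to $page
```

In the proofing workflow, protect_templates() would be applied to the scraped HTML before $dom->loadHTML(), and restore_templates() to the string returned by $dom->saveHTML(). A plain ASCII token is used here rather than &lt; so that restoring cannot accidentally unescape entities elsewhere in the page.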