I've got a bit of a strange question here, but it's stumped me completely. As much as anything, this is because I can't think of the correct terms to search for, so this question may well be answered on StackOverflow somewhere but I can't find it.
We have a proofing system that allows us to take a page and annotate it. We can send the page to our clients and they can make notes on it before sending it back. For the most part, this works fine. The problem comes when we try to use a JavaScript template system, similar to Handlebars. We tend to have script templates on our page that look something like this:
<script type="client/template" id="foo-div">
<div>#foo#</div>
</script>
We can use that in our scripts to generate the markup within the template, replacing #foo# with the correct data.
The problem comes when we try to put that into our proofing system. Because we need to scrape the page so we can render it on our domain, we use PHP's DOMDocument to parse the HTML so we can modify it easily (adding things like target="_blank" to external links, etc.). When we try to run our templating through DOMDocument, it parses it strangely (probably seeing it as invalid XML) and that causes issues on the page. To better illustrate that, here's an example in PHP:
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$html = '<!DOCTYPE html>'.
    '<html>'.
    '<head></head>'.
    '<body>'.
    '<script type="client/template" id="foo-div"><div>#foo#</div></script>'.
    '</body>'.
    '</html>';

$dom = new DOMDocument();
// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

try {
    // loadHTML() returns false on failure; keep the result separate from the markup.
    $loaded = $dom->loadHTML($html);
} catch (Exception $e) {
    throw new Exception('Invalid HTML on the page has caused a parsing error');
}

if ($loaded === false) {
    throw new Exception('Unable to properly parse page');
}

$dom->preserveWhiteSpace = false;
$dom->formatOutput = false;

echo $dom->saveHTML();
This script produces output similar to the HTML below, in which the closing </div> inside the template has been dropped, and it doesn't seem to throw any exceptions.
<!DOCTYPE html>
<html>
<head></head>
<body><script type="client/template" id="foo-div"><div>#foo#</script></body>
</html>
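On the missing exceptions: as far as I understand it, loadHTML() reports markup problems through libxml's error handling rather than by throwing, and libxml_use_internal_errors(true) stores those errors instead of emitting warnings. Here is a minimal sketch for inspecting whatever the parser recorded. The $messages variable is just an illustrative name, and the exact messages (if any) depend on the libxml version installed:

```php
<?php
$html = '<!DOCTYPE html><html><head></head><body>' .
    '<script type="client/template" id="foo-div"><div>#foo#</div></script>' .
    '</body></html>';

// Store parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Turn each recorded LibXMLError into a readable line.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf(
        'line %d, column %d: %s',
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Empty the error buffer so later parses start clean.
libxml_clear_errors();

echo implode(PHP_EOL, $messages), PHP_EOL;
```

If the list is empty, the parser accepted the markup without complaint on that libxml build, so this is a diagnostic aid rather than a fix.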
My question is: does anybody know of a way that I can get PHP's DOMDocument to leave the templating script tag alone? Is there a setting or plugin that I can use to make DOMDocument see the contents of a script tag with a type attribute as plain text, much like browsers do?
Edit
I ended up going with Alf Eaton's solution of parsing the string as XML. However, not all of the void HTML tags (such as <br> or <img>) were self-closed, and that caused issues when parsing as XML. I'm posting the complete solution here in case anyone comes across the same issue:
/**
 * Inserts a new string into an old string at the specified position.
 *
 * @param string $old_string Old string to modify.
 * @param string $new_string New string to insert.
 * @param int $position Position at which the new string should be inserted.
 * @return string Old string with new string inserted.
 * @see http://stackoverflow.com/questions/8251426/insert-string-at-specified-position
 */
function str_insert($old_string, $new_string, $position) {
    return substr($old_string, 0, $position) . $new_string .
        substr($old_string, $position);
}
/**
 * Inspects a string of HTML and closes any tags that need self-closing in order
 * to make the HTML valid XML.
 *
 * @param string $html Raw HTML (potentially invalid XML)
 * @return string Original HTML with self-closing slashes added.
 */
function self_close($html) {
    $fixed = $html;
    $tags = array('area', 'base', 'basefont', 'br', 'col', 'frame',
        'hr', 'img', 'input', 'link', 'meta', 'param');
    foreach ($tags as $tag) {
        $offset = 0;
        while (($offset = strpos($fixed, '<' . $tag, $offset)) !== false) {
            // Only match when the tag name ends here, so that '<col' does not
            // also match '<colgroup' and '<frame' does not match '<frameset'.
            $next = (string) substr($fixed, $offset + strlen($tag) + 1, 1);
            if ($next !== '' && strpos(" \t\r\n/>", $next) !== false &&
                ($close = strpos($fixed, '>', $offset)) !== false &&
                $fixed[$close - 1] !== '/') {
                $fixed = str_insert($fixed, '/', $close);
            }
            $offset += 1; // Move past this match to prevent an infinite loop
        }
    }
    return $fixed;
}
// When parsing the original string:
$html = $dom->loadXML(self_close($html));
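For completeness, here is a rough end-to-end sketch of the XML approach. It assumes the markup is already well-formed XML, so it skips self_close(); on a real page the string would go through self_close() first. The sample link and the http(s) check are only illustrative stand-ins for our real "external link" logic:

```php
<?php
// Sample page that is already well-formed XML, so self_close() is not needed here.
$html = '<!DOCTYPE html>' .
    '<html><head></head><body>' .
    '<script type="client/template" id="foo-div"><div>#foo#</div></script>' .
    '<a href="http://example.com/">External link</a>' .
    '</body></html>';

// Store parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
if ($dom->loadXML($html) === false) {
    throw new Exception('Unable to parse page as XML');
}

// Illustrative modification: open absolute http(s) links in a new tab.
foreach ($dom->getElementsByTagName('a') as $link) {
    if (preg_match('#^https?://#i', $link->getAttribute('href'))) {
        $link->setAttribute('target', '_blank');
    }
}

// Serialise as HTML again; the template markup is kept as ordinary child nodes.
$output = $dom->saveHTML();
echo $output;
```

The trade-off is that the XML parser is strict: things like undefined entities (&nbsp;, for example) or unclosed tags make loadXML() fail, so this only works on pages that can be brought into well-formed shape first.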