Since libxml 2.9, loading external entities has been disabled when parsing XML, to prevent XXE attacks.
In that case, to be able to load a DTD file when parsing the XML with PHP's DOMDocument, LIBXML_DTDLOAD
must be specified.
What would be a good way to verify that only the expected DTD will be loaded, before enabling LIBXML_DTDLOAD
?
One approach I can think of (as shown in the example code below) would be to keep entity loading disabled, parse the XML file once, check that the DOCTYPE declaration is as expected, then parse the XML again with entity loading enabled. Would that be sufficient?
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article/>
XML;
// entity loading disabled
libxml_disable_entity_loader();
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD); // PHP Warning: DOMDocument::load(): I/O warning : failed to load external entity
print $doc->doctype->systemId; // http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd
// entity loading enabled
libxml_disable_entity_loader(false);
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);
print $doc->doctype->systemId; // http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd
What would be a good way to verify that only the expected DTD will be loaded, before enabling LIBXML_DTDLOAD
?
If you want to filter (whitelist) expected DTDs, you can do so by disabling all others by returning NULL
from your own callable that has been set as external entity loader via libxml_set_external_entity_loader
.
That is, you would use the LIBXML_DTDLOAD
flag and then resolve to a resource handle in your function in case the DTD is white-listed. In case not, you return said NULL
.
<?php
/**
* @link http://stackoverflow.com/q/24526493/367456
*/
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article/>
XML;
/* own entity loader */
libxml_set_external_entity_loader(function() {
var_dump(func_get_args()); // just for demonstrating purposes
return NULL;
});
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);
echo "----\n";
/* restore default entity loader */
libxml_set_external_entity_loader(NULL);
$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);
Example output:
array(3) {
[0]=>
string(66) "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN"
[1]=>
string(66) "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd"
[2]=>
array(4) {
["directory"]=>
string(1) "/"
["intSubName"]=>
string(7) "article"
["extSubURI"]=>
string(66) "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd"
["extSubSystem"]=>
string(66) "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN"
}
}
Warning: DOMDocument::loadXML(): Failed to load external entity "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" in Entity, line: 2 in /in/jemmH on line 18
----
Warning: DOMDocument::loadXML(): php_network_getaddresses: getaddrinfo failed: Name or service not known in /in/jemmH on line 25
Warning: DOMDocument::loadXML(http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd): failed to open stream: php_network_getaddresses: getaddrinfo failed: Name or service not known in /in/jemmH on line 25
Notice: DOMDocument::loadXML(): failed to load external entity "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd" in Entity, line: 2 in /in/jemmH on line 25
Your approach is seems good but in order to improve performence you may want to do White Listing on DOCTYPE declarations, if it pass then you can parse it with entity loading enabled.