How can I use PHP's various XML libraries to g

Posted 2019-01-22 10:02

Question:

I'm writing a web application that has an XML API in PHP, and I'm worried about three specific vulnerabilities, all related to inline DOCTYPE definitions: local file inclusion (LFI), quadratic entity blowup, and exponential entity blowup. I'd love to use PHP's (5.3) built-in libraries, but I want to make sure I'm not susceptible to these.

I found I can eliminate LFI with libxml_disable_entity_loader, but this doesn't help with inline ENTITY declarations, including entities that refer to other entities.

The SimpleXML library (SimpleXMLElement, simplexml_load_string, etc.) is great because it's a DOM parser and all my inputs are fairly small; it allows me to use xpath and manipulate the DOM pretty easily. However, I can't figure out how to stop ENTITY declarations. (I would be happy to disable all inline DOCTYPE definitions, if possible.)

The XML Parser library (xml_parser_create, xml_set_element_handler, etc.) allows me to set the default handler, which includes entities, with xml_set_default_handler. I can hack it so that for unrecognized entities it simply returns the original string (i.e., "&ent;"). This library is frustrating though: because it is a SAX parser, I have to write a bunch of handlers (as many as 9).

So is it possible to use the built-in libraries, get DOM-like objects out, and protect myself from these various DoS vulnerabilities? Thanks.

This page describes the three vulnerabilities, and provides a solution...if only I were using .NET: http://msdn.microsoft.com/en-us/magazine/ee335713.aspx

UPDATE:

<?php
$s = <<<EOF
<?xml version="1.0"?>
<!DOCTYPE data [
<!ENTITY en "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa....">
]>
<data>&en;&en;&en;&en;&en;&en;&en;&en;&en;&en;&en;&en;.....</data>
EOF;
$doc = new DOMDocument();
$doc->loadXML($s);
var_dump($doc->lastChild->nodeValue);
?>

I tried loadXML($s, LIBXML_NOENT); as well. In both cases I end up dumping 300+ MB. Is there something I'm still missing?

Answer 1:

Note: If you create test cases from files containing the XML chunks below, be aware that editors may be vulnerable to these attacks as well and might freeze or crash.

Billion Laughs

<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
  <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
  <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
  <!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;">
  <!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
  <!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
  <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>

When loading:

FATAL: #89: Detected an entity reference loop 1:7
... (the same error six more times, seven in total including the line above)
FATAL: #89: Detected an entity reference loop 14:13

Result:

<?xml version="1.0"?>

Memory usage is light, and DOMDocument does not move the peak. As this example produces seven fatal errors, one can conclude that the same document cut down to two levels of nesting loads without errors, and indeed it does:

<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
]>
<lolz>&lol2;</lolz>

As entity substitution is not in effect and this works, let's try the next attack.

Quadratic Blowup

Here it is, shortened for your viewing pleasure (my variants are about 27 KB and 11 KB):

<?xml version="1.0"?>
<!DOCTYPE kaboom [
  <!ENTITY a "aaaaaaaaaaaaaaaaaa...">
]>
<kaboom>&a;&a;&a;&a;&a;&a;&a;&a;&a;...</kaboom>

If you use $doc->loadXML($src, LIBXML_NOENT); this does work as an attack: as I write this, the script is still loading. So it takes a long time to load and consumes memory, which is something you can experiment with on your own. Without LIBXML_NOENT it loads flawlessly and fast.

But there is a caveat: if you read the nodeValue of an element, for example, you get the entities expanded even if you did not use that loading flag.
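A small sketch may make this caveat concrete. The document below is a tiny stand-in for the attack file; the element name, entity name and sizes are made up for illustration. It compares what is serialized with what is returned when the text is read:

```php
<?php
// Tiny stand-in for the attack document; entity "a" and its value are illustrative.
$xml = '<?xml version="1.0"?><!DOCTYPE d [<!ENTITY a "aaa">]><d>&a;&a;</d>';

$doc = new DOMDocument();
$doc->loadXML($xml); // no LIBXML_NOENT: references stay as entity-reference nodes

// Serializing writes the references back out without expanding them.
$serialized = $doc->saveXML($doc->documentElement);

// Reading the text content resolves the references on the fly.
$text = $doc->documentElement->nodeValue;

echo $serialized, PHP_EOL;
echo strlen($text), PHP_EOL;
```

With a large entity and many references, the second read is where the cost shows up, even though the load itself was cheap.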

A workaround for this issue is to remove the DocumentType node from the document. Note the following code:

$doc = new DOMDocument();
$doc->loadXML($s); // where $s is a Quadratic attack xml string above.
// now remove the doctype node
foreach ($doc->childNodes as $child) {
    if ($child->nodeType===XML_DOCUMENT_TYPE_NODE) {
        $doc->removeChild($child);
        break;
    }
}
// Now the following is true:
assert($doc->doctype === null);
assert($doc->lastChild->nodeValue === '...');
// Note that the entity references remain unexpanded in the output XML.
// This is not so good, since it makes the XML invalid (the entity is no longer declared).
// Better is a manual walk through all nodes looking for XML_ENTITY_REF_NODE.
assert($doc->saveXML() === "<?xml version=\"1.0\"?>\n<kaboom>&a;&a;&a;&a;&a;&a;&a;&a;&a;...</kaboom>\n");
// However, canonicalization will produce warnings because it must resolve entities:
assert($doc->C14N() === false);
// The warning will look like:
//    PHP Warning:  DOMNode::C14N(): Node XML_ENTITY_REF_NODE is invalid here

So while this workaround will prevent an XML document from consuming resources in a DoS, it makes it easy to generate invalid XML.
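If the API never needs a DOCTYPE, a stricter variant is to refuse such documents outright instead of stripping the node. The following is only a sketch under that assumption: loadXmlWithoutDoctype is a hypothetical helper name, not part of PHP, and it relies on the observation above that loading without LIBXML_NOENT is cheap.

```php
<?php
// Hypothetical helper (not a PHP built-in): parse XML and refuse any document
// that carries a DOCTYPE, so entity declarations never reach the application.
function loadXmlWithoutDoctype($xml)
{
    $previous = libxml_use_internal_errors(true); // collect parse errors quietly
    $doc = new DOMDocument();
    // LIBXML_NONET blocks network access; LIBXML_NOENT is deliberately absent.
    $ok = $doc->loadXML($xml, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        throw new InvalidArgumentException('Malformed XML');
    }
    foreach ($doc->childNodes as $child) {
        if ($child->nodeType === XML_DOCUMENT_TYPE_NODE) {
            throw new InvalidArgumentException('DOCTYPE declarations are not accepted');
        }
    }
    // Hand the tree to SimpleXML for xpath and easy traversal.
    return simplexml_import_dom($doc);
}

$data = loadXmlWithoutDoctype('<data><item>1</item></data>');
echo (string) $data->item, PHP_EOL;
```

Rejecting rather than repairing avoids the invalid-XML problem described above, at the price of turning away legitimate documents that happen to include a DOCTYPE.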

Some figures (I reduced the file size, otherwise it takes too long):

LIBXML_NOENT disabled                                          LIBXML_NOENT enabled

Mem: 356 184 (Peak: 435 464)                                   Mem: 356 280 (Peak: 435 464)                             
Loaded file quadratic-blowup-2.xml into string.                Loaded file quadratic-blowup-2.xml into string.          
Mem: 368 400 (Peak: 435 464)                                   Mem: 368 496 (Peak: 435 464)                             
DOMDocument loaded XML 11 881 bytes in 0.001368 secs.          DOMDocument loaded XML 11 881 bytes in 15.993627 secs.   
Mem: 369 088 (Peak: 435 464)                                   Mem: 369 184 (Peak: 435 464)                             
Removed load string.                                           Removed load string.                                     
Mem: 357 112 (Peak: 435 464)                                   Mem: 357 208 (Peak: 435 464)                             
Got XML (saveXML()), length: 11 880                            Got XML (saveXML()), length: 11 165 132                  
Got Text (nodeValue), length: 11 160 314; 11.060893 secs.      Got Text (nodeValue), length: 11 160 314; 0.025360 secs. 
Mem: 11 517 776 (Peak: 11 532 016)                             Mem: 11 517 872 (Peak: 22 685 360)                       

I have not made up my mind about protection strategies so far, but I now know that loading the billion laughs file into PHPStorm, for example, will freeze it. I stopped testing the latter because I didn't want to freeze it while writing this.



Answer 2:

You should actually test your application with sample documents and see if it is vulnerable.

The underlying library for PHP's XML libraries is libxml2. Its behavior is controlled from PHP mostly through optional constants, which most libraries accept as an argument when loading the XML.

You can determine your php's libxml2 version with echo LIBXML_DOTTED_VERSION;

In later versions (after 2.6), libxml2 contains entity substitution limits designed to prevent both exponential and quadratic attacks. These can be overridden with the LIBXML_PARSEHUGE option.

By default libxml2 does not load a dtd, add default attributes, or perform entity substitution. So the default behavior is to ignore dtds.

You can turn parts of this on like so:

  • LIBXML_DTDLOAD will load dtds.
  • LIBXML_NONET will disable network-loading of dtds. You should always have this on and use libxml's dtd catalog to load dtds.
  • LIBXML_DTDVALID will perform dtd validation while parsing.
  • LIBXML_NOENT will perform entity substitution.
  • LIBXML_DTDATTR will add default attributes.
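As a minimal sketch of how these constants are passed, the snippet below loads the same placeholder string with DOMDocument and SimpleXML. The input is made up for illustration; the flag choice (only LIBXML_NONET) is one conservative option, not a recommendation from the original answer.

```php
<?php
$xml = '<?xml version="1.0"?><data><item>1</item></data>'; // placeholder input

// Flags are combined with bitwise OR, e.g. LIBXML_NONET | LIBXML_DTDLOAD.
// Here only LIBXML_NONET is set; LIBXML_NOENT, LIBXML_DTDLOAD and
// LIBXML_PARSEHUGE are intentionally left out.
$options = LIBXML_NONET;

$doc = new DOMDocument();
$doc->loadXML($xml, $options);

// SimpleXML takes the same options as its third argument.
$sx = simplexml_load_string($xml, 'SimpleXMLElement', $options);

echo $doc->documentElement->nodeName, PHP_EOL;
echo (string) $sx->item, PHP_EOL;
```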

So using the default settings PHP/libxml2 are probably not vulnerable to any of these issues, but the only way to know for sure is to test.
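One way to run such a test is sketched below. buildQuadraticXml is a hypothetical helper, and the sizes are kept deliberately small; they are arbitrary and should be raised gradually on a machine you can afford to stall.

```php
<?php
// Hypothetical helper: build a quadratic-blowup document with one entity of
// $entityLen bytes that is referenced $refs times.
function buildQuadraticXml($entityLen, $refs)
{
    return '<?xml version="1.0"?>' . "\n"
        . '<!DOCTYPE kaboom [<!ENTITY a "' . str_repeat('a', $entityLen) . '">]>' . "\n"
        . '<kaboom>' . str_repeat('&a;', $refs) . '</kaboom>';
}

$xml = buildQuadraticXml(100, 100); // small on purpose; sizes are arbitrary

$start = microtime(true);
$doc = new DOMDocument();
$loaded = $doc->loadXML($xml, LIBXML_NONET); // default behavior, no LIBXML_NOENT
$elapsed = microtime(true) - $start;

// Report the libxml2 version alongside timing and memory so runs are comparable.
printf("libxml2 %s\n", LIBXML_DOTTED_VERSION);
printf(
    "input: %d bytes, loaded: %s, %.6f secs, peak memory: %d bytes\n",
    strlen($xml),
    $loaded ? 'yes' : 'no',
    $elapsed,
    memory_get_peak_usage()
);
```

Repeating the run with the flags your application actually passes, and with the billion laughs document from the first answer, shows whether your particular PHP and libxml2 build holds up.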