I need to dynamically construct an XPath query for an element attribute, where the attribute value is provided by the user. I'm unsure how to go about cleaning or sanitizing this value to prevent the XPath equivalent of a SQL injection attack. For example (in PHP):
<?php
function xPathQuery($attr) {
    $xml = simplexml_load_file('example.xml');
    return $xml->xpath("//myElement[@content='{$attr}']");
}
xPathQuery('This should work fine');
# //myElement[@content='This should work fine']
xPathQuery('As should "this"');
# //myElement[@content='As should "this"']
xPathQuery('This\'ll cause problems');
# //myElement[@content='This'll cause problems']
xPathQuery('\']/../privateElement[@content=\'private data');
# //myElement[@content='']/../privateElement[@content='private data']
The last one in particular is reminiscent of the SQL injection attacks of yore.
Now, I know for a fact there will be attributes containing single quotes and attributes containing double quotes. Since these are provided as an argument to a function, what would be the ideal way to sanitize the input for these?
Ok, what does it do?
It encodes all occurrences of & and " as &amp; and &quot; in the string, which should give you a safe selector for that particular use. Note that I also replaced the inner ' in the XPath with ". EDIT: It has since been pointed out that ' can be escaped as &apos;, so you could use whichever string quoting method you prefer.
I'd create a single-element XML document using a DOM, use the DOM to set the element's text to the provided value, and then grab the text out of the DOM's string representation of the XML. This will guarantee that all of the character escaping is done properly, and not just the character escaping that I'm happening to think about offhand.
Edit: The reason I would use the DOM in situations like this is that the people who wrote the DOM have read the XML recommendation and I haven't (at least, not with the level of care they have). To pick a trivial example, the DOM will report a parse error if the text contains a character that XML doesn't allow (like #x8), because the DOM's authors have implemented section 2.2 of the XML recommendation.
Now, I might say, "well, I'll just get the list of invalid characters from the XML recommendation, and strip them out of the input." Sure. Let's just look at the XML recommendation and...um, what the heck are the Unicode surrogate blocks? What kind of code do I have to write to get rid of them? Can they even get into my text in the first place?
Let's suppose I figure that out. Are there other aspects of how the XML recommendation specifies character representations that I don't know about? Probably. Will these have an impact on what I'm trying to implement? Maybe.
If I let the DOM do the character encoding for me, I don't have to worry about any of that stuff.
XPath does actually include a method of doing this safely, in that it permits variable references in the form $varname in expressions. The library on which PHP's SimpleXML is based provides an interface to supply variables; however, this is not exposed by the xpath function in your example. As a demonstration of really how simple this can be:
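A minimal sketch of such a call, assuming lxml is installed. The sample document and the helper name find_by_content are made up for illustration:

```python
from lxml import etree

# Made-up sample document, modelled on the element names in the question.
doc = etree.fromstring(
    """<root>
  <myElement content="This'll cause problems"/>
  <myElement content='As should "this"'/>
  <privateElement content="private data"/>
</root>"""
)


def find_by_content(tree, value):
    # $attr is an XPath variable reference; lxml binds it from the keyword
    # argument, so the value is never parsed as part of the expression.
    return tree.xpath("//myElement[@content=$attr]", attr=value)


print(len(find_by_content(doc, "This'll cause problems")))  # 1
print(len(find_by_content(doc, 'As should "this"')))        # 1
# The injection attempt is treated as an ordinary string and matches nothing.
print(len(find_by_content(doc, "']/../privateElement[@content='private data")))  # 0
```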
That's using lxml, a Python wrapper for the same underlying library as SimpleXML, with a similar xpath function. Booleans, numbers, and node-sets can also be passed directly.
If switching to a more capable XPath interface is not an option, a workaround when given an external string would be something along these lines (feel free to adapt to PHP):
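A minimal sketch of such a helper in Python. The name xpath_string_literal is made up here, and XPath 1.0 string literals are assumed, which have no escape character:

```python
def xpath_string_literal(value: str) -> str:
    """Return value as an XPath 1.0 string literal, or a concat() call."""
    # A literal may be delimited by either quote type, so use whichever
    # one does not occur in the value.
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types occur: split on single quotes, wrap each piece in
    # single quotes, and rejoin with a double-quoted single quote.
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"
```

Splitting always yields at least two pieces in the last branch, so concat() receives the two or more arguments it requires.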
The return value can be directly inserted in your expression string. As that's not actually very readable, here is how it behaves:
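The behaviour can be sketched with the inputs from the question plus one value containing both quote types. The helper here is a hypothetical, compact stand-in so that the snippet runs on its own:

```python
def xpath_string_literal(value):
    # Hypothetical helper: quote with whichever quote type is absent,
    # falling back to concat() when both appear in the value.
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    return "concat('" + value.replace("'", "', \"'\", '") + "')"


samples = [
    "This should work fine",
    'As should "this"',
    "This'll cause problems",
    "']/../privateElement[@content='private data",
    'Both "kinds" aren\'t a problem',
]
for s in samples:
    print("//myElement[@content=%s]" % xpath_string_literal(s))

# Prints:
# //myElement[@content='This should work fine']
# //myElement[@content='As should "this"']
# //myElement[@content="This'll cause problems"]
# //myElement[@content="']/../privateElement[@content='private data"]
# //myElement[@content=concat('Both "kinds" aren', "'", 't a problem')]
```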
Note, you can't use escaping in the form &apos; outside of an XML document, nor are generic XML serialisation routines applicable. However, the XPath concat function can be used to create a string with both types of quotes in any context. PHP variant: