Since I have run into this annoying issue for the second time, I thought asking would help.
Sometimes I have to get Elements from XML documents, but the ways to do this are awkward.
I’d like to know of a Python library that does what I want, an elegant way to formulate my XPaths, a way to register the document’s namespace prefixes automatically, or a hidden option in the built-in XML implementations or in lxml to strip namespaces completely. Clarification follows, unless you already know what I want :)
Example-doc:
<root xmlns="http://really-long-namespace.uri"
xmlns:other="http://with-ambivalent.end/#">
<other:elem/>
</root>
What I can do
The ElementTree API is the only built-in one (that I know of) providing XPath queries, but it requires me to use “UNames”, which look like this: /{http://really-long-namespace.uri}root/{http://with-ambivalent.end/#}elem
As you can see, these are quite verbose. I can shorten them by doing the following:
default_ns = "http://really-long-namespace.uri"
other_ns = "http://with-ambivalent.end/#"
doc.find("/{{{0}}}root/{{{1}}}elem".format(default_ns, other_ns))
But this is both {{{ugly}}} and fragile, since http…end/# ≃ http…end# ≃ http…end/ ≃ http…end, and who am I to know which variant will be used?
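To make the fragility concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree and the example document from above. It queries relative to the root element, which is an assumption of this sketch rather than the exact call shown above, and the names `root`, `elem` and `missing` are made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
      xmlns:other="http://with-ambivalent.end/#">
  <other:elem/>
</root>"""

default_ns = "http://really-long-namespace.uri"
other_ns = "http://with-ambivalent.end/#"

root = ET.fromstring(DOC)

# ElementTree stores every tag in "{namespace-uri}local-name" form.
root_tag = root.tag

# Querying with the exact namespace URI finds the child element.
elem = root.find("{%s}elem" % other_ns)

# Dropping the trailing "/#" yields a different URI, so nothing matches.
missing = root.find("{%s}elem" % other_ns.rstrip("/#"))

print(root_tag, elem is not None, missing is None)
```

The namespace URI is compared as a plain string, so any of the look-alike variants silently matches nothing instead of raising an error.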
Also, lxml supports namespace prefixes, but it neither uses the ones in the document nor provides an automated way to deal with default namespaces. I would still have to get one element of each namespace to retrieve it from the document. Namespace attributes are not preserved, so there is no way of automatically retrieving the namespaces from those either.
There is also a namespace-agnostic way of writing XPath queries, but it is both verbose/ugly and unavailable in the built-in implementation: /*[local-name() = 'root']/*[local-name() = 'elem']
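Since local-name() is not available in ElementTree, a comparable namespace-agnostic effect could be approximated by stripping the namespace part from every tag after parsing. The following is only a sketch under that assumption: `strip_namespaces` is a hypothetical helper, not part of any library, and it leaves attribute names untouched.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
      xmlns:other="http://with-ambivalent.end/#">
  <other:elem/>
</root>"""


def strip_namespaces(root):
    """Hypothetical helper: rewrite "{uri}name" tags to plain "name" in place."""
    for node in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(DOC))

# After stripping, plain local names are enough for the query.
elem = root.find("elem")

print(root.tag, elem.tag)
```

The obvious trade-off is that elements with the same local name in different namespaces can no longer be told apart.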
What I want to do
I want to find a library, option or generic XPath-morphing function to achieve the above examples by typing little more than the following…
- Unnamespaced:
/root/elem
- Namespace-prefixes from document:
/root/other:elem
…plus maybe some statement that I indeed want to use the document’s prefixes or strip the namespaces.
Further clarification: although my current use case is as simple as that, I will have to handle more complex ones in the future.
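For illustration, the "prefixes from the document" variant might be approximated with the standard library alone by collecting the namespace declarations while parsing and handing them to find(). This is a sketch, not a finished solution: `parse_with_prefixes` is a hypothetical helper, and it simply keeps the first URI seen for each prefix.

```python
import io
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
      xmlns:other="http://with-ambivalent.end/#">
  <other:elem/>
</root>"""


def parse_with_prefixes(source):
    """Hypothetical helper: return (root element, {prefix: uri}) for a document."""
    nsmap = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload
            # The default namespace is reported with an empty prefix.
            nsmap.setdefault(prefix, uri)
        elif root is None:
            # The first "start" event carries the root element.
            root = payload
    return root, nsmap


root, nsmap = parse_with_prefixes(io.BytesIO(DOC.encode("utf-8")))

# The document's own prefix can now be used in the query.
elem = root.find("other:elem", nsmap)

print(nsmap, elem.tag)
```

Keeping only the first declaration per prefix is a simplification; a document that rebinds a prefix further down the tree would need per-element handling.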
Thanks for reading!
Solved
The user samplebias directed my attention to py-dom-xpath, which is exactly what I was looking for. My actual code now looks like this:
import xml.dom.minidom

import xpath  # provided by py-dom-xpath

# parse the document into a DOM tree
rdf_tree = xml.dom.minidom.parse("install.rdf")
# read the default namespace and prefixes from the root node
context = xpath.XPathContext(rdf_tree)
name = context.findvalue("//em:id", rdf_tree)
version = context.findvalue("//em:version", rdf_tree)
# <Description/> inherits the default RDF namespace
resource_nodes = context.find("//Description/following-sibling::*", rdf_tree)
Consistent with the document, simple, namespace-aware; perfect.