XPath in SimpleXML for default namespaces without needing prefixes

Posted 2019-01-11 20:15

I have an XML document that has a default namespace attached to it, eg

<foo xmlns="http://www.example.com/ns/1.0">
...
</foo>

In reality this is a complex XML document that conforms to a complex schema. My job is to parse out some data from it. To aid me, I have a spreadsheet of XPath. The XPath is rather deeply nested, eg

level1/level2/level3[@foo="bar"]/level4[@foo="bar"]/level5/level6[2]

The person who generated the XPath is an expert in the schema, so I am going with the assumption that I can't simplify it or use object traversal shortcuts.

I am using SimpleXML to parse everything out. My problem has to do with how the default namespace gets handled.

Since there is a default namespace on the root element, I can't just do

$xml = simplexml_load_file($somepath);
$node = $xml->xpath('level1/level2/level3[@foo="bar"]/level4[@foo="bar"]/level5/level6[2]');

I have to register the namespace, assign it to a prefix, and then use the prefix in my XPath, eg

$xml = simplexml_load_file($somepath);
$xml->registerXPathNamespace('myns', 'http://www.example.com/ns/1.0');
$node = $xml->xpath('myns:level1/myns:level2/myns:level3[@foo="bar"]/myns:level4[@foo="bar"]/myns:level5/myns:level6[2]');

Adding the prefixes isn't going to be manageable in the long run.

Is there a proper way to handle default namespaces without needing to use prefixes with XPath?

Using an empty prefix doesn't work ($xml->registerXPathNamespace('', 'http://www.example.com/ns/1.0');). I can strip out the default namespace, eg

$xml = file_get_contents($somepath);
$xml = str_replace('xmlns="http://www.example.com/ns/1.0"', '', $xml);
$xml = simplexml_load_string($xml);

but that is skirting the issue.

3 answers
不美不萌又怎样
#2 · 2019-01-11 20:34

From a bit of reading online, this limitation is not specific to PHP or to any particular library; it comes from XPath itself, at least in XPath version 1.0.

XPath 1.0 does not include any concept of a "default" namespace, so regardless of how the element names appear in the XML source, if they have a namespace bound to them, the selectors for them must be prefixed in basic XPath selectors of the form ns:name. Note that ns is a prefix defined within the XPath processor, not by the document being processed, so has no relationship to how xmlns attributes are used in the XML representation.
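To illustrate that the prefix belongs to the XPath processor rather than to the document, here is a minimal sketch. The sample document and the prefix `x` are made up for the illustration; only the namespace URI has to match.

```php
<?php
// Hypothetical sample document that uses a default namespace.
$source = <<<XML
<foo xmlns="http://www.example.com/ns/1.0">
  <level1><level2>hello</level2></level1>
</foo>
XML;

$xml = simplexml_load_string($source);

// The unprefixed path asks for elements in no namespace, so it finds nothing.
$unprefixed = $xml->xpath('level1/level2');

// The prefix name is arbitrary; it only needs to be bound to the same URI.
$xml->registerXPathNamespace('x', 'http://www.example.com/ns/1.0');
$prefixed = $xml->xpath('x:level1/x:level2');

echo 'unprefixed matches: ', count($unprefixed), "\n";
echo 'prefixed matches: ', count($prefixed), "\n";
```

The document itself never mentions `x`, yet the second query should still select the element, because matching is done on the namespace URI and local name.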

See e.g. this "common XSLT mistakes" page, talking about the closely related XSLT 1.0:

To access namespaced elements in XPath, you must define a prefix for their namespace. [...] Unfortunately, XSLT version 1.0 has no concept similar to a default namespace; therefore, you must repeat namespace prefixes again and again.

According to an answer to a similar question, XPath 2.0 does include a notion of "default namespace", and the XSLT page linked above mentions this also in the context of XSLT 2.0.

Unfortunately, all of the built-in XML extensions in PHP are built on top of the libxml2 and libxslt libraries, which support only version 1.0 of XPath and XSLT.

So other than pre-processing the document not to use namespaces, your only option would be to find an XPath 2.0 processor that you could plug in to PHP.

(As an aside, it's worth noting that if you have unprefixed attributes in your XML document, they are not technically in the default namespace, but rather in no namespace at all; see XML Namespaces and Unprefixed Attributes for discussion of this oddity of the Namespace spec.)
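A small sketch of what that means in practice, assuming a made-up document with one unprefixed attribute: the element needs the prefix in the XPath, but the attribute does not, and prefixing the attribute makes the predicate fail.

```php
<?php
// Hypothetical sample: the element is in the default namespace,
// the unprefixed attribute "foo" is in no namespace.
$source = <<<XML
<root xmlns="http://www.example.com/ns/1.0">
  <item foo="bar">first</item>
</root>
XML;

$xml = simplexml_load_string($source);
$xml->registerXPathNamespace('x', 'http://www.example.com/ns/1.0');

// Attribute written without a prefix: this is the form that matches.
$plainAttribute = $xml->xpath('x:item[@foo="bar"]');

// Attribute written with the prefix: looks for a namespaced attribute instead.
$prefixedAttribute = $xml->xpath('x:item[@x:foo="bar"]');

echo 'plain attribute matches: ', count($plainAttribute), "\n";
echo 'prefixed attribute matches: ', count($prefixedAttribute), "\n";
```

This is why the predicates in the question's prefixed expression (`[@foo="bar"]`) stay unchanged while the element names gain `myns:`.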

仙女界的扛把子
#3 · 2019-01-11 20:42

In order to avoid hacks like the str_replace one you have there (and I would recommend avoiding that), you can run the XML files through an XSLT to strip out the namespace:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:myns="http://www.example.com/ns/1.0">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>

  <xsl:template match="myns:*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

When run on either of these inputs:

<foo xmlns="http://www.example.com/ns/1.0">
  <a>
    <child attr="5"></child>
  </a>
</foo>

<ex:foo xmlns:ex="http://www.example.com/ns/1.0">
  <ex:a>
    <ex:child attr="5"></ex:child>
  </ex:a>
</ex:foo>

The output is the same:

<foo>
  <a>
    <child attr="5" />
  </a>
</foo>

This would allow you to use your prefix-less XPaths on the result.
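For completeness, a rough sketch of how the transform could be applied from PHP before handing the result to SimpleXML. It assumes the xsl extension (XSLTProcessor) is installed; the variable names and the sample input are only for illustration.

```php
<?php
// The stylesheet from above, kept in a nowdoc so PHP does not interpolate it.
$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:myns="http://www.example.com/ns/1.0">
  <xsl:template match="@* | node()">
    <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="myns:*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Sample input using the default namespace.
$source = <<<'XML'
<foo xmlns="http://www.example.com/ns/1.0">
  <a><child attr="5"></child></a>
</foo>
XML;

$xslDoc = new DOMDocument();
$xslDoc->loadXML($stylesheet);

$processor = new XSLTProcessor();   // requires ext-xsl
$processor->importStylesheet($xslDoc);

$inputDoc = new DOMDocument();
$inputDoc->loadXML($source);

// Transform to a new DOM document whose elements are in no namespace.
$stripped = $processor->transformToDoc($inputDoc);
$xml = simplexml_import_dom($stripped);

// The prefix-less XPath can now be used as-is.
$nodes = $xml->xpath('a/child[@attr="5"]');
echo 'matches: ', count($nodes), "\n";
```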

ら.Afraid
#4 · 2019-01-11 20:51

Is there a proper way to handle default namespaces without needing to use prefixes with XPath?

No. The proper way to handle any namespace is to associate some value (a prefix) with that namespace so that it can be explicitly selected in the XPath expression. The default namespace is no different.

Think about it this way: an element in some namespace and another element with the same name in some other namespace (or no namespace at all) are different elements. They could mean (i.e. represent) different things. That's the whole point. You need to tell XPath which one you want to select. Without it, XPath doesn't know what you're asking for.

Adding the prefixes isn't going to be manageable in the long run.

I really don't see why. Whatever creates the XPath expression should be capable of specifying a proper XPath expression (or it's a broken tool).
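If the expressions really do arrive unprefixed (from a spreadsheet, say), one option is to add the prefix mechanically instead of by hand. The helper below is a hypothetical sketch, not a full XPath parser: `prefixXPath` is a made-up name, and it assumes abbreviated syntax only (no explicit axes such as `child::`), no function calls as steps, and it does not touch element names that appear inside predicates.

```php
<?php
/**
 * Hypothetical helper: adds $prefix to each location step that begins with
 * a plain, unprefixed element name. Slashes inside [...] or inside quoted
 * strings are not treated as step separators.
 */
function prefixXPath(string $path, string $prefix): string
{
    $steps = [];
    $current = '';
    $depth = 0;     // nesting level of [...] predicates
    $quote = null;  // currently open string delimiter, if any

    foreach (str_split($path) as $char) {
        if ($quote !== null) {
            if ($char === $quote) {
                $quote = null;
            }
        } elseif ($char === '"' || $char === "'") {
            $quote = $char;
        } elseif ($char === '[') {
            $depth++;
        } elseif ($char === ']') {
            $depth--;
        } elseif ($char === '/' && $depth === 0) {
            $steps[] = $current;   // step boundary outside any predicate
            $current = '';
            continue;
        }
        $current .= $char;
    }
    $steps[] = $current;

    foreach ($steps as $i => $step) {
        // Skip steps that are empty, already prefixed, use an axis,
        // are function-like (text()), or start with *, ., .. or @.
        if (preg_match('/^[A-Za-z_][\w.-]*(?![\w.-]*[:(])/', $step)) {
            $steps[$i] = $prefix . ':' . $step;
        }
    }

    return implode('/', $steps);
}

$expr = 'level1/level2/level3[@foo="bar"]/level4[@foo="bar"]/level5/level6[2]';
echo prefixXPath($expr, 'myns'), "\n";
```

The result can then be passed to `$xml->xpath()` after calling `$xml->registerXPathNamespace('myns', 'http://www.example.com/ns/1.0')`. Anything beyond simple paths like the one in the question would need a real XPath tokenizer.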

You might be thinking, "why can't I just ignore the namespace and get all elements matching that name?" There are really hacky ways to do this (like the XSLT-based answer already posted), but they are broken by design. An element in XML is identified by the combination of its namespace and local name, just as your house can be identified with a street number (the local name) in some city and state (the namespace). If I tell you that I live at 422 Main St, then you still have no idea where I live until I tell you which city and state.

You still might be thinking, "enough with the stupid analogies, I really, really want to do this anyway." You can select elements with a given name across all namespaces by matching only the local name portion of the element, like this:

*[local-name()='level1']/*[local-name()='level2']
    /*[local-name()='level3' and @foo="bar"]/*[local-name()='level4' and
        @foo="bar"]/*[local-name()='level5']/*[local-name()='level6'][2]

Note that this does not restrict itself to the default namespace. It ignores namespaces entirely. It's ugly and I don't recommend it, but sometimes you just want to ignore what's best and get something done.
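A minimal sketch of that approach in SimpleXML, using a made-up two-level document rather than the full path from the question. No namespace registration is needed, because the expression never names a namespace.

```php
<?php
// Hypothetical sample document in a default namespace.
$source = <<<XML
<foo xmlns="http://www.example.com/ns/1.0">
  <level1>
    <level2 foo="bar">one</level2>
    <level2 foo="baz">two</level2>
  </level1>
</foo>
XML;

$xml = simplexml_load_string($source);

// Match on local names only; the namespace of each element is ignored.
$nodes = $xml->xpath(
    '*[local-name()="level1"]/*[local-name()="level2" and @foo="bar"]'
);

foreach ($nodes as $node) {
    echo (string) $node, "\n";
}
```

Keep in mind that the same expression would also match `level1` and `level2` elements from any other namespace if the document contained them.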

By the way, this is not PHP's fault. This is what the XPath spec requires. You have to specify a prefix to select a node in a namespace. If PHP were to allow you to do it some other way, then whatever they called it, it would no longer be XPath (according to the spec).
