Why does XML::LibXML find no nodes for this XPath?

Posted 2020-03-19 04:04

Question:

I'm attempting to select a node using an XPath query, and I don't understand why XML::LibXML doesn't find the node when it has an xmlns attribute. Here's a script to demonstrate the issue:

#!/usr/bin/perl

use XML::LibXML; # 1.70 on libxml2 from libxml2-dev 2.6.16-7sarge1 (don't ask)
use XML::XPath;  # 1.13
use strict;
use warnings;

use v5.8.4; # don't ask

my ($xpath, $libxml, $use_namespace) = @ARGV;

my $xml = sprintf(<<'END_XML', ($use_namespace ? 'xmlns="http://www.w3.org/2000/xmlns/"' : q{}));
<?xml version="1.0" encoding="iso-8859-1"?>
<RootElement>
  <MyContainer %s>
    <MyField>
        <Name>ID</Name>
        <Value>12345</Value>
    </MyField>
    <MyField>
        <Name>Name</Name>
        <Value>Ben</Value>
    </MyField>
  </MyContainer>
</RootElement>
END_XML

my $xml_parser
    = $libxml ? XML::LibXML->load_xml(string => $xml, keep_blanks => 1)
    :           XML::XPath->new(xml => $xml);

my $nodecount = 0;
foreach my $node ($xml_parser->findnodes($xpath)) {
    $nodecount ++;
    print "--NODE $nodecount--\n"; #would use say on newer perl
    print $node->toString($libxml && 1), "\n";
}

unless ($nodecount) {
    print "NO NODES FOUND\n";
}

This script lets you choose between the XML::LibXML parser and the XML::XPath parser. Depending on the arguments passed, it also either defines an xmlns attribute on the MyContainer element or leaves it off.

The XPath expression I'm using is "RootElement/MyContainer". When I run the query using the XML::LibXML parser without the namespace, it finds the node with no problem:

benb@enkidu:~$ ROC/ECG/libxml_xpath.pl 'RootElement/MyContainer' libxml
--NODE 1--
<MyContainer>
    <MyField>
        <Name>ID</Name>
        <Value>12345</Value>
    </MyField>
    <MyField>
        <Name>Name</Name>
        <Value>Ben</Value>
    </MyField>
  </MyContainer>

However, when I run it with the namespace in place it finds no nodes:

benb@enkidu:~$ ROC/ECG/libxml_xpath.pl 'RootElement/MyContainer' libxml use_namespace
NO NODES FOUND

Contrast this with the output when using the XML::XPath parser:

benb@enkidu:~$ ROC/ECG/libxml_xpath.pl 'RootElement/MyContainer' 0 # no namespace
--NODE 1--
<MyContainer>
    <MyField>
        <Name>ID</Name>
        <Value>12345</Value>
    </MyField>
    <MyField>
        <Name>Name</Name>
        <Value>Ben</Value>
    </MyField>
  </MyContainer>
benb@enkidu:~$ ROC/ECG/libxml_xpath.pl 'RootElement/MyContainer' 0 1 # with namespace
--NODE 1--
<MyContainer xmlns="http://www.w3.org/2000/xmlns/">
    <MyField>
        <Name>ID</Name>
        <Value>12345</Value>
    </MyField>
    <MyField>
        <Name>Name</Name>
        <Value>Ben</Value>
    </MyField>
  </MyContainer>

Which of these parser implementations is doing it "right"? Why does XML::LibXML treat it differently when I use a namespace? What can I do to retrieve the node when the namespace is in place?

Answer 1:

This is a FAQ. XPath considers any unprefixed name in an expression to belong to "no namespace".

Then, the expression:

RootElement/MyContainer

selects all MyContainer elements that belong to "no namespace" and are children of RootElement elements that belong to "no namespace" and are children of the context (current) node. In the document with the xmlns attribute, RootElement is still in "no namespace", but MyContainer and all of its descendants belong to the default namespace declared on MyContainer. There is therefore no MyContainer element in "no namespace" for the expression to select.
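
To see which namespace each element actually belongs to, one can walk the parsed tree and print namespaceURI for every element. This is only a minimal sketch: the URI http://example.com/ns is a hypothetical stand-in for the namespace in the question, and the document is trimmed to a single field.

use strict;
use warnings;
use XML::LibXML;

# Hypothetical namespace URI, used only for illustration.
my $xml = <<'END_XML';
<RootElement>
  <MyContainer xmlns="http://example.com/ns">
    <MyField><Name>ID</Name></MyField>
  </MyContainer>
</RootElement>
END_XML

my $doc = XML::LibXML->load_xml(string => $xml);

# '//*' matches every element regardless of namespace.
# namespaceURI returns undef for an element in "no namespace".
my @report;
foreach my $el ($doc->findnodes('//*')) {
    my $ns = $el->namespaceURI;
    push @report, sprintf('%s => %s', $el->nodeName,
                          defined $ns ? $ns : '(no namespace)');
}
print "$_\n" for @report;

RootElement should be reported as being in no namespace, while MyContainer, MyField and Name all report the declared URI, because a default namespace declaration applies to the element that carries it and to its descendants.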

This explains the result you are getting. XML::LibXML is right.

The common solution is that the API of the hosting language allows a specific prefix to be bound to the namespace by "registering" a namespace. Then one can use an expression like:

x:RootElement/x:MyContainer

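where x is the prefix with which the namespace has been registered. Prefix only the steps whose elements are actually in that namespace: in the question's document the xmlns attribute sits on MyContainer, so RootElement stays unprefixed and the matching expression is RootElement/x:MyContainer.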
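
With XML::LibXML, the registration is done through XML::LibXML::XPathContext and its registerNs method. The following is a rough sketch rather than a definitive recipe: the namespace URI http://example.com/ns and the prefix x are assumptions chosen for illustration, and the declaration is placed on MyContainer as in the question, so only that step takes the prefix.

use strict;
use warnings;
use XML::LibXML;

# Hypothetical namespace URI, used only for illustration.
my $ns_uri = 'http://example.com/ns';

my $xml = <<"END_XML";
<RootElement>
  <MyContainer xmlns="$ns_uri">
    <MyField><Name>ID</Name><Value>12345</Value></MyField>
  </MyContainer>
</RootElement>
END_XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Bind the prefix "x" to the namespace for XPath evaluation only.
# The prefix does not have to appear anywhere in the document.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(x => $ns_uri);

# Unprefixed step: looks for MyContainer in "no namespace".
my @unprefixed = $doc->findnodes('/RootElement/MyContainer');

# Prefixed step: looks for MyContainer in the registered namespace.
my @prefixed = $xpc->findnodes('/RootElement/x:MyContainer');

printf "unprefixed: %d, prefixed: %d\n",
    scalar @unprefixed, scalar @prefixed;

The prefix used in the expression is independent of whatever prefix (or default declaration) the document uses; only the namespace URI has to match.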

On the very rare occasions when the hosting language doesn't offer a way to register namespaces, use the following expression:

*[name()='RootElement']/*[name()='MyContainer']
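
As a hedged illustration of this fallback, the sketch below runs the name() form and a local-name() variant against two small documents. The local-name() variant is not part of the answer above; it is shown as an assumption-labelled alternative because name() compares against the qualified name as written in the document, so it stops matching once the document uses a prefix. The namespace URI and the prefix p are hypothetical.

use strict;
use warnings;
use XML::LibXML;

# Same (hypothetical) namespace, declared two different ways.
my %xml_for = (
    default  => '<RootElement><MyContainer xmlns="http://example.com/ns"/></RootElement>',
    prefixed => '<RootElement><p:MyContainer xmlns:p="http://example.com/ns"/></RootElement>',
);

# Predicate-based expressions that need no namespace registration.
my %xpath_for = (
    name       => q{/*[name()='RootElement']/*[name()='MyContainer']},
    local_name => q{/*[local-name()='RootElement']/*[local-name()='MyContainer']},
);

# Count the matches for every document/expression combination.
my %count;
foreach my $doc_kind (sort keys %xml_for) {
    my $doc = XML::LibXML->load_xml(string => $xml_for{$doc_kind});
    foreach my $expr_kind (sort keys %xpath_for) {
        my @nodes = $doc->findnodes($xpath_for{$expr_kind});
        $count{$doc_kind}{$expr_kind} = scalar @nodes;
        print "$doc_kind document, $expr_kind(): ", scalar @nodes, " node(s)\n";
    }
}

Both forms ignore the namespace URI entirely, so they can also match same-named elements from unrelated namespaces; registering the namespace remains the more precise option when it is available.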


Answer 2:

@Dmitre is right. Take a look at XML::LibXML::XPathContext, which lets you declare the namespace so that you can use namespace-aware XPath expressions. I gave an example of this some time ago on Stack Overflow; have a look at "Why should I use XPathContext with Perl's XML::LibXML".



Answer 3:

Using XML::LibXML 1.69.

Maybe this is an XML::LibXML 1.69 thing, but the strange part is that I can use a normal XPath expression with findnodes(), and the code below prints the nodes.

use strict;
use XML::LibXML;

my $xml = <<END_XML;
<?xml version="1.0" encoding="iso-8859-1"?>
<RootElement>
   <MyContainer xmlns="http://www.w3.org/2000/xmlns/">
    <MyField>
        <Name>ID</Name>
        <Value>12345</Value>
    </MyField>
    <MyField>
        <Name>Name</Name>
        <Value>Ben</Value>
    </MyField>
  </MyContainer>
</RootElement>
END_XML

my $parser = XML::LibXML->new();

$parser->recover_silently(1);

my $doc = $parser->parse_string($xml);

my $root = $doc->documentElement();

foreach my $node ($root->findnodes('MyContainer/MyField')) {
     print $node->toString();
}

But if I change the namespace to something other than "http://www.w3.org/2000/xmlns/", then XML::LibXML::XPathContext is required to get the same nodes to print.

use strict;
use XML::LibXML;

my $xml = <<END_XML;
<?xml version="1.0" encoding="iso-8859-1"?>
<RootElement>
  <MyContainer xmlns="http://something.org/2000/something/">
    <MyField>
        <Name>ID</Name>
        <Value>12345</Value>
    </MyField>
    <MyField>
        <Name>Name</Name>
        <Value>Ben</Value>
    </MyField>
  </MyContainer>
</RootElement>
END_XML

my $parser = XML::LibXML->new();

$parser->recover_silently(1);

my $doc = $parser->parse_string($xml);

my $root = $doc->documentElement();

my $xpc = XML::LibXML::XPathContext->new($root);

$xpc->registerNs("x", "http://something.org/2000/something/");

foreach my $node ($xpc->findnodes('x:MyContainer/x:MyField')) {
    print $node->toString();
}