PHP DOMDocument : How to parse xml/rss Tags with C

Posted 2019-06-14 15:28

Question:

I have the below RSS to parse, something like:

<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:x-wr="http://www.w3.org/2002/12/cal/prod/Apple_Comp_628d9d8459c556fa#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:x-example="http://www.example.com/rss/x-example" xmlns:x-microsoft="http://schemas.microsoft.com/x-microsoft" xmlns:xCal="urn:ietf:params:xml:ns:xcal" version="2.0">
    <channel>
        <item>
            <title>About Apples</title>
            <author>David K. Lowie</title>
            <x-trumba:customfield name="description">This is the description about apples</xCal:customfield>
            <x-trumba:customfield name="category">Fruits,Food,Apple</xCal:customfield>
        </item>
        <item>
            <title>About Oranges</title>
            <author>Marry L. Jones</title>
            <x-trumba:customfield name="description">This is the description about oranges</xCal:customfield>
            <x-trumba:customfield name="category">Fruits,Food,Orange</xCal:customfield>
        </item>
    </channel>
</rss>

In PHP, I only know how to read the first two nodes, like this:

$rss = new DOMDocument();
$rss->load( "http://www.example.com/books.rss" );

foreach( $rss->getElementsByTagName("item") as $node ) {
    echo $node->getElementsByTagName("title")->item(0)->nodeValue;
    echo $node->getElementsByTagName("author")->item(0)->nodeValue;
}

But these elements are the problem:

<x-trumba:customfield name="description">This is the description about apples</xCal:customfield>
<x-trumba:customfield name="category">Fruits,Food,Apple</xCal:customfield>

Please help:

  • How do I parse the last nodes, such as <x-trumba:customfield name="description">?

(I can't change the RSS source since it's not under my control.)

Please kindly help.

Answer 1:

Your XML is invalid: the 'x-trumba' prefix is not defined, and the closing tags of those elements use the 'xCal' prefix, which refers to urn:ietf:params:xml:ns:xcal. The 'author' elements are also closed with </title>.

So replacing the prefix of the opening tags with 'xCal' and fixing the closing tags for 'author' makes the XML valid.
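To see what the parser itself complains about, libxml's errors can be collected instead of being emitted as warnings. The following is a minimal sketch, not part of the original answer; the helper name xmlParseErrors and the two one-line sample documents are made up for illustration.

```php
<?php
// Hypothetical helper: returns libxml's error messages for $xml.
// An empty array means the document parsed without errors.
function xmlParseErrors(string $xml): array
{
    // Collect errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Shortened stand-ins for the feed: same prefix problem, then the corrected form.
$ns     = 'xmlns:xCal="urn:ietf:params:xml:ns:xcal"';
$broken = '<rss ' . $ns . '><x-trumba:customfield name="category">Fruits</xCal:customfield></rss>';
$fixed  = '<rss ' . $ns . '><xCal:customfield name="category">Fruits</xCal:customfield></rss>';

print_r(xmlParseErrors($broken)); // should report at least one error
print_r(xmlParseErrors($fixed));  // should be an empty array
```

The exact wording of the messages depends on the installed libxml version, so the sketch only distinguishes between "some errors" and "no errors".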

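Then it is possible to register the xCal namespace and use XPath to fetch the custom field contents. The prefix registered for XPath (here 'x') does not have to match the prefix used in the document; only the namespace URI has to match: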

$rss = new DOMDocument();
$rss->load( "http://www.example.com/books.rss" );
$xpath = new DOMXPath($rss);
$xpath->registerNamespace('x', 'urn:ietf:params:xml:ns:xcal');

foreach( $xpath->evaluate("//item") as $item ) {
    echo $xpath->evaluate('string(title)', $item), "\n";
    echo $xpath->evaluate('string(x:customfield[@name="description"])', $item), "\n";
}

Output:

About Apples
This is the description about apples
About Oranges
This is the description about oranges

The XPath expression uses a condition ([@name="description"]) to filter the customfield element nodes.
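The same approach extends to the other custom fields. As a rough sketch, assuming the corrected feed is available as a string (the inline sample, the $items array layout and the splitting of the category value on commas are illustrative choices, not part of the original answer):

```php
<?php
// Corrected copy of the sample feed, inlined here so the sketch is self-contained.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:xCal="urn:ietf:params:xml:ns:xcal" version="2.0">
    <channel>
        <item>
            <title>About Apples</title>
            <author>David K. Lowie</author>
            <xCal:customfield name="description">This is the description about apples</xCal:customfield>
            <xCal:customfield name="category">Fruits,Food,Apple</xCal:customfield>
        </item>
        <item>
            <title>About Oranges</title>
            <author>Marry L. Jones</author>
            <xCal:customfield name="description">This is the description about oranges</xCal:customfield>
            <xCal:customfield name="category">Fruits,Food,Orange</xCal:customfield>
        </item>
    </channel>
</rss>
XML;

$rss = new DOMDocument();
$rss->loadXML($xml);

$xpath = new DOMXPath($rss);
$xpath->registerNamespace('x', 'urn:ietf:params:xml:ns:xcal');

$items = [];
foreach ($xpath->evaluate('//item') as $item) {
    // Index every custom field of this item by its name attribute.
    $fields = [];
    foreach ($xpath->evaluate('x:customfield[@name]', $item) as $field) {
        $fields[$field->getAttribute('name')] = $field->textContent;
    }

    $items[] = [
        'title'      => $xpath->evaluate('string(title)', $item),
        'author'     => $xpath->evaluate('string(author)', $item),
        'fields'     => $fields,
        // The category field holds a comma-separated list in the sample feed.
        'categories' => isset($fields['category']) ? explode(',', $fields['category']) : [],
    ];
}

print_r($items);
```

Looping over x:customfield[@name] avoids writing one XPath expression per field name, which helps if the feed carries custom fields that are not known in advance.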