I have a file that looks like this:
<ExternalPage about="http://animation.about.com/">
<d:Title>About.com: Animation Guide</d:Title>
<d:Description>Keep up with developments in online animation for all skill levels. Download tools, and seek inspiration from online work.</d:Description>
<topic>Top/Arts/Animation</topic>
</ExternalPage>
<ExternalPage about="http://www.toonhound.com/">
<d:Title>Toonhound</d:Title>
<d:Description>British cartoon, animation and comic strip creations - links, reviews and news from the UK.</d:Description>
<topic>Top/Arts/Animation</topic>
</ExternalPage>
etc.
I'm trying to get the "about" URL, as well as the nested title and description. I've tried the following code, but all I get is a bunch of dashes:
$reader = new XMLReader();
if (!$reader->open("dbpedia/links/xml.xml")) {
    die("Failed to open 'xml.xml'");
}
$num=0;
while($reader->read() && $num<200) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'ExternalPage') {
        $url = $reader->getAttribute('about');
        while ($xml->nodeType !== XMLReader::END_ELEMENT ){
            $reader->read();
            if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'd:Title') {
                $title=$xmlReader->value;
            }
            elseif ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'd:Description') {
                $desc=$xmlReader->value;
            }
        }
    }
    $num++;echo $url."-".$title."-".$desc."<br />";
}
$reader->close();
I'm new to XMLReader, so I'd appreciate it if someone could point out what I'm doing wrong.
Note: I'm using XMLReader because the file is huge (millions of lines).
EDIT: The beginning of the file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
<!-- Generated at 2013-02-10 00:03:45 EST from DMOZ 2.0 -->
<Topic r:id="">
<catid>1</catid>
</Topic>
<Topic r:id="Top/Arts">
<catid>381773</catid>
</Topic>
<Topic r:id="Top/Arts/Animation">
<catid>423945</catid>
<link1 r:resource="http://www.awn.com/"></link1>
<link r:resource="http://animation.about.com/"></link>
<link r:resource="http://www.toonhound.com/"></link>
<link r:resource="http://enculturation.gmu.edu/2_1/pisters.html"></link>
<link r:resource="http://www.digitalmediafx.com/Features/animationhistory.html"></link>
<link r:resource="http://www.spark-online.com/august00/media/romano.html"></link>
<link r:resource="http://www.animated-divots.net/"></link>
</Topic>
<ExternalPage about="http://www.awn.com/">
<d:Title>Animation World Network</d:Title>
<d:Description>Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.</d:Description>
<priority>1</priority>
<topic>Top/Arts/Animation</topic>
</ExternalPage>
etc
It'll take time and proper debugging to come up with working pure XMLReader code. Meanwhile, try this hybrid method, which streams through the file with XMLReader and parses each <ExternalPage> node with SimpleXML:
$xmlR = new XMLReader;
$xmlR->open('dbpedia/links/xml.xml');

// Skip until the first <ExternalPage> node
while ($xmlR->read() && $xmlR->name !== 'ExternalPage');

$loadedNS_f = false;
while ($xmlR->name === 'ExternalPage')
{
    // Read the entire parent tag with its children
    $sxmlNode = new SimpleXMLElement($xmlR->readOuterXML());

    // Collect all namespaces in the node recursively, once; assumes all nodes are similar
    if (!$loadedNS_f) {
        $tagNS = $sxmlNode->getNamespaces(true);
        $loadedNS_f = true;
    }

    $URL = (string) $sxmlNode['about'];
    $dNS = $sxmlNode->children($tagNS['d']);
    $Title = (string) $dNS->Title;
    $Desc = (string) $dNS->Description;
    $Topic = (string) $sxmlNode->topic;
    var_dump($URL, $Title, $Desc, $Topic);

    // Jump to the next <ExternalPage> tag
    $xmlR->next('ExternalPage');
}
$xmlR->close();
The reason it is not working for you is that you only read as far as the start tag of the d:Title element, and that node has no value:
if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'd:Title') {
    $title=$xmlReader->value;
}
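You probably wanted the nodeValue of that DOM element, but that is not what $xmlReader->value returns. Note also that your snippet reads from $xml and $xmlReader, although $reader is the only variable you define. Knowing this, there are multiple ways to deal with the missing value: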
Expand the node (XMLReader::expand()) and get the nodeValue (quick example):
$title = $reader->expand()->nodeValue;
Process all XMLReader::TEXT (3) and/or XMLReader::CDATA (4) child nodes on your own (decide whether a node is a child node by looking at XMLReader::$depth).
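The second option can be sketched roughly as follows. This is only a minimal illustration on an inline sample document, not your real file; the helper name collectText() and the sample XML are assumptions made up for the example.

```php
<?php
// Hypothetical helper (not part of XMLReader): gathers the text of the
// element the reader is currently positioned on by concatenating the
// TEXT and CDATA nodes found below it.
function collectText(XMLReader $reader): string
{
    if ($reader->isEmptyElement) {
        return '';
    }
    $depth = $reader->depth;
    $text  = '';
    while ($reader->read()) {
        // An END_ELEMENT at the starting depth is the element's own end tag.
        if ($reader->nodeType === XMLReader::END_ELEMENT && $reader->depth === $depth) {
            break;
        }
        if ($reader->nodeType === XMLReader::TEXT || $reader->nodeType === XMLReader::CDATA) {
            $text .= $reader->value;
        }
    }
    return $text;
}

// Sample data invented for this sketch.
$sample = <<<'XML'
<RDF xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://www.toonhound.com/">
    <d:Title>Toonhound</d:Title>
    <d:Description><![CDATA[British cartoon & animation links]]></d:Description>
  </ExternalPage>
</RDF>
XML;

$reader = new XMLReader();
$reader->XML($sample);

$title = $desc = '';
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    if ($reader->localName === 'Title') {
        $title = collectText($reader);
    } elseif ($reader->localName === 'Description') {
        $desc = collectText($reader);
    }
}
$reader->close();

echo $title, ' - ', $desc, "\n";
```

Unlike expand(), this never builds a DOM subtree, at the cost of tracking the depth yourself.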
In any case, to streamline your code you can provide exactly what you need directly, for example by writing your own set of functions or by extending the XMLReader class:
class MyXMLReader extends XMLReader
{
    public function readToNextElement()
    {
        while (
            $result = $this->read()
            and $this->nodeType !== self::ELEMENT
        ) ;
        return $result;
    }

    public function readToNext($localname)
    {
        while (
            $result = $this->readToNextElement()
            and $this->localName !== $localname
        ) ;
        return $result;
    }

    public function readToNextChildElement($depth)
    {
        // if the current element is the parent and
        // empty there are no children to go into
        if ($this->depth == $depth && $this->isEmptyElement) {
            return false;
        }
        while ($result = $this->read()) {
            if ($this->depth <= $depth) return false;
            if ($this->nodeType === self::ELEMENT) break;
        }
        return $result;
    }

    public function getNodeValue($default = NULL)
    {
        $node = $this->expand();
        return $node ? $node->nodeValue : $default;
    }
}
You can then just use this extended class to do your processing:
$reader = new MyXMLReader();
$reader->open($uri);
$num = 0;
while ($reader->readToNext('ExternalPage') and $num < 200) {
    $url = $reader->getAttribute('about');
    $depth = $reader->depth;
    $title = $desc = '';
    while ($reader->readToNextChildElement($depth)) {
        switch ($reader->localName) {
            case 'Title':
                $title = $reader->getNodeValue();
                break;
            case 'Description':
                $desc = trim($reader->getNodeValue());
                break;
        }
    }
    $num++;
    echo "#", $num, ": ", $url, " - ", $title, " - ", $desc, "<br />\n";
}
As you can see, this makes your code much more readable, and you no longer have to check at every step whether you have read to the right node.
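Here's an alternate way to get to that attribute. Note that it loads the whole document into memory, so it only suits files that fit in RAM:
$xml = new SimpleXMLElement(file_get_contents($filename));
// The root <RDF> element declares a default namespace, so XPath needs a prefix for it
$xml->registerXPathNamespace('dmoz', 'http://dmoz.org/rdf/');
$result = $xml->xpath('/dmoz:RDF/dmoz:ExternalPage/@about');
var_dump($result);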
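For a file of the size described in the question, the same attribute could instead be collected with a streaming reader. The following is a rough sketch on an inline sample, not a tested solution for the real dump; the sample XML and the $urls variable are assumptions for illustration, and the real file would be opened with open() rather than XML().

```php
<?php
// Sample data invented for this sketch.
$sample = <<<'XML'
<RDF xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://dmoz.org/rdf/">
  <Topic><catid>1</catid></Topic>
  <ExternalPage about="http://www.awn.com/"><d:Title>Animation World Network</d:Title></ExternalPage>
  <ExternalPage about="http://www.toonhound.com/"><d:Title>Toonhound</d:Title></ExternalPage>
</RDF>
XML;

$reader = new XMLReader();
$reader->XML($sample);

// Advance node by node until the first <ExternalPage> start tag.
$found = false;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'ExternalPage') {
        $found = true;
        break;
    }
}

$urls = [];
while ($found) {
    $urls[] = $reader->getAttribute('about');
    // next() skips the current subtree and stops at the next element
    // with the given name; it returns false when there is none left.
    $found = $reader->next('ExternalPage');
}
$reader->close();

var_dump($urls);
```

Only one node is held in memory at a time, and next() avoids descending into children that are not needed.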