I am using a combination of XMLReader and SimpleXML to parse the posts in a WordPress export file. I realize this is a little out of the norm, but it's more of a backup project, so that we can easily pull up one of these articles if we need it in the future. The WP site that they were on needs to come down.
The issue I am having is that some of the nodes in the XML file are empty or contain useless values (i.e., not full posts). I need to add some string length conditions, but I'm not sure how to check each one.
<?php
$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';
$reader = new XMLReader();
$reader->open($path_to_xml_file);
while ($reader->read())
{
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item')
    {
        $doc = new DOMDocument('1.0', 'UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(), true));
        //echo $xml->title; //or whatever
        // Take care of the articles
        $newcontent = $xml->children('http://purl.org/rss/1.0/modules/content/');
        $contentString = $newcontent->encoded;
        $titleString = $xml->title;
        echo '
<div class="article-container" id="article-' . $xml->title . '">
<a href="#top" class="top-link">Back to the Top</a>
<h2>' . $xml->title . '</h2>
<div class="articles">' . $newcontent->encoded . '</div>
</div>';
    }
}
?>
I was able to do this check successfully with SimpleXML alone, but it was too much of a memory hog by itself. This was my SimpleXML code:
<?php
$url = 'wordpress.2011.xml.gz';
$xml = new SimpleXMLElement("compress.zlib://$url", NULL, TRUE);
foreach ($xml->item as $item) :
    $newcontent = $item->children('http://purl.org/rss/1.0/modules/content/');
    $contentString = $newcontent->encoded;
    $titleString = $item->title;
    if ((strlen($contentString) < 13) || (strlen($titleString) < 5)) {
        echo '';
    } else {
        echo '
<div class="article-container" id="article-' . $item->title . '">
<a href="#top" class="top-link">Back to the Top</a>
<h2>' . $item->title . '</h2>
<div class="articles">' . $newcontent->encoded . '</div>
</div>';
    }
endforeach;
?>
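The length check above does not depend on how the item was loaded, so it could be pulled into a small helper. The following is only a minimal sketch, assuming the same thresholds as above (content of at least 13 characters, title of at least 5); the function name `isFullPost` and the sample XML are hypothetical and not taken from the export file:

```php
<?php
// Hypothetical helper: true when both strings are long enough to count as a post.
// The thresholds mirror the strlen() checks above (content >= 13, title >= 5).
function isFullPost(string $title, string $content): bool
{
    return strlen($content) >= 13 && strlen($title) >= 5;
}

// Made-up sample standing in for one <item> of the export.
$sample = <<<XML
<item xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <title>Hello world</title>
  <content:encoded><![CDATA[<p>A full post body.</p>]]></content:encoded>
</item>
XML;

$item = simplexml_load_string($sample);

// Casting to string makes it explicit that plain text is being measured,
// not SimpleXMLElement objects.
$title   = (string) $item->title;
$content = (string) $item->children('http://purl.org/rss/1.0/modules/content/')->encoded;

var_dump(isFullPost($title, $content)); // bool(true)
var_dump(isFullPost('', $content));     // bool(false)
```

Because the helper only takes strings, the same call should work whether the item comes from SimpleXML alone or from a node expanded by XMLReader.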
UPDATE
With Francis' help, it is working now. Here is the code:
<?php
$path_to_xml_file = 'compress.zlib://wordpress.2011.xml.gz';
$reader = new XMLReader();
$reader->open($path_to_xml_file);
$contentNS = 'http://purl.org/rss/1.0/modules/content/';
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT and $reader->name == 'item') {
        $doc = new DOMDocument('1.0', 'UTF-8');
        $xml = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $titleString = (string) $xml->title;
        $contentString = (string) $xml->children($contentNS)->encoded;
        if (strlen($contentString) > 12 and strlen($titleString) > 4) {
            // Be careful with your output escaping!
            // The output below looks like it might be wrong:
            // - $titleString is used for an ID (use a slug instead)
            // - $titleString is not escaped
            // - $contentString should be escaped? Not sure here.
            // Have you considered using XMLWriter()?
            echo '
<div class="article-container" id="article-' . $titleString . '">
<a href="#top" class="top-link">Back to the Top</a>
<h2>' . $titleString . '</h2>
<div class="articles">' . $contentString . '</div>
</div>';
        } else {
            echo '';
        }
        // Skip the subtree and move to the node after this item; we already
        // expand()ed it, so we don't need to walk it.
        // Caveat: the loop's read() then advances once more, so an <item> that
        // follows immediately, with no whitespace in between, would be skipped.
        $reader->next();
    }
}
?>
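The comments in the code above flag the output escaping. As a rough sketch of one way to address them, the title could be passed through `htmlspecialchars()` for the heading and reduced to a slug for the `id` attribute. The helper names `slugify` and `renderArticle` are hypothetical, and emitting the post body unescaped is an assumption that the export already contains trusted HTML:

```php
<?php
// Hypothetical helper: reduce a title to lowercase ASCII letters, digits and
// hyphens so it can be used inside an id attribute.
function slugify(string $title): string
{
    $slug = strtolower(trim($title));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

// Hypothetical helper: build the article markup with an escaped title.
// $content is emitted as-is on the assumption that it is trusted post HTML.
function renderArticle(string $title, string $content): string
{
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    return '
<div class="article-container" id="article-' . slugify($title) . '">
<a href="#top" class="top-link">Back to the Top</a>
<h2>' . $safeTitle . '</h2>
<div class="articles">' . $content . '</div>
</div>';
}

echo renderArticle('Tips & "Tricks" for 2011', '<p>Post body here.</p>');
```

Two caveats with this sketch: titles made up entirely of non-ASCII characters collapse to an empty slug, and two posts with the same title get the same `id`, so appending something unique per post may be worth considering.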