Trying to Parse Only the Images from an RSS Feed

Posted 2020-04-29 16:13

Question:

First, I am a PHP newbie. I have looked at the question and solution here. For my needs, however, the parsing does not go deep enough into the individual articles.

A small sample of my RSS feed looks like this:

 <channel>
 <atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
 <title>My Web Site</title>
 <description>My Feed</description>
 <link>http://mywebsite.com/</link>

 <image>
 <url>http://mywebsite.com/views/images/banner.jpg</url>
 <title>My Title</title>
 <link>http://mywebsite.com/</link>
 <description>Visit My Site</description>
 </image>

 <item>
 <title>Article One</title>
 <guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
 <link>http://mywebsite.com/geturl/e8c5106</link>
 <comments>http://mywebsite.com/details/e8c5106#comments</comments>     
 <pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate> 
 <category>Category 1</category>    
 <description>
      <![CDATA[<div>
      <img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0"  />  
      <ul><li>Poster: someone's name;</li>
      <li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
      <li>Rating: 5</li>
      <li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
      </description>
 </item> 
 <item>..

The image links that I want to extract are the ones nested deep inside each item > description.

The code in my PHP file reads:

     <?php
 $xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
 $imgs = $xml->xpath('/item/description/img');
 foreach($imgs as $image) {
      echo $image->src;
 }
 ?>

Can someone please help me figure out how to configure the php code above?

Also a very newbie question... once I get the resulting image urls, how can I display the images in a row on my html?

Many thanks!!!

Hernando

Answer 1:

The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to the syntax highlighting on this site. They are just text inside the <description> element that happens to contain the characters < and >.

The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like &lt;img&gt;. (I went into more technical details on another answer.)
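As a quick way to check this, the sketch below parses two hypothetical <description> elements (the file name a.jpg and the variable names are made up for illustration): one wraps the HTML in CDATA, the other escapes it by hand. It compares the resulting text and counts how many <img> elements XPath can see.

```php
<?php
// Two hypothetical <description> elements carrying the same HTML:
// one wrapped in CDATA, one entity-escaped by hand.
$cdata   = simplexml_load_string('<description><![CDATA[<img src="a.jpg" />]]></description>');
$escaped = simplexml_load_string('<description>&lt;img src="a.jpg" /&gt;</description>');

// Casting either node to a string gives the same plain text.
$sameText = ((string) $cdata === (string) $escaped);

// The <img> is text, not a child element, so XPath has nothing to match.
$imgCount = count($cdata->xpath('//img'));

var_dump($sameText, $imgCount);
```

Both forms give the parser the same string, which is why an XPath query for img against the feed itself comes back empty.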

So extracting the images from the RSS requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.

$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');

// Step 1: get the text of every <description> inside an <item>
$descriptions = $xml->xpath('//item/description');
foreach ( $descriptions as $description_node ) {
    // The description may not be valid XML, so use the more forgiving HTML parser
    $description_dom = new DOMDocument();
    $description_dom->loadHTML( (string)$description_node );

    // Switch back to SimpleXML for readability
    $description_sxml = simplexml_import_dom( $description_dom );

    // Step 2: find all images, and extract their 'src' attribute
    $imgs = $description_sxml->xpath('//img');
    foreach($imgs as $image) {
        // Attributes are read with array syntax; print one URL per line
        echo (string)$image['src'], "\n";
    }
}
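For trying the two steps without fetching the live feed, here is a minimal sketch that works on an XML string instead. The function name extract_image_urls and the inline sample feed are assumptions made for illustration. It reads the images with DOMDocument::getElementsByTagName rather than converting back to SimpleXML, and it collects libxml warnings instead of printing them.

```php
<?php
// Hypothetical helper: returns every <img src> found inside item descriptions.
function extract_image_urls(string $feedXml): array
{
    $urls = [];
    $xml = simplexml_load_string($feedXml);
    if ($xml === false) {
        return $urls; // the feed could not be parsed
    }

    // Collect libxml warnings about sloppy HTML instead of printing them.
    $previous = libxml_use_internal_errors(true);

    foreach ($xml->xpath('//item/description') as $description) {
        $html = trim((string) $description);
        if ($html === '') {
            continue; // nothing to parse in an empty description
        }
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        foreach ($dom->getElementsByTagName('img') as $img) {
            $urls[] = $img->getAttribute('src');
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $urls;
}

// Inline sample feed, trimmed down from the one in the question.
$sample = <<<'XML'
<rss version="2.0"><channel>
  <item><description><![CDATA[<div><img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" /></div>]]></description></item>
  <item><description><![CDATA[<p>No image here</p>]]></description></item>
</channel></rss>
XML;

$urls = extract_image_urls($sample);
print_r($urls);
```

Returning an array rather than echoing each URL keeps the extraction separate from whatever markup is later used to display the images.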


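Answer 2:

I don't have much experience with XPath, but you could try the following:

$imgs = $xml->xpath('//item//img');

This will select all img elements that are inside item elements, regardless of whether there are other elements in between. The leading // searches for item anywhere in the document, whereas a relative path such as item//img only looks at direct children of the element you call xpath() on. Otherwise, you'd need something like /rss/channel/item.... Note that this only matches img tags that are real XML elements; tags wrapped in a CDATA section are plain text and will not be matched.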
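The different path forms can be compared on a small feed. The sketch below assumes a made-up two-item document and counts how many item elements each expression matches when xpath() is called on the root <rss> element.

```php
<?php
// Hypothetical miniature feed with two items under <rss><channel>.
$xml = simplexml_load_string(
    '<rss version="2.0"><channel><item><title>One</title></item><item><title>Two</title></item></channel></rss>'
);

// Each expression is evaluated with the root <rss> element as the context node.
$counts = [
    'item'              => count($xml->xpath('item')),              // direct children of <rss> only
    '/item'             => count($xml->xpath('/item')),             // a root element named item
    '//item'            => count($xml->xpath('//item')),            // anywhere in the document
    '/rss/channel/item' => count($xml->xpath('/rss/channel/item')), // full path from the root
];
print_r($counts);
```

Only the last two expressions find the items here, because <item> sits under <channel> rather than directly under the root.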

As for displaying the images: just output <img> tags followed by line breaks, like so:

foreach($imgs as $image) {
    // 'src' is an attribute, so read it with array syntax rather than ->src
    echo '<img src="' . $image['src'] . '" /><br />';
}

The preferred way would be to use CSS instead of <br> tags, but I think they are simpler to start with.
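For the CSS route, one possible sketch follows. It assumes the URLs have already been collected into a plain array and uses a flex container so the images sit side by side. The helper name render_image_row and the inline styles are illustrative choices, not something taken from the question.

```php
<?php
// Hypothetical helper: wraps image URLs in a flex container so they sit in a row.
function render_image_row(array $urls): string
{
    $html = '<div style="display:flex; flex-wrap:wrap; gap:8px;">';
    foreach ($urls as $url) {
        // Escape the URL so special characters cannot break out of the attribute.
        $html .= '<img src="' . htmlspecialchars($url, ENT_QUOTES) . '" alt="" />';
    }
    return $html . '</div>';
}

echo render_image_row([
    'http://mywebsite.com/myimages/1521197-main.jpg',
]);
```

With flex-wrap set, the images flow onto a new row when the container runs out of width, so no <br> tags are needed.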