PHP get img src from xml

Posted 2019-09-18 13:01

Question:

I have a page with XML that looks like:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">
  <channel>
    <title>FB-RSS feed for Salman Khan  Fc</title>
    <link>http://facebook.com/profile.php?id=1636293749919827/</link>
    <description>FB-RSS feed for Salman Khan  Fc</description>
    <managingEditor>http://fbrss.com (FB-RSS)</managingEditor>
    <pubDate>31 Mar 16 20:00 +0000</pubDate>
    <item>
      <title>Photo - Who is the Best Khan ?</title>
      <link>https://www.facebook.com/SalmanKhanFns/photos/a.1639997232882812.1073741827.1636293749919827/1713146978901170/?type=3</link>
      <description>&lt;a href=&#34;https://www.facebook.com/SalmanKhanFns/photos/a.1639997232882812.1073741827.1636293749919827/1713146978901170/?type=3&#34;&gt;&lt;img src=&#34;https://scontent.xx.fbcdn.net/hphotos-xap1/v/t1.0-0/s130x130/11059765_1713146978901170_8711054263905505442_n.jpg?oh=fa2978c5ecfb3ae424e9082aaa057b8f&amp;oe=57BB41D5&#34;&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;Who is the Best Khan ?</description>
      <author>FB-RSS</author>
      <guid>1636293749919827_1713146978901170</guid>
      <pubDate>31 Mar 16 20:00 +0000</pubDate>
    </item>
    <item>
      <title>Photo</title>
      <link>https://www.facebook.com/SalmanKhanFns/photos/a.1636293813253154.1073741825.1636293749919827/1713146755567859/?type=3</link>
      <description>&lt;a href=&#34;https://www.facebook.com/SalmanKhanFns/photos/a.1636293813253154.1073741825.1636293749919827/1713146755567859/?type=3&#34;&gt;&lt;img src=&#34;https://scontent.xx.fbcdn.net/hphotos-xap1/v/t1.0-0/s130x130/12294686_1713146755567859_6728330714340999478_n.jpg?oh=6d90a688fdf4342f9e12e9ff9a66b127&amp;oe=57778068&#34;&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;</description>
      <author>FB-RSS</author>
      <guid>1636293749919827_1713146755567859</guid>
      <pubDate>31 Mar 16 19:58 +0000</pubDate>
    </item>
  </channel>
</rss>

I want to get the src of every img in the XML above.

The images are in the <description> elements, but not as literal tags such as

<img...

Instead, they look like:

&lt;img src=&#34;https://scontent.xx.fbc... .

Each < is replaced with &lt;, which I guess is why $imgs = $dom->getElementsByTagName('img'); returns nothing.

Is there any workaround?

This is how I call it:

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadXML( $xml_file);
$imgs = ...(get the imgs to extract the src...('img') ??;

//Then run a possible foreach
//something like:

foreach($imgs as $img){

   $src= ///the src of the $img

   //try it out
   echo '<img src="'.$src.'" /> <br />';
}

Any ideas?

Answer 1:

You have HTML embedded as escaped text inside XML elements, so you have to retrieve the XML nodes first, load the text of each one as HTML, and then read the attribute you want from the resulting tag.
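
To see why the original getElementsByTagName('img') call finds nothing, here is a minimal sketch. It assumes a made-up one-item feed in $xml with a placeholder example.com URL (neither comes from the question's feed). The escaped markup is plain text to the XML parser, and it only turns back into HTML markup when the text is read through nodeValue:

```php
<?php
// Hypothetical one-item feed; the URL is a placeholder for illustration.
$xml = '<rss version="2.0"><channel><item>'
     . '<description>&lt;img src=&#34;https://example.com/a.jpg&#34;&gt;</description>'
     . '</item></channel></rss>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// The XML tree contains no <img> element, only escaped text.
$imgCount = $dom->getElementsByTagName('img')->length;

// Reading nodeValue returns the text with the entities already decoded.
$decoded = $dom->getElementsByTagName('description')->item(0)->nodeValue;

var_dump($imgCount);
var_dump($decoded);
```

The decoded string is what gets handed to a second, HTML-aware DOMDocument in the steps that follow.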

Your XML contains several <description> nodes (the <channel> has one of its own), so ->getElementsByTagName('description') would return more nodes than you want. Use DOMXPath to select only the <description> nodes in the right position in the tree:

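// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors( true );

$dom = new DOMDocument();
$dom->loadXML( $xml );
$dom->formatOutput = true;

// Select only the <description> of each <item>, not the channel-level one.
$xpath = new DOMXPath( $dom );
$nodes = $xpath->query( '/rss/channel/item/description' );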

Then iterate over the nodes, load each node's value into a new DOMDocument (there is no need to decode the HTML entities; DOM has already decoded them for you), and extract the src attribute from the <img> node:

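foreach( $nodes as $node )
{
    // nodeValue is already entity-decoded, so it holds real HTML markup.
    $html = new DOMDocument();
    $html->loadHTML( $node->nodeValue );
    $img = $html->getElementsByTagName( 'img' )->item(0);
    if( $img === null ) {
        continue; // this <description> contains no <img>
    }
    $src = $img->getAttribute('src');
    echo $src, PHP_EOL;
}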
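
Putting the two steps together, one possible self-contained sketch looks like this. The function name extractImageSources, the sample $feed, and its example.com URLs are hypothetical and only serve the illustration; the code assumes PHP 7 or later with the DOM extension enabled and a non-empty input string. Unlike the loop above, it collects every <img> in a description rather than only the first one:

```php
<?php
/**
 * Hypothetical helper: returns the src of every <img> found in the
 * escaped HTML of each /rss/channel/item/description element.
 */
function extractImageSources(string $xml): array
{
    // Collect libxml errors internally; remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $sources = [];
    $dom = new DOMDocument();

    if ($dom->loadXML($xml)) {
        $xpath = new DOMXPath($dom);

        foreach ($xpath->query('/rss/channel/item/description') as $node) {
            // nodeValue is entity-decoded, so this is real HTML markup.
            $markup = trim($node->nodeValue);
            if ($markup === '') {
                continue; // nothing to parse in an empty description
            }

            $html = new DOMDocument();
            $html->loadHTML($markup);

            foreach ($html->getElementsByTagName('img') as $img) {
                if ($img->hasAttribute('src')) {
                    $sources[] = $img->getAttribute('src');
                }
            }
        }
    }

    // Discard collected errors and restore the caller's setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $sources;
}

// Made-up feed: two items with an image, one without.
$feed = '<rss version="2.0"><channel><description>Channel text</description>'
      . '<item><description>&lt;img src=&#34;https://example.com/a.jpg&#34;&gt;</description></item>'
      . '<item><description>No picture here</description></item>'
      . '<item><description>&lt;a href=&#34;#&#34;&gt;&lt;img src=&#34;https://example.com/b.jpg&#34;&gt;&lt;/a&gt;</description></item>'
      . '</channel></rss>';

print_r(extractImageSources($feed));
```

Restoring the previous libxml_use_internal_errors setting keeps the helper from changing error handling for the rest of the script, which is a design choice of this sketch rather than something the answer requires.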
