How to handle multiple xpath at once (based on fee

Posted 2019-06-28 01:58

The code below is tested and working; it prints the contents of a feed that has this structure:

<rss>
    <channel>
        <item>
            <pubDate/>
            <title/>
            <description/>
            <link/>
            <author/>
        </item>
    </channel>
</rss>

What I haven't managed to do is print feeds that follow the structure below (the difference is <feed><entry><published>), even though I changed the XPath to /feed//entry. You can see the structure in the page source.

<feed>
    <entry>
        <published/>
        <title/>
        <description/>
        <link/>
        <author/>
    </entry>
</feed>

Note that the code sorts all item elements by pubDate. For the second feed structure, I guess it should sort all entry elements by published.

I have probably made a mistake in the XPath that I can't find. However, if I do manage to print that feed correctly, how can I modify the code to handle different structures all at once?

Is there any service that would let me create and host my own feeds based on those feeds, so that they all have the same structure? I hope I have made myself clear. Thank you.

<?php

$feeds = array();

// Get all feed entries
$entries = array();
foreach ($feeds as $feed) {
    $xml = simplexml_load_file($feed);
    $entries = array_merge($entries, $xml->xpath(''));
}

?>

Tags: php xml xslt xpath
3 answers
够拽才男人
#2 · 2019-06-28 02:20

This question is really two questions, "How to handle multiple xpath at once" and "[How to] create my own feeds with the same structure".

The second one has been brilliantly answered by Dimitre Novatchev. If you want to "merge" or transform one or several XML documents, that's definitely what I'd recommend.

Meanwhile, I'll take the easy path and address the first question, "How to handle multiple xpath at once". It's easy: XPath has a union operator for that, |. If you want to query all nodes that match /feed//entry or /rss//item, you can use /feed//entry | /rss//item.
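
As a minimal sketch of the union with SimpleXML: the two inline feed strings are made-up sample data, and the feeds are assumed to declare no default namespace (real Atom feeds usually do, in which case the unprefixed names below would match nothing).

```php
<?php
// Hypothetical sample feeds, one per structure from the question.
$rss  = '<rss><channel><item><title>A</title></item></channel></rss>';
$atom = '<feed><entry><title>B</title></entry></feed>';

$entries = array();
foreach (array($rss, $atom) as $source) {
    $xml = simplexml_load_string($source);
    // The union matches whichever of the two structures this document uses.
    $found = $xml->xpath('/feed//entry | /rss//item');
    if ($found) {
        $entries = array_merge($entries, $found);
    }
}

foreach ($entries as $entry) {
    echo $entry->getName(), ': ', $entry->title, "\n";
}
```

The same expression is applied to every document, so the loop does not need to know in advance which format it is reading.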

兄弟一词,经得起流年.
#3 · 2019-06-28 02:34

The main contribution of this answer is a solution (at the end) that can be used with any number of formats, simply by listing all alternative names for the "entry" element in the external (global) parameter $postElements and all alternative names for the "published date" element in the external (global) parameter $pub-dateElements.

Besides this, here is how to specify an XPath expression that selects all /rss//item and all /feed//entry elements.

In the simple case of just two possible document formats, this XPath expression (as proposed by @Josh Davis) works correctly:

/rss//item  |   /feed//entry

A more general XPath expression selects the wanted elements from any number of document formats:

/*[contains($topElements, concat('|',name(),'|'))]
    //*[contains($postElements, concat('|',name(),'|'))]

where $topElements should be replaced by a pipe-delimited string (with a leading and trailing pipe) of all possible names for the top element, and $postElements by a pipe-delimited string of all possible names for an "entry" element. The // step also allows the "entry" elements to sit at different depths in the different document formats.

In particular, for this concrete case the XPath expression is:

/*[contains('|feed|rss|', concat('|',name(),'|'))]
    //*[contains('|item|entry|', concat('|',name(),'|'))]
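
When the query runs from PHP, the variables have to be inlined as string literals, because SimpleXMLElement::xpath() accepts only the expression string. A rough sketch follows; buildEntryXPath is a hypothetical helper name, and the element names are assumed to be plain names without quotes or pipes.

```php
<?php
// Hypothetical helper: builds the general expression from plain name lists.
function buildEntryXPath(array $topElements, array $postElements)
{
    // Wrap each list in pipes so only whole names can match.
    $top  = '|' . implode('|', $topElements) . '|';
    $post = '|' . implode('|', $postElements) . '|';

    return "/*[contains('$top', concat('|',name(),'|'))]"
         . "//*[contains('$post', concat('|',name(),'|'))]";
}

$xpath = buildEntryXPath(array('feed', 'rss'), array('item', 'entry'));

// Made-up sample document in the second format.
$xml   = simplexml_load_string('<feed><entry><title>B</title></entry></feed>');
$found = $xml->xpath($xpath);

echo count($found), " matching element(s)\n";
```

Supporting another format then only means adding its names to the two arrays.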

The rest of this post shows how the complete wanted processing can be done entirely in XSLT -- easily and with elegance.


I. A gentle introduction

Such processing is easy and simple with XSLT:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <myFeed>
   <xsl:apply-templates/>
  </myFeed>
 </xsl:template>

 <xsl:template match="channel|feed">
  <xsl:apply-templates select="*">
   <xsl:sort select="pubDate|published" order="descending"/>
  </xsl:apply-templates>
 </xsl:template>

 <xsl:template match="item|entry">
  <post>
    <xsl:apply-templates mode="identity"/>
  </post>
 </xsl:template>

 <xsl:template match="pubDate|published" mode="identity">
  <publicationDate>
   <xsl:apply-templates/>
  </publicationDate>
 </xsl:template>

  <xsl:template match="node()|@*" mode="identity">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*" mode="identity"/>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied to this XML document (in format 1):

<rss>
    <channel>
        <item>
            <pubDate>2011-06-05</pubDate>
            <title>Title1</title>
            <description>Description1</description>
            <link>Link1</link>
            <author>Author1</author>
        </item>
        <item>
            <pubDate>2011-06-06</pubDate>
            <title>Title2</title>
            <description>Description2</description>
            <link>Link2</link>
            <author>Author2</author>
        </item>
        <item>
            <pubDate>2011-06-07</pubDate>
            <title>Title3</title>
            <description>Description3</description>
            <link>Link3</link>
            <author>Author3</author>
        </item>
    </channel>
</rss>

and when it is applied to this equivalent document (in format 2):

<feed>
        <entry>
            <published>2011-06-05</published>
            <title>Title1</title>
            <description>Description1</description>
            <link>Link1</link>
            <author>Author1</author>
        </entry>
        <entry>
            <published>2011-06-06</published>
            <title>Title2</title>
            <description>Description2</description>
            <link>Link2</link>
            <author>Author2</author>
        </entry>
        <entry>
            <published>2011-06-07</published>
            <title>Title3</title>
            <description>Description3</description>
            <link>Link3</link>
            <author>Author3</author>
        </entry>
</feed>

in both cases the same wanted, correct result is produced:

<myFeed>
   <post>
      <publicationDate>2011-06-07</publicationDate>
      <title>Title3</title>
      <description>Description3</description>
      <link>Link3</link>
      <author>Author3</author>
   </post>
   <post>
      <publicationDate>2011-06-06</publicationDate>
      <title>Title2</title>
      <description>Description2</description>
      <link>Link2</link>
      <author>Author2</author>
   </post>
   <post>
      <publicationDate>2011-06-05</publicationDate>
      <title>Title1</title>
      <description>Description1</description>
      <link>Link1</link>
      <author>Author1</author>
   </post>
</myFeed>

II. The full solution

This can be generalized to a parameterized solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:param name="postElements" select=
 "'|entry|item|'"/>
 <xsl:param name="pub-dateElements" select=
  "'|published|pubDate|'"/>

 <xsl:template match="node()|@*" name="identity">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="/">
  <myFeed>
   <xsl:apply-templates select=
   "//*[contains($postElements, concat('|',name(),'|'))]">
    <xsl:sort order="descending" select=
     "*[contains($pub-dateElements, concat('|',name(),'|'))]"/>
   </xsl:apply-templates>
  </myFeed>
 </xsl:template>

 <xsl:template match="*">
  <xsl:choose>
   <xsl:when test=
    "contains($postElements, concat('|',name(),'|'))">
    <post>
      <xsl:apply-templates/>
    </post>
   </xsl:when>
   <xsl:when test=
   "contains($pub-dateElements, concat('|',name(),'|'))">
    <publicationDate>
     <xsl:apply-templates/>
    </publicationDate>
   </xsl:when>
   <xsl:otherwise>
    <xsl:call-template name="identity"/>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>

</xsl:stylesheet>

This transformation can be used with any number of formats, simply by listing all alternative names for the "entry" element in the external (global) parameter $postElements and all alternative names for the "published date" element in the external (global) parameter $pub-dateElements.

Anyone can try this transformation to verify that, when applied to the two XML documents above, it again produces the same wanted and correct result.
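
Since the question uses PHP, here is a rough sketch of how such a parameter might be overridden from PHP with XSLTProcessor. It assumes the xsl extension is installed. The inline stylesheet is a cut-down stand-in for illustration only, and the <news>/<story> format is invented.

```php
<?php
// Requires PHP's xsl extension; stop early if it is missing.
if (!class_exists('XSLTProcessor')) {
    exit("The xsl extension is not loaded.\n");
}

// Cut-down stand-in for the full stylesheet (illustration only): it keeps
// the $postElements parameter but copies just the <title> of each post.
$stylesheet = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes"/>
 <xsl:param name="postElements" select="'|entry|item|'"/>
 <xsl:template match="/">
  <myFeed>
   <xsl:for-each select=
    "//*[contains($postElements, concat('|',name(),'|'))]">
    <post><xsl:copy-of select="title"/></post>
   </xsl:for-each>
  </myFeed>
 </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($stylesheet);

// Hypothetical third format whose posts are called <story>.
$feed = new DOMDocument();
$feed->loadXML('<news><story><title>T</title></story></news>');

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);
// Override the global parameter: empty namespace, parameter name, new value.
$processor->setParameter('', 'postElements', '|entry|item|story|');

$result = $processor->transformToXml($feed);
echo $result;
```

With the default parameter value the <story> element would not be selected. Passing the extended list is all it takes to support the new format, and the stylesheet itself stays unchanged.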

爷的心禁止访问
#4 · 2019-06-28 02:34

Here's a solution.

The problem is that many RSS or Atom feeds define namespaces, which don't play nicely with SimpleXML. In the example below, I'm using str_replace to replace xmlns= with ns=. I'm then using the name of the root element to determine the type of feed (whether it's RSS or Atom).

The array_push call takes care of adding all of the entries to the $entries array which you can then use later.

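$entries = array();

foreach ( $feeds as $feed )
{
  // $feed is a URL or file path, as in the question: fetch the raw XML first,
  // then neutralize the default namespace so unprefixed XPath names match.
  $xml = simplexml_load_string(str_replace('xmlns=', 'ns=', file_get_contents($feed)));

  switch ( strtolower($xml->getName()) )
  {
    // Atom
    case 'feed':
      // Spread the matches (PHP 5.6+) so $entries stays a flat list of elements.
      array_push($entries, ...$xml->xpath('/feed//entry'));

      break;

    // RSS
    case 'rss':
      array_push($entries, ...$xml->xpath('/rss//item'));

      break;
  }
}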

var_dump($entries);
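
Once the entries sit in one flat array, they can be ordered by date regardless of format. This is a sketch only: entryTimestamp is a hypothetical helper, the two feeds are made-up samples, and it assumes every date is in a form strtotime() understands.

```php
<?php
// Hypothetical helper: Unix timestamp of an RSS <item> or an Atom <entry>.
function entryTimestamp(SimpleXMLElement $entry)
{
    // RSS uses <pubDate>, Atom uses <published>; take whichever is present.
    $date = isset($entry->pubDate) ? (string) $entry->pubDate
                                   : (string) $entry->published;
    $time = strtotime($date);

    // Unparseable or missing dates sort last.
    return $time === false ? 0 : $time;
}

$rss = simplexml_load_string(
    '<rss><channel><item><pubDate>Sun, 05 Jun 2011 10:00:00 +0000</pubDate>'
    . '<title>Older</title></item></channel></rss>'
);
$atom = simplexml_load_string(
    '<feed><entry><published>2011-06-07T10:00:00Z</published>'
    . '<title>Newer</title></entry></feed>'
);

$entries = array_merge($rss->xpath('/rss//item'), $atom->xpath('/feed//entry'));

// Newest first.
usort($entries, function ($a, $b) {
    $ta = entryTimestamp($a);
    $tb = entryTimestamp($b);
    return $ta === $tb ? 0 : ($ta < $tb ? 1 : -1);
});

foreach ($entries as $entry) {
    echo $entry->title, "\n";
}
```

Comparing parsed timestamps rather than raw strings matters here, because RSS and Atom write dates in different textual formats.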

Another solution could be to use Google Reader to aggregate all of your feeds and use that feed instead of all of your separate ones.
