How to scrape the first paragraph from a Wikipedia page

Published 2019-04-17 11:27

Question:

Let's say I want to grab the first paragraph of this Wikipedia page. How do I get the main text between the title and the contents box using XPath, or DOM and PHP, or something similar?

Is there a PHP library for that? I don't want to use the API because it's a bit complex.

Note: I just need this to add a widget under my pages that displays related info from Wikipedia.

Answer 1:

Use the following XPath expression:

/*/h:body//h:h1
  |
   /*/h:body//h:h1/following::node()
      [count(. | //h:table[@id='toc']
                  /preceding::node()
             )
      =
       count(//h:table[@id='toc']
                  /preceding::node()
             )
       ]

Here the prefix h: is bound to the XHTML namespace ("http://www.w3.org/1999/xhtml").
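
The expression relies on the XPath 1.0 idiom count(. | $set) = count($set), which is true exactly when the context node is already a member of $set. Here it keeps only those nodes after the h1 that also precede the table with id 'toc'. As a rough illustration of that selection, here is a minimal Python sketch using only the standard library. The SAMPLE document is a made-up, heavily simplified stand-in for the real page, lead_elements is a hypothetical helper name, and ElementTree exposes elements only, so text nodes are not returned separately as they would be by node().

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
H = "{" + XHTML_NS + "}"  # ElementTree's spelling of the h: prefix

# Assumption: a tiny stand-in for the article, not the real Wikipedia markup.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="content">
      <h1>Example article</h1>
      <p>First paragraph of the lead.</p>
      <p>Second paragraph of the lead.</p>
      <table id="toc"><tr><td>Contents</td></tr></table>
      <h2>History</h2>
      <p>Body text after the contents box.</p>
    </div>
  </body>
</html>"""


def lead_elements(root):
    """Return the h1 plus every element after it and before table#toc."""
    order = list(root.iter())  # all elements in document order
    parent = {child: p for p in order for child in p}

    body = root.find(H + "body")
    h1 = body.find(".//" + H + "h1") if body is not None else None
    toc = root.find(".//" + H + "table[@id='toc']")
    if h1 is None or toc is None:
        return []

    # following:: excludes descendants of the start node.
    h1_descendants = set(h1.iter()) - {h1}
    # preceding:: excludes ancestors of the end node.
    toc_ancestors = set()
    node = toc
    while node in parent:
        node = parent[node]
        toc_ancestors.add(node)

    following = [e for e in order[order.index(h1) + 1:]
                 if e not in h1_descendants]
    preceding = {e for e in order[:order.index(toc)]
                 if e not in toc_ancestors}

    # Intersection of the two sets, kept in document order, plus the h1.
    return [h1] + [e for e in following if e in preceding]


root = ET.fromstring(SAMPLE)
for el in lead_elements(root):
    print(el.tag.replace(H, ""), "->", "".join(el.itertext()).strip())
```

On the sample this yields the h1 and the two lead paragraphs, and nothing from the contents box onward. A library with full XPath 1.0 support could evaluate the original expression directly instead of walking the tree by hand.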

The following XSLT transformation confirms that the expression selects the wanted nodes:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:h="http://www.w3.org/1999/xhtml"
 >
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/h:body//h:h1
  |
   /*/h:body//h:h1/following::node()
      [count(. | //h:table[@id='toc']
                  /preceding::node()
             )
      =
       count(//h:table[@id='toc']
                  /preceding::node()
             )
       ]
  "/>
 </xsl:template>
</xsl:stylesheet>

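When this transformation is run on the XHTML document of the Wikipedia article, the wanted result is produced. Note that the two entities &nbsp; and &reg; must also be defined for this document, because an XML parser rejects entity references that have no declaration.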
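
One way to define the two entities is an internal DTD subset placed in front of the document. The sketch below shows this in Python with the standard library parser. PAGE is a made-up fragment rather than the real article source, parse_with_entities is a hypothetical helper, and the approach assumes the saved page has no XML declaration or DOCTYPE of its own.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring the two entities the answer mentions.
ENTITY_DOCTYPE = (
    '<!DOCTYPE html [\n'
    '  <!ENTITY nbsp "&#160;">\n'
    '  <!ENTITY reg "&#174;">\n'
    ']>\n'
)

# Assumption: a tiny fragment standing in for the saved article source.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>Wikipedia&reg; is&nbsp;a trademark.</p>'
    '</body></html>'
)


def parse_with_entities(xhtml):
    """Parse XHTML text after prepending the entity declarations.

    Assumes the text carries no XML declaration or DOCTYPE of its own.
    """
    return ET.fromstring(ENTITY_DOCTYPE + xhtml)


root = parse_with_entities(PAGE)
paragraph = root.find(".//{http://www.w3.org/1999/xhtml}p")
print(repr(paragraph.text))
```

Without the prepended declarations, parsing the same fragment fails with an undefined-entity error, which is the problem the answer is pointing at.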