Fetch excerpt from Wikipedia article?

Posted 2020-02-03 07:15

Question:

I've been up and down the Wikipedia API, but I can't figure out if there's a nice way to fetch the excerpt of an article (usually the first paragraph). It would be nice to get the HTML formatting of that paragraph, too.

The only way I currently see of getting something that resembles a snippet is by performing a fulltext search (example), but that's not really what I want (too short).

Is there any other way to fetch the first paragraph of a Wikipedia article than barbarically parsing HTML/WikiText?

Answer 1:

Use this link to get the intro of an article in XML form: http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exsentences=10&titles=Aati kalenja

Earlier I could get the introductions of a list of topics/articles from a category on a single page by adding iframes with src attributes like the link above. But now Chrome throws this error: "Refused to display document because display forbidden by X-Frame-Options." Is there any way around this? Please help.
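A hypothetical workaround, not part of the original answer: fetch the same API URL server-side instead of embedding it in an iframe, which sidesteps the X-Frame-Options restriction entirely. A minimal PHP sketch, assuming file_get_contents is allowed to make HTTP requests and setting a User-Agent header because Wikimedia may reject clients that send none:

// Hypothetical sketch: fetch the extract server-side instead of via an <iframe>.
$title = 'Aati kalenja';
$url = 'https://en.wikipedia.org/w/api.php?format=xml&action=query'
     . '&prop=extracts&exsentences=10&titles=' . urlencode($title);
$ctx = stream_context_create(['http' => ['header' => "User-Agent: ExcerptFetcher/1.0 (example)\r\n"]]);
$xml = simplexml_load_string(file_get_contents($url, false, $ctx));
// In the XML response the text sits at /api/query/pages/page/extract.
echo (string) $xml->query->pages->page->extract;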



Answer 2:

I found no way of doing this through the API, so I resorted to parsing the HTML using PHP's DOM functions. This was pretty easy, something along the lines of:

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // silence warnings triggered by Wikipedia's HTML5 markup
$doc->loadHTML($wikiPage);         // $wikiPage holds the article's HTML (see below)
$xpath = new DOMXPath($doc);
$nlPNodes = $xpath->query('//div[@id="bodyContent"]/p');
$nFirstP = $nlPNodes->item(0);
$sFirstP = $doc->saveXML($nFirstP);
echo $sFirstP; // echo the first paragraph of the wiki article, including <p></p>
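For completeness: $wikiPage is assumed to hold the full HTML of the article page; the original answer doesn't show how to obtain it. One possible way (note that Wikipedia's markup may have changed since the answer was written, so the XPath may need adjusting) is:

// Hypothetical: fetch the full article page whose HTML is parsed above.
$wikiPage = file_get_contents('https://en.wikipedia.org/wiki/Stack_Overflow');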


Answer 3:

As ARAVIND VR notes, on wikis running the MobileFrontend extension — which includes Wikipedia — you can easily get an excerpt of an article via the MediaWiki API by using the prop=extracts API query.

For example, this link will give you a short excerpt of the Stack Overflow article on Wikipedia in a JSON wrapper.

The various options to the query can be used to control the format of the excerpt (HTML or plain text), its maximum length (in characters or sentences), whether it is restricted to the intro section of the article, and the formatting of section headings in the output. It's also possible to obtain intro extracts for more than one article in a single query.
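For illustration only (the concrete parameter values below are my own assumptions, not taken from the original answer), a query for plain-text intro extracts of two articles in one request could look like this:

// Illustrative query: plain-text intro extracts for two articles in one request.
// exintro     - only the part before the first section heading
// explaintext - plain text instead of limited HTML
// exlimit     - number of extracts when several titles are given
// (exchars / exsentences cap the length; exsectionformat controls heading markup)
$url = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts'
     . '&exintro=1&explaintext=1&exlimit=2'
     . '&titles=' . urlencode('Stack Overflow') . '|' . urlencode('PHP');
$ctx = stream_context_create(['http' => ['header' => "User-Agent: ExcerptFetcher/1.0 (example)\r\n"]]);
$data = json_decode(file_get_contents($url, false, $ctx), true);
foreach ($data['query']['pages'] as $page) {
    echo $page['title'] . ': ' . $page['extract'] . "\n\n";
}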



Answer 4:

It's possible to get only the "introduction" of the article using the API, with the parameter rvsection=0 as explained here.
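As a sketch (the concrete query below is my own illustration, not part of the original answer), the raw wikitext of section 0 can be fetched like this and then fed into the cleanup code further down:

// Illustrative query: raw wikitext of the intro ("section 0") of one article.
$url = 'https://en.wikipedia.org/w/api.php?format=json&action=query'
     . '&prop=revisions&rvprop=content&rvsection=0'
     . '&titles=' . urlencode('Stack Overflow');
$ctx = stream_context_create(['http' => ['header' => "User-Agent: ExcerptFetcher/1.0 (example)\r\n"]]);
$data = json_decode(file_get_contents($url, false, $ctx), true);
$page = current($data['query']['pages']);
$c = $page['revisions'][0]['*'];   // raw wikitext, used by the cleanup code below
$html = true;                      // whether the cleanup should emit HTML tags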

Converting Wiki-text to HTML is a bit more difficult; I guess there are more complete/official methods, but this is what I ended up doing:

// $c holds the raw wikitext; $html is a flag selecting HTML or plain-text output
// remove templates (even nested)
do {
    $c = preg_replace('/[{][{][^{}]+[}][}]\n?/', '', $c, -1, $count);
} while ($count > 0);
// remove HTML comments
$c = preg_replace('/<!--(?:[^-]|-[^-]|--[^>])+-->\n?/', '', $c);
// remove links
$c = preg_replace('/[[][[](?:[^]|]+[|])?([^]]+)[]][]]/', '$1', $c);
$c = preg_replace('/[[]http[^ ]+ ([^]]+)[]]/', '$1', $c);
// remove footnotes
$c = preg_replace('#<ref(?:[^<]|<[^/])+</ref>#', '', $c);
// remove leading and trailing spaces
$c = trim($c);
// convert bold and italic
$c = preg_replace("/'''((?:[^']|'[^']|''[^'])+)'''/", $html ? '<b>$1</b>' : '$1', $c);
$c = preg_replace("/''((?:[^']|'[^'])+)''/", $html ? '<i>$1</i>' : '$1', $c);
// add newlines
if ($html) $c = preg_replace('/(\n)/', '<br/>$1', $c);
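
After these replacements, $c holds the cleaned excerpt (with basic HTML tags or plain text, depending on $html) and can simply be echoed.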