I've been up and down the Wikipedia API, but I can't figure out if there's a nice way to fetch the excerpt of an article (usually the first paragraph). It would be nice to get the HTML formatting of that paragraph, too.
The only way I currently see of getting something that resembles a snippet is by performing a fulltext search (example), but that's not really what I want (too short).
Is there any other way to fetch the first paragraph of a Wikipedia article than barbarically parsing HTML/WikiText?
Use this link to get the unparsed intro in xml form
"http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exsentences=10&titles=Aati kalenja"
Earlier I could get the introduction of a list of topics/articles from a category in a single page by adding iframes with src like the above link.. But now chrome is throwing this error - "Refused to display document because display forbidden by X-Frame-Options." Any way through? Pls help..
I found no way of doing this through the API, so I resorted to parsing HTML, using PHP's DOM functions. This was pretty easy, something among the lines of:
$doc = new DOMDocument();
$doc->loadHTML($wikiPage);
$xpath = new DOMXpath($doc);
$nlPNodes = $xpath->query('//div[@id="bodyContent"]/p');
$nFirstP = $nlPNodes->item(0);
$sFirstP = $doc->saveXML($nFirstP);
echo $sFirstP; // echo the first paragraph of the wiki article, including <p></p>
As ARAVIND VR notes, on wikis running the MobileFrontend extension — which includes Wikipedia — you can easily get an excerpt of an article via the MediaWiki API by using the prop=extracts
API query.
For example, this link will give you a short excerpt of the Stack Overflow article on Wikipedia in a JSON wrapper.
The various options to the query can be used to control the excerpt format (HTML or plain text), its maximum length (in characters and/or sentences, and optionally restricting it to the intro section of the article) and the formatting of section headings in the output. It's also possible to obtain intro extracts from more than one article in a single query.
It's possible to get only the "introduction" of the article using the API, with the parameter rvsection=0
as explained here.
Converting Wiki-text to HTML is a bit more difficult; I guess there are more complete/official methods, but this is what I ended up doing:
// remove templates (even nested)
do {
$c = preg_replace('/[{][{][^{}]+[}][}]\n?/', '', $c, -1, $count);
} while ($count > 0);
// remove HTML comments
$c = preg_replace('/<!--(?:[^-]|-[^-]|[[[^>])+-->\n?/', '', $c);
// remove links
$c = preg_replace('/[[][[](?:[^]|]+[|])?([^]]+)[]][]]/', '$1', $c);
$c = preg_replace('/[[]http[^ ]+ ([^]]+)[]]/', '$1', $c);
// remove footnotes
$c = preg_replace('#<ref(?:[^<]|<[^/])+</ref>#', '', $c);
// remove leading and trailing spaces
$c = trim($c);
// convert bold and italic
$c = preg_replace("/'''((?:[^']|'[^']|''[^'])+)'''/", $html ? '<b>$1</b>' : '$1', $c);
$c = preg_replace("/''((?:[^']|'[^'])+)''/", $html ? '<i>$1</i>' : '$1', $c);
// add newlines
if ($html) $c = preg_replace('/(\n)/', '<br/>$1', $c);