Wikipedia articles may have Infobox templates. With the following call I can get the first section of an article, which includes the Infobox.
http://en.wikipedia.org/w/api.php?action=parse&pageid=568801&section=0&prop=wikitext
What I want is a query that returns only the Infobox data. Is this possible?
You can do it with a URL call to the Wikipedia API like this:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0
Replace the titles= section with your page title, and format=xmlfm with format=json if you want the article in JSON format.
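As a rough illustration of the call above, here is a minimal Python sketch that builds the same query with format=json, pulls the wikitext out of a decoded response, and isolates the first {{Infobox ...}} template by counting braces. The function names (build_query_url, first_revision_text, extract_infobox) are made up for this example, and it assumes the legacy JSON layout in which the revision content sits under the "*" key. The brace matching is naive and will miss infoboxes whose template name does not start with "Infobox".

```python
import re
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build the section-0 revisions query for `title`, asking for JSON."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": 0,
        "format": "json",
        "titles": title,
    }
    return API + "?" + urlencode(params)


def first_revision_text(resp):
    """Return the wikitext from a decoded JSON response, or None."""
    # 'pages' is keyed by page ID, so take the first (only) entry.
    page = next(iter(resp["query"]["pages"].values()))
    revisions = page.get("revisions", [])
    # Assumption: legacy JSON layout, with the content under the "*" key.
    return revisions[0].get("*") if revisions else None


def extract_infobox(wikitext):
    """Return the first {{Infobox ...}} template, matched by brace depth."""
    match = re.search(r"\{\{\s*infobox", wikitext, re.IGNORECASE)
    if match is None:
        return None
    start = match.start()
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # braces never balanced
```

To use it, fetch build_query_url("Scary Monsters and Nice Sprites") with any HTTP client, decode the JSON, and pass the result through first_revision_text and then extract_infobox.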
Instead of parsing infoboxes yourself, which is quite complicated, take a look at DBpedia, which has Wikipedia infoboxes extracted as database objects.
Building on @garry's answer, you can have Wikipedia parse the infobox into HTML for you via the rvparse parameter, like so:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0&rvparse
Note that neither method will return just the infobox. But from the HTML content, you can extract (via, e.g., BeautifulSoup) the table with class infobox.
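In Python, you can do something like the following:
import requests

url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0&rvparse"
resp = requests.get(url).json()
# 'pages' is keyed by page ID, so take the first (only) entry
page_one = next(iter(resp['query']['pages'].values()))
revisions = page_one.get('revisions', [])
# the content sits under the '*' key; a revision may carry other keys too
html = revisions[0]['*']
# now parse the html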
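For the parsing step itself, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup, so it runs without extra installs. It collects (header, value) text pairs from the first table whose class list contains infobox. The class and function names (InfoboxRows, infobox_rows) are invented for this example, and it assumes the HTML is well formed with one th and one td per row, which will not hold for every infobox.

```python
from html.parser import HTMLParser


class InfoboxRows(HTMLParser):
    """Collect (header, value) text pairs from the first table.infobox."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # table nesting depth, counted from the infobox
        self.done = False   # set once the infobox table has closed
        self.cell = None    # "th" or "td" while inside a top-level cell
        self.row = {"th": "", "td": ""}
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table":
            # Start counting at the infobox; nested tables add depth.
            if self.depth or "infobox" in classes:
                self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self.row = {"th": "", "td": ""}
        elif self.depth == 1 and tag in ("th", "td"):
            self.cell = tag

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
        elif self.depth == 1 and tag in ("th", "td"):
            self.cell = None
        elif self.depth == 1 and tag == "tr":
            # Collapse runs of whitespace before storing the row.
            header = " ".join(self.row["th"].split())
            value = " ".join(self.row["td"].split())
            if header or value:
                self.rows.append((header, value))

    def handle_data(self, data):
        if self.depth and self.cell:
            self.row[self.cell] += data


def infobox_rows(html):
    """Return the infobox rows found in `html` as a list of tuples."""
    parser = InfoboxRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling infobox_rows(html) on the parsed section gives a list you can turn into a dict; with BeautifulSoup the same lookup would be a search for a table element with class infobox.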
If the page has a right-side infobox, you can use this URL to obtain it in text form.
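My example uses the element Hydrogen. All you need to do is replace "Hydrogen" with your title. Note that this only works when the infobox has its own template page, as it does for the chemical elements; most articles embed a generic infobox template directly in the article text.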
https://en.wikipedia.org/w/index.php?action=raw&title=Template:Infobox%20hydrogen
If you are looking for JSON format, use this URL, but it's not pretty.
https://en.wikipedia.org/w/api.php?action=parse&page=Template:Infobox%20hydrogen&format=json