This question is a bit hard to follow, but I will do my best to explain it. First, let me present an example page:
http://en.wikipedia.org/wiki/African_bush_elephant
That's a Wikipedia page, a species page in particular, since it has the 'taxobox' on the right. I'm trying to parse the attributes in that taxobox using PHP. There are two ways to create such a taxobox in Wikipedia: manually, or by using the special "auto taxobox" template.
I can parse the manual one: I use Wikipedia's API to return the page's content in JSON format, then use some regular expressions to extract those properties.
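For reference, the fetch-and-regex approach described above could look roughly like the following. This is only a sketch: the function names (buildWikitextUrl, extractWikitext, readTaxoboxField) are made up for illustration, and the response layout assumed here is the legacy JSON format of the revisions query, with the wikitext stored under the "*" key.

```php
<?php
// Build a MediaWiki API URL that returns the raw wikitext of one page as JSON.
// (Function name is hypothetical; the query parameters are standard API ones.)
function buildWikitextUrl(string $title): string
{
    $params = [
        'action'  => 'query',
        'prop'    => 'revisions',
        'rvprop'  => 'content',
        'rvslots' => 'main',
        'titles'  => $title,
        'format'  => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Pull the wikitext out of a decoded API response; returns null if none is found.
// Assumption: legacy JSON layout, with content under the "*" key, either
// inside slots.main or directly on the revision.
function extractWikitext(array $response): ?string
{
    foreach ($response['query']['pages'] ?? [] as $page) {
        $rev = $page['revisions'][0] ?? null;
        if ($rev === null) {
            continue;
        }
        return $rev['slots']['main']['*'] ?? $rev['*'] ?? null;
    }
    return null;
}

// Read a single "| key = value" pair with a regular expression.
// This is deliberately naive: it stops at the next "|", "}" or newline,
// so values containing links or nested templates are cut short.
function readTaxoboxField(string $wikitext, string $key): ?string
{
    $pattern = '/\|\s*' . preg_quote($key, '/') . '\s*=\s*([^|\n}]*)/';
    if (preg_match($pattern, $wikitext, $m)) {
        return trim($m[1]);
    }
    return null;
}
```

Usage would be something like json_decode(file_get_contents(buildWikitextUrl('African bush elephant')), true), passed to extractWikitext and then to readTaxoboxField for each property of interest.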
In the case of an auto taxobox, however, the content returned is like this:
> {{automatic taxobox | name = African Bush Elephant<ref
> name=MSW3>{{MSW3 Proboscidea | id = 11500009 | page =
> 91}}</ref> | status = VU | status_system = iucn3.1 | status_ref
> = <ref name=IUCN>{{IUCN2010|assessors=Blanc, J.|year=2008|version=2010.1|id=12392|title=Loxodonta
> africana|downloaded=04 April 2010}}</ref> | trend = unknown |
> image = African Bush Elephant.jpg | taxon = Loxodonta africana |
> synonyms = ''Loxodonta africana africana'' | binomial = ''Loxodonta
> africana'' | binomial_authority = ([[Johann Friedrich
> Blumenbach|Blumenbach]], 1797) }}
If you compare this with the actual page as you see it on Wikipedia, you'll notice that several attributes are missing. For example, the property "Kingdom" is visible on the real page but not returned here. There are more properties missing like that.
This is likely because the template needs Wikipedia's server-side processing to be transformed into actual output. I learned that the API has an "expandtemplates" action, to which you can send a snippet like the one above and get back the result as the user would see it. I'm using this for several templates and it works, but unfortunately not for the auto taxobox template.
With the auto taxobox, the template doesn't actually expand. Instead, the result shows more templates, nested and repeated several times.
So now I'm stuck trying to read these properties from pages that have the auto taxobox template. The only other direction I can think of is to not use the API and to just parse the html of the actual page. That would be doable for some properties, but others are extremely fragile to parse.
This is a snippet of working PHP template parsing code.
The goal is to have an array ($data) that looks like:
$data[page name] = array(key1=>val1, key2=>val2...);
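As an illustration of how an array of that shape could be built (a sketch only, not the snippet referred to above), the function below splits one template call into key/value pairs. The name parseTemplateParams is hypothetical. It assumes it is handed exactly one template, from the opening "{{" to the matching "}}", and it only tracks double braces and double brackets, so triple-brace parameters and other edge cases are not handled.

```php
<?php
// Split one template call ("{{name | key = value | ...}}") into key => value pairs.
// Pipes inside nested {{...}} templates or [[...]] links are not treated as separators.
function parseTemplateParams(string $template): array
{
    // Drop the outer "{{" and "}}".
    $body = substr(trim($template), 2, -2);

    $parts = [];
    $current = '';
    $depth = 0; // nesting level of {{ }} and [[ ]]
    $len = strlen($body);
    for ($i = 0; $i < $len; $i++) {
        $pair = substr($body, $i, 2);
        if ($pair === '{{' || $pair === '[[') {
            $depth++;
            $current .= $pair;
            $i++; // consume the second character of the pair
        } elseif ($pair === '}}' || $pair === ']]') {
            $depth--;
            $current .= $pair;
            $i++;
        } elseif ($body[$i] === '|' && $depth === 0) {
            // A top-level pipe ends the current parameter.
            $parts[] = $current;
            $current = '';
        } else {
            $current .= $body[$i];
        }
    }
    $parts[] = $current;

    array_shift($parts); // the first part is the template name, not a parameter

    $params = [];
    foreach ($parts as $part) {
        $eq = strpos($part, '=');
        if ($eq === false) {
            continue; // skip positional (unnamed) parameters
        }
        $params[trim(substr($part, 0, $eq))] = trim(substr($part, $eq + 1));
    }
    return $params;
}

// Example with a shortened version of the taxobox from the question.
$wikitext = '{{automatic taxobox | status = VU | taxon = Loxodonta africana '
          . '| binomial_authority = ([[Johann Friedrich Blumenbach|Blumenbach]], 1797) }}';
$data = [];
$data['African bush elephant'] = parseTemplateParams($wikitext);
```

Note that this only returns parameters written literally in the page source, so for an auto taxobox the computed properties such as "Kingdom" would still be absent.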
Instead of reinventing the wheel, check out DBPedia, which has already extracted everything possible from Wikipedia templates and made it public in a variety of easily parsable formats.
Use action=parse instead of action=expandtemplates. As you've noticed, expandtemplates only expands a single level; additionally, it won't fully preprocess input (e.g., it won't successfully handle certain variable references inside templates).
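A rough sketch of what a request with action=parse might look like from PHP follows. The function names (buildParseUrl, extractRenderedHtml) are invented for the example, and it assumes the legacy JSON layout in which the rendered HTML sits under parse.text["*"].

```php
<?php
// Build an API URL that asks MediaWiki to fully render a page to HTML.
// (Function name is hypothetical; action=parse, page and prop=text are API parameters.)
function buildParseUrl(string $title): string
{
    $params = [
        'action' => 'parse',
        'page'   => $title,
        'prop'   => 'text',
        'format' => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Return the rendered HTML from a decoded action=parse response, or null if missing.
// Assumption: legacy JSON layout, with the HTML stored under the "*" key.
function extractRenderedHtml(array $response): ?string
{
    return $response['parse']['text']['*'] ?? null;
}

// Typical use (commented out because it needs network access):
// $json = file_get_contents(buildParseUrl('African_bush_elephant'));
// $html = extractRenderedHtml(json_decode($json, true));
```

The result is HTML rather than wikitext, so the taxobox rows would still have to be read out of it with an HTML parser.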