How to get the result of a complex Wikipedia templ

Posted 2019-05-19 00:44

This question is a bit hard to follow, but I will do my best to explain it. First, let me present an example page:

http://en.wikipedia.org/wiki/African_bush_elephant

That's a Wikipedia page, a species page in particular, since it has the 'taxobox' on the right. I'm trying to parse the attributes in that taxobox using PHP. There are two ways to create such a taxobox on Wikipedia: manually, or by using the special "auto taxobox" template.

I can parse the manual one. I use Wikipedia's API to return the page's content in JSON format, then I use some regular expressions to extract those properties.
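For readers following along, the API step could look roughly like the sketch below. The helper names `buildWikitextUrl` and `extractWikitext` are made up for illustration, and the response shape assumes the `formatversion=2` and `rvslots=main` parameters shown in the request.

```php
<?php
// Hypothetical helper: builds a request for the raw wikitext of one page.
function buildWikitextUrl(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action'        => 'query',
        'prop'          => 'revisions',
        'rvprop'        => 'content',
        'rvslots'       => 'main',
        'titles'        => $title,
        'format'        => 'json',
        'formatversion' => 2,
    ]);
}

// Hypothetical helper: pulls the wikitext out of the JSON response body.
// Returns null when the JSON is invalid or the expected path is missing.
function extractWikitext(string $json): ?string
{
    $response = json_decode($json, true);
    return $response['query']['pages'][0]['revisions'][0]['slots']['main']['content'] ?? null;
}
```

The response body would come from something like `file_get_contents(buildWikitextUrl('African bush elephant'))`, ideally with a descriptive User-Agent set; the regular expressions then run on the returned string.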

In the case of an auto taxobox, however, the content returned is like this:

> {{automatic taxobox | name = African Bush Elephant<ref
> name=MSW3>{{MSW3 Proboscidea | id = 11500009 | page =
> 91}}</ref> | status = VU | status_system = iucn3.1 | status_ref
> = <ref name=IUCN>{{IUCN2010|assessors=Blanc, J.|year=2008|version=2010.1|id=12392|title=Loxodonta
> africana|downloaded=04 April 2010}}</ref> | trend = unknown |
> image = African Bush Elephant.jpg | taxon = Loxodonta africana |
> synonyms = ''Loxodonta africana africana'' | binomial = ''Loxodonta
> africana'' | binomial_authority = ([[Johann Friedrich
> Blumenbach|Blumenbach]], 1797) }}

If you compare this with the actual page as you would see it on Wikipedia, you'll notice that several attributes are missing. For example, the property "Kingdom" is visible on the real page but not returned here. There are more properties missing like that.

This is likely because the template needs Wikipedia's server-side processing to turn it into actual output. I learned that the API has an "expandtemplates" action, to which you can send a snippet like the one above and get back the result as the user would see it. I'm using this for several templates and it works, but unfortunately not for the auto taxobox template. Click this link to see what expandtemplates returns:
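For reference, an expandtemplates request of the kind described above might be built like this. It is a minimal sketch: the function name `buildExpandTemplatesUrl` is hypothetical, and the parameter choice (`prop=wikitext`, `format=json`) is one plausible combination, not necessarily the one used for the link.

```php
<?php
// Hypothetical helper: builds an expandtemplates request for a wikitext snippet.
// $title sets the page context the snippet is expanded in.
function buildExpandTemplatesUrl(string $wikitext, string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action' => 'expandtemplates',
        'format' => 'json',
        'prop'   => 'wikitext',
        'title'  => $title,
        'text'   => $wikitext,
    ]);
}

$url = buildExpandTemplatesUrl(
    '{{automatic taxobox | taxon = Loxodonta africana}}',
    'African bush elephant'
);
echo $url, "\n";
```

With these parameters the expanded text should come back under `expandtemplates.wikitext` in the JSON response. For long snippets, a POST request is the safer choice because of URL length limits.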

complete link

As you can see, the template doesn't actually expand. Instead, it shows more templates, nested and repeated several times.

So now I'm stuck trying to read these properties from pages that use the auto taxobox template. The only other direction I can think of is to skip the API and parse the HTML of the actual page. That would be doable for some properties, but others are extremely fragile to parse.

3 answers
Viruses.
#2 · 2019-05-19 01:00

This is a snippet of working PHP template-parsing code.

The goal is to have an array ($data) that looks like:

$data[page name] = array(key1=>val1, key2=>val2...);

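    // Sketch for a local MediaWiki database. It assumes an older schema that still has
    // templatelinks.tl_title and revision.rev_text_id, and text rows stored uncompressed.
    // The removed mysql_* functions are replaced with mysqli_*.
    $data = array();

    $db = mysqli_connect('localhost', 'root', 'password', 'my_wiki');

    // All pages that transclude the speciesbox template.
    $query = "select page_id, page_title from templatelinks left join page on templatelinks.tl_from=page.page_id where tl_title='speciesbox' order by page_title";
    $result = mysqli_query($db, $query);

    while($row = mysqli_fetch_object($result))
    {
            // Latest revision of the page.
            $q2 = "select rev_text_id from revision where rev_page=".(int)$row->page_id." order by rev_timestamp desc limit 1";
            if(($res2 = mysqli_query($db, $q2)) && ($row2 = mysqli_fetch_object($res2)))
            {
                    $q3 = "select old_text from text where old_id=".(int)$row2->rev_text_id;
                    if(($res3 = mysqli_query($db, $q3)) && ($row3 = mysqli_fetch_object($res3)))
                    {
                        // Match balanced {{...}} blocks; only the first one on the page is used.
                        if(!preg_match_all('/\{\{(?:[^{}]|(?R))*}}/', $row3->old_text, $info)) continue;

                        // Drop the closing "}}" and split on "|". Note that this also splits
                        // inside nested templates and piped links such as [[A|B]].
                        $kvs = explode("|", substr($info[0][0], 0, -2));

                        $item = array();

                        foreach($kvs as $kv)
                        {
                                $kv = trim($kv);
                                if($kv == "") continue;
                                $eq = strpos($kv, "=");
                                if($eq === false) continue;   // e.g. the leading "{{speciesbox" piece
                                $key = trim(substr($kv, 0, $eq));
                                $val = trim(substr($kv, $eq+1));
                                $item[$key] = $val;
                        }
                        if(sizeof($item) > 0)
                        {
                               $title = str_replace("_", " ", $row->page_title);
                               $data[$title] = $item;
                        }
                    }
            }
    }

    foreach($data as $page=>$item)
    {
            // Use $page (the page title) and $item (its key => value pairs) here.
    }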
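Splitting on every `|` breaks values that contain nested templates or piped links, such as the `<ref>{{MSW3 Proboscidea | id = ... }}</ref>` and `[[Johann Friedrich Blumenbach|Blumenbach]]` values in the question. A possible workaround, sketched here with a hypothetical `parseTemplateParams` function that is not part of the snippet above, is to track nesting depth and split only at the top level:

```php
<?php
// Hypothetical helper: parses "{{name | key = value | ...}}" into key => value pairs,
// splitting on "|" only when not inside a nested {{...}} or [[...]].
function parseTemplateParams(string $template): array
{
    $body = substr(trim($template), 2, -2);   // drop the outer {{ and }}
    $parts = [];
    $buffer = '';
    $depth = 0;
    $length = strlen($body);

    for ($i = 0; $i < $length; $i++) {
        $pair = substr($body, $i, 2);
        if ($pair === '{{' || $pair === '[[') {
            $depth++;
            $buffer .= $pair;
            $i++;                              // consumed two characters
        } elseif (($pair === '}}' || $pair === ']]') && $depth > 0) {
            $depth--;
            $buffer .= $pair;
            $i++;
        } elseif ($body[$i] === '|' && $depth === 0) {
            $parts[] = $buffer;                // top-level separator
            $buffer = '';
        } else {
            $buffer .= $body[$i];
        }
    }
    $parts[] = $buffer;

    $params = [];
    foreach (array_slice($parts, 1) as $part) {   // the first part is the template name
        $eq = strpos($part, '=');
        if ($eq === false) {
            continue;                              // positional parameters are skipped
        }
        $params[trim(substr($part, 0, $eq))] = trim(substr($part, $eq + 1));
    }
    return $params;
}
```

This only returns what is written in the page source. It does not add values that the template computes on the server, so properties such as "Kingdom" still will not appear for an automatic taxobox.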
欢心
#3 · 2019-05-19 01:17

Instead of reinventing the wheel, check out DBPedia, which has already extracted everything possible from Wikipedia templates and made it public in a variety of easily parsable formats.
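As a rough illustration of what reading DBpedia could look like: the sketch below assumes the per-resource JSON export at `https://dbpedia.org/data/<Title>.json`, where the document is keyed by subject URI, then predicate URI, with each object carrying a `value` field. The function names are hypothetical, and which properties actually exist for a given species page has to be checked against the live data.

```php
<?php
// Hypothetical helper: URL of the JSON export for one DBpedia resource (assumed layout).
function dbpediaDataUrl(string $pageTitle): string
{
    return 'https://dbpedia.org/data/' . rawurlencode(str_replace(' ', '_', $pageTitle)) . '.json';
}

// Hypothetical helper: flattens the decoded JSON document into shortName => list of values.
function dbpediaProperties(array $doc, string $pageTitle): array
{
    $subject = 'http://dbpedia.org/resource/' . str_replace(' ', '_', $pageTitle);
    $props = [];
    foreach ($doc[$subject] ?? [] as $predicate => $objects) {
        // Keep only the part of the predicate URI after the last "/" or "#".
        $key = preg_replace('~^.*[/#]~', '', $predicate);
        foreach ($objects as $object) {
            $props[$key][] = $object['value'];
        }
    }
    return $props;
}
```

Usage would be along the lines of `dbpediaProperties(json_decode(file_get_contents(dbpediaDataUrl($title)), true), $title)`. Short names from different namespaces can collide in this sketch, so keeping the full predicate URI as the key may be preferable in practice.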

Ridiculous、
#4 · 2019-05-19 01:26

Use action=parse instead of action=expandtemplates. As you've noticed, expandtemplates only expands a single level; additionally, it won't fully preprocess input (e.g., it won't successfully handle certain variable references inside templates).
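A sketch of how that could be wired up in PHP follows. `action=parse` returns rendered HTML, so the taxobox rows still have to be read from the markup. The function names are hypothetical, and the selector assumes the taxobox is rendered as a table whose class list includes `infobox`, with each classification row made of two `<td>` cells; check this against the real output before relying on it.

```php
<?php
// Hypothetical helper: request the rendered HTML of a page via action=parse.
function buildParseUrl(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action'        => 'parse',
        'page'          => $title,
        'prop'          => 'text',
        'format'        => 'json',
        'formatversion' => 2,   // with this, parse.text is a plain string
    ]);
}

// Hypothetical helper: reads two-cell rows ("Kingdom:" / "Animalia") from infobox tables.
function parseInfoboxRows(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);      // tolerate non-strict markup
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // hint that the input is UTF-8
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = $xpath->query(
        '//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")]//tr'
    );

    $fields = [];
    foreach ($rows as $row) {
        $cells = $xpath->query('./td', $row);
        if ($cells->length !== 2) {
            continue;                                   // headers, images, spanning rows
        }
        $key = rtrim(trim($cells->item(0)->textContent), ':');
        $value = trim(preg_replace('/\s+/u', ' ', $cells->item(1)->textContent));
        if ($key !== '') {
            $fields[$key] = $value;
        }
    }
    return $fields;
}
```

The HTML would come from the `parse.text` field of the decoded response to `buildParseUrl('African bush elephant')`. This is the same HTML-parsing route the question calls fragile, but going through the API at least avoids the surrounding page chrome.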
