How to get the result of a complex Wikipedia template

This question is a bit hard to explain, but I will do my best. First, an example page:

http://en.wikipedia.org/wiki/African_bush_elephant

That's a Wikipedia page, a species page in particular, since it has the 'taxobox' on the right. I'm trying to parse the attributes in that taxobox using PHP. There are two ways to create such a taxobox in Wikipedia: manually, or by using the special "auto taxobox" template.

I can parse the manual one: I use Wikipedia's API to return the page's content in JSON format, then use some regular expressions to extract the properties.

In the case of an auto taxobox, however, the content returned is like this:

> {{automatic taxobox | name = African Bush Elephant<ref
> name=MSW3>{{MSW3 Proboscidea | id = 11500009 | page =
> 91}}</ref> | status = VU | status_system = iucn3.1 | status_ref
> = <ref name=IUCN>{{IUCN2010|assessors=Blanc, J.|year=2008|version=2010.1|id=12392|title=Loxodonta
> africana|downloaded=04 April 2010}}</ref> | trend = unknown |
> image = African Bush Elephant.jpg | taxon = Loxodonta africana |
> synonyms = ''Loxodonta africana africana'' | binomial = ''Loxodonta
> africana'' | binomial_authority = ([[Johann Friedrich
> Blumenbach|Blumenbach]], 1797) }}

If you compare this with the actual page as it appears on Wikipedia, you'll notice that several attributes are missing. For example, the property "Kingdom" is visible on the real page but is not returned here. Other properties are missing in the same way.

This is likely because the template needs Wikipedia's server-side processing to be transformed into actual output. I learned that the API has an "expandtemplates" action, to which you can send a snippet like the one above and get back the result as the user would see it. I'm using this for several templates and it works, but unfortunately not for the auto taxobox template. Click this link to see what expandtemplates returns:

complete link

As you can see, the template doesn't actually expand. Instead, it shows more templates, nested and repeated several times.

So now I'm stuck trying to read these properties from pages that use the auto taxobox template. The only other direction I can think of is to skip the API and parse the HTML of the actual page. That would be doable for some properties, but others are extremely fragile to parse.

Tags: php parsing mediawiki wikipedia

3 answers

Viruses.

#2 · 2019-05-19 01:00

This is a snippet of working PHP template-parsing code.

The goal is to have an array ($data) that looks like:

$data[page name] = array(key1=>val1, key2=>val2...);

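    // Uses mysqli: the mysql_* functions were removed in PHP 7.
    // The queries assume an older MediaWiki schema that still has revision.rev_text_id.
    $data = array();

    $sql_conn = new mysqli('localhost', 'root', 'password', 'my_wiki');
    if($sql_conn->connect_error)
    {
        die('Connection failed: ' . $sql_conn->connect_error);
    }

    // All pages that transclude the speciesbox template.
    $query = "select * from templatelinks left join page on templatelinks.tl_from=page.page_id where tl_title='speciesbox' order by page_title";
    $result = $sql_conn->query($query);

    while($result && ($row = $result->fetch_object()))
    {
        // Text id of the page's latest revision.
        $q2 = "select rev_text_id from revision where rev_page=".(int)$row->page_id." order by rev_timestamp desc limit 1";
        if(!($res2 = $sql_conn->query($q2)) || !($row2 = $res2->fetch_object())) continue;

        // Wikitext of that revision.
        $q3 = "select * from text where old_id=".(int)$row2->rev_text_id;
        if(!($res3 = $sql_conn->query($q3)) || !($row3 = $res3->fetch_object())) continue;

        // Match each outermost {{...}} block; (?R) recurses into nested templates.
        if(!preg_match_all('/\{\{(?:[^{}]|(?R))*}}/', $row3->old_text, $info)) continue;

        // Take the first template on the page, drop its closing "}}" and split on "|".
        // Note: this also splits on pipes inside nested templates and [[a|b]] links.
        $kvs = explode("|", substr($info[0][0], 0, -2));

        $item = array();

        foreach($kvs as $kv)
        {
            $kv = trim($kv);
            if($kv == "") continue;
            $eq = strpos($kv, "=");
            if($eq === false) continue;   // skips the "{{speciesbox" part, which has no "="
            $key = trim(substr($kv, 0, $eq));
            $val = trim(substr($kv, $eq+1));
            $item[$key] = $val;
        }
        if(sizeof($item) > 0)
        {
            $title = str_replace("_", " ", $row->page_title);
            $data[$title] = $item;
        }
    }

    foreach($data as $page=>$item)
    {
        // $page is the page title, $item its key => value array.
    }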
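One limitation of splitting on every `|` is that values containing nested templates or piped links (such as `[[Johann Friedrich Blumenbach|Blumenbach]]`) get cut apart. A minimal sketch of a depth-aware splitter follows; `parseTemplateParams` is a hypothetical helper name, and it assumes only `{{ }}` and `[[ ]]` nesting (no `{{{triple}}}` braces, no `<nowiki>` sections).

```php
<?php
// Parse "{{name | k = v | ...}}" into [k => v], splitting only on top-level pipes.
function parseTemplateParams(string $tpl): array
{
    $body = substr(trim($tpl), 2, -2);   // strip the outer "{{" and "}}"
    $parts = [];
    $buf = '';
    $depth = 0;                          // nesting depth of {{ }} and [[ ]]
    $len = strlen($body);
    for ($i = 0; $i < $len; $i++) {
        $two = substr($body, $i, 2);
        if ($two === '{{' || $two === '[[') {
            $depth++;
            $buf .= $two;
            $i++;                        // consume both characters
        } elseif ($two === '}}' || $two === ']]') {
            $depth--;
            $buf .= $two;
            $i++;
        } elseif ($body[$i] === '|' && $depth === 0) {
            $parts[] = $buf;             // top-level pipe: parameter boundary
            $buf = '';
        } else {
            $buf .= $body[$i];
        }
    }
    $parts[] = $buf;
    array_shift($parts);                 // the first part is the template name

    $params = [];
    foreach ($parts as $part) {
        $eq = strpos($part, '=');
        if ($eq === false) continue;     // positional parameters are ignored
        $params[trim(substr($part, 0, $eq))] = trim(substr($part, $eq + 1));
    }
    return $params;
}
```

Passing it the first match from the `preg_match_all` call above (in place of the `explode` step) would keep values such as the `binomial_authority` link in one piece.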


欢心

#3 · 2019-05-19 01:17

Instead of reinventing the wheel, check out DBpedia, which has already extracted everything possible from Wikipedia templates and made it public in a variety of easily parsable formats.
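As a rough illustration of what querying DBpedia could look like from PHP, the sketch below builds a SPARQL request that lists every property of one resource and groups the JSON result. The helper names are hypothetical, and the endpoint URL `https://dbpedia.org/sparql` with its `format` parameter is an assumption about the public service; check that the resource name matches the Wikipedia page title before relying on it.

```php
<?php
// Build a SPARQL request URL listing every property/value pair of one DBpedia resource.
// Assumes the public endpoint at dbpedia.org/sparql and that it accepts a "format" parameter.
function buildDbpediaUrl(string $pageTitle): string
{
    $resource = 'http://dbpedia.org/resource/' . $pageTitle;
    $sparql = "SELECT ?property ?value WHERE { <$resource> ?property ?value }";
    return 'https://dbpedia.org/sparql?' . http_build_query([
        'query'  => $sparql,
        'format' => 'application/sparql-results+json',
    ]);
}

// Turn a decoded SPARQL JSON result into [property URI => [values...]].
function groupBindings(array $result): array
{
    $grouped = [];
    foreach ($result['results']['bindings'] ?? [] as $b) {
        $grouped[$b['property']['value']][] = $b['value']['value'];
    }
    return $grouped;
}

// Hypothetical usage (performs a network request):
// $json = json_decode(file_get_contents(buildDbpediaUrl('African_bush_elephant')), true);
// $properties = groupBindings($json);
```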


Ridiculous、

#4 · 2019-05-19 01:26

Use action=parse instead of action=expandtemplates. As you've noticed, expandtemplates only expands a single level; additionally, it won't fully preprocess the input (e.g., it won't successfully handle certain variable references inside templates).
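A minimal sketch of how this might look in PHP: request the rendered HTML with `action=parse`, then read the two-cell rows of the taxobox table. The function names are hypothetical, and the row layout (a table whose class contains `infobox`, with rows like "Kingdom:" / "Animalia") is an assumption about the rendered page that should be checked against the real output.

```php
<?php
// Build the api.php URL for action=parse.
// With formatversion=2 the rendered HTML is returned as a plain string in parse.text.
function buildParseUrl(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action'        => 'parse',
        'page'          => $title,
        'prop'          => 'text',
        'format'        => 'json',
        'formatversion' => 2,
    ]);
}

// Collect two-cell rows from the first table whose class contains "infobox".
function extractTaxoboxRows(string $html): array
{
    $doc = new DOMDocument();
    // The XML prefix forces UTF-8; @ hides warnings about tags libxml does not know.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('(//table[contains(@class, "infobox")])[1]//tr') as $tr) {
        $cells = $xpath->query('./td', $tr);
        if ($cells->length !== 2) continue;          // skip headers and image rows
        $key = rtrim(trim($cells->item(0)->textContent), ':');
        $val = trim($cells->item(1)->textContent);
        if ($key !== '') $rows[$key] = $val;
    }
    return $rows;
}

// Hypothetical usage (performs a network request; Wikipedia may reject
// requests without a User-Agent header, so set one in a stream context):
// $json = json_decode(file_get_contents(buildParseUrl('African_bush_elephant')), true);
// $rows = extractTaxoboxRows($json['parse']['text']);
```

This still parses HTML, but only the taxobox table rather than the whole page, which keeps the fragile part small.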

