Parsing a Wikipedia dump

For example, using this Wikipedia dump:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=true&format=xmlfm

Is there an existing library for Python that I can use to create an array with the mapping of subjects and values?

For example:

{height_ft,6},{nationality, American}

Tags: python mediawiki wikipedia-api mediawiki-api wikimedia-dumps

8 answers

乱世女痞

#2 · 2019-01-11 14:18

It looks like you really want to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built-in XML packages to extract the page content from the API's response, then pass that content to mwlib's parser to produce an object representation that you can browse and analyse in code to extract the information you want. mwlib is BSD licensed.
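A minimal sketch of the first step described above, using only the standard library. The sample response is a hypothetical, trimmed-down version of what the API returns with `format=xml`, and the regex-based `parse_flat_infobox` helper is an illustrative stand-in for a real markup parser such as mwlib, not mwlib's API. It only copes with flat `| key = value` lines, not nested templates.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down API response; real responses carry more attributes.
SAMPLE_RESPONSE = """<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="LeBron James">
        <revisions>
          <rev xml:space="preserve">{{Infobox basketball biography
| name        = LeBron James
| height_ft   = 6
| nationality = American
}}
'''LeBron James''' is an American basketball player.</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>"""


def extract_wikitext(api_xml):
    """Return the wikitext of the first <rev> element in an API response."""
    root = ET.fromstring(api_xml)
    rev = root.find(".//rev")
    return rev.text if rev is not None and rev.text else ""


def parse_flat_infobox(wikitext):
    """Collect '| key = value' lines into a dict (flat templates only)."""
    pattern = r"^\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$"
    return {
        match.group(1): match.group(2)
        for match in re.finditer(pattern, wikitext, re.MULTILINE)
    }


infobox = parse_flat_infobox(extract_wikitext(SAMPLE_RESPONSE))
print(infobox["height_ft"], infobox["nationality"])
```

Values come back as strings, so `height_ft` is `"6"` rather than an integer. Anything with nested templates or links inside values would need a proper wikitext parser.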


够拽才男人

#3 · 2019-01-11 14:19

You're probably looking for Pywikipediabot, which is meant for working with the Wikipedia API.


We Are One

#4 · 2019-01-11 14:19

I would suggest using Beautiful Soup and fetching the Wikipedia page as HTML instead of using the API.

I'll try and post an example.
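A rough sketch of the HTML-scraping idea, under some assumptions: it uses the standard library's `html.parser` as a stand-in for Beautiful Soup so that it runs without extra installs, and the sample markup is a hypothetical, heavily simplified infobox (the real page has many more rows and nested tags). The `InfoboxParser` class is illustrative only.

```python
from html.parser import HTMLParser

# Hypothetical, simplified infobox markup standing in for a downloaded page.
SAMPLE_HTML = """
<table class="infobox vcard">
  <tr><th>Listed height</th><td>6 ft 9 in</td></tr>
  <tr><th>Nationality</th><td><a href="/wiki/United_States">American</a></td></tr>
</table>
<table class="wikitable"><tr><th>Season</th><td>2003</td></tr></table>
"""


class InfoboxParser(HTMLParser):
    """Collect <th>/<td> text pairs from tables whose class includes 'infobox'."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_infobox = False
        self._cell = None  # "th" or "td" while inside a cell, else None
        self._label = ""
        self._value = ""

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            self._in_infobox = "infobox" in classes
        elif self._in_infobox and tag == "tr":
            self._label, self._value = "", ""
        elif self._in_infobox and tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_infobox = False
        elif tag in ("th", "td"):
            self._cell = None
        elif tag == "tr" and self._in_infobox and self._label.strip():
            self.fields[self._label.strip()] = self._value.strip()

    def handle_data(self, data):
        if self._cell == "th":
            self._label += data
        elif self._cell == "td":
            self._value += data


parser = InfoboxParser()
parser.feed(SAMPLE_HTML)
print(parser.fields)
```

This ignores nested tables and rows with more than one value cell, both of which turn up in real infoboxes, so treat it as a starting point.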


放我归山

#5 · 2019-01-11 14:35

Just stumbled over a library on PyPI, wikidump, that claims to provide:

Tools to manipulate and extract data from wikipedia dumps

I haven't used it yet, so you are on your own if you try it...


劫难

#6 · 2019-01-11 14:37

I know the question is old, but I was searching for a library that parses Wikipedia XML dumps. However, the suggested libraries, wikidump and mwlib, don't offer much code documentation. Then I found mediawiki-utilities, which has some code documentation at http://pythonhosted.org/mediawiki-utilities/.


Evening l夕情丶

#7 · 2019-01-11 14:37

There's some information on Python and XML libraries here.

If you're asking whether there is an existing library designed to parse Wiki(pedia) XML specifically and match your requirements, that is doubtful. However, you can use one of the existing XML libraries to traverse the DOM and pull out the data you need.
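As a sketch of the DOM-traversal approach, here is one way it could look with the standard library's `xml.dom.minidom`. The sample XML is a hypothetical two-page excerpt shaped like a MediaWiki export; real dumps declare an XML namespace and carry many more elements per page. The helper names `first_text` and `pages_from_dump` are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical two-page excerpt shaped like a MediaWiki XML export.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>LeBron James</title>
    <revision><text>{{Infobox | height_ft = 6 }}</text></revision>
  </page>
  <page>
    <title>Basketball</title>
    <revision><text>Basketball is a team sport.</text></revision>
  </page>
</mediawiki>"""


def first_text(node, tag):
    """Return the text of the first <tag> descendant of node, or ''."""
    matches = node.getElementsByTagName(tag)
    if not matches:
        return ""
    return "".join(
        child.data
        for child in matches[0].childNodes
        if child.nodeType == child.TEXT_NODE
    )


def pages_from_dump(xml_string):
    """Map each page title to the wikitext of its revision."""
    dom = minidom.parseString(xml_string)
    return {
        first_text(page, "title"): first_text(page, "text")
        for page in dom.getElementsByTagName("page")
    }


pages = pages_from_dump(SAMPLE_DUMP)
print(sorted(pages))
```

A DOM holds the whole document in memory, so this only suits small exports; a full Wikipedia dump would need a streaming parser instead.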

Another option is to write an XSLT stylesheet that does something similar and call it using lxml. This also lets you call Python functions from inside the XSLT, so you get the best of both worlds.

