Get Text Content from MediaWiki page via API

I'm quite new to MediaWiki, and I have a bit of a problem. I have the title of a wiki page, and I want to get just the text of that page using api.php, but all I have found in the API is a way to obtain the wiki content of the page (with wiki markup). I used this HTTP request:

/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=xml&titles=test

But I need only the textual content, without the Wiki markup. Is that possible with the MediaWiki API?

Tags: mediawiki wikipedia-api mediawiki-api

10 answers

不美不萌又怎样

#2 · 2019-01-10 09:08

Python users coming to this question might be interested in the wikipedia module (docs):

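import wikipedia  # third-party package: pip install wikipedia

wikipedia.set_lang('de')            # query the German-language Wikipedia
page = wikipedia.page('Wikipedia')  # fetch the page titled "Wikipedia"
print(page.content)                 # page text with the wiki markup removed

All formatting, except for section headings (==), is stripped away.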

萌系小妹纸

#3 · 2019-01-10 09:10

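Use action=render to get only the rendered page content as HTML, without the surrounding site skin (the result still contains HTML tags):

https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I?action=render

Compare with the regular page:

https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I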
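
A minimal Python sketch of building such a URL, assuming the wiki accepts the action parameter on its normal page URLs as in the example above. The helper name with_render_action is made up for illustration and is not part of MediaWiki or any library:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_render_action(page_url):
    """Return page_url with action=render set in its query string."""
    parts = urlsplit(page_url)
    # Keep existing parameters (e.g. title=...) but drop any previous action.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "action"
    ]
    query.append(("action", "render"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_render_action(
    "https://wiki.eclipse.org/Tip_of_the_Day/Eclipse_Tips/Now_where_was_I"
))
```

The response is HTML, so getting plain text from it still requires removing the tags.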

手持菜刀，她持情操

#4 · 2019-01-10 09:15

You can get the page content as plain text from the API by using the explaintext parameter. If you need several pages, you can fetch them all in a single call by separating the titles with the pipe character |. For example, this API call returns the data for both the "Yahoo" and "Google" pages:

http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exlimit=max&explaintext&exintro&titles=Yahoo|Google&redirects=

Parameters:

explaintext: Return extracts as plain text instead of limited HTML.
exlimit=max: Return more than one result. The max is currently 20.
exintro: Return only the content before the first section. If you want the full data, just remove this.
redirects=: Automatically resolve redirects in the given titles.
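
The call above could be sketched in Python roughly as follows. This is only an illustration using the standard library: the function names, the User-Agent string, and the choice of en.wikipedia.org as the endpoint are assumptions, and the parsing assumes the default format=json layout, where pages are keyed by page ID under query.pages.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: any wiki with the TextExtracts extension installed would work here.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_extracts_url(titles, intro_only=True, api_url=API_URL):
    """Build a prop=extracts query URL for one or more page titles."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exlimit": "max",
        "explaintext": "",  # flag parameter: presence alone enables it
        "redirects": "",
        "titles": "|".join(titles),  # the pipe separates multiple titles
    }
    if intro_only:
        params["exintro"] = ""  # only the text before the first section
    return api_url + "?" + urlencode(params)


def extracts_from_response(data):
    """Map each returned page title to its plain-text extract."""
    pages = data.get("query", {}).get("pages", {})
    return {page["title"]: page.get("extract", "") for page in pages.values()}


def fetch_extracts(titles, intro_only=True):
    """Perform the HTTP request (needs network access)."""
    # Placeholder User-Agent; replace it with one that identifies your client.
    request = Request(
        build_extracts_url(titles, intro_only),
        headers={"User-Agent": "extracts-example/0.1"},
    )
    with urlopen(request) as response:
        return extracts_from_response(json.load(response))
```

Calling fetch_extracts(["Yahoo", "Google"]) would then give a dictionary with one entry per page; pages that do not exist come back with an empty string in this sketch.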

做自己的国王

#5 · 2019-01-10 09:16

I don't think it is possible using the API to get just the text.

What has worked for me was to request the HTML page (using the normal URL that you would use in a browser) and strip out the HTML tags under the content div.

EDIT:

I have had good results using HTML Parser for Java. It has examples of how to strip out HTML tags under a given DIV.
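
For readers not on Java, the same idea can be sketched with Python's standard html.parser module instead of the HTML Parser library mentioned above. This is only a rough illustration: the class and function names are invented, and the id "mw-content-text" for the content div is an assumption that should be checked against the HTML of the wiki in question.

```python
from html.parser import HTMLParser

# Tags that usually start a new block; a space is inserted so words do not run together.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class ContentText(HTMLParser):
    """Collect the text found inside the div with the given id."""

    def __init__(self, content_id="mw-content-text"):  # id is an assumption
        super().__init__()
        self.content_id = content_id
        self.depth = 0    # div nesting depth inside the content div (0 = outside)
        self.skip = 0     # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth and tag in BLOCK_TAGS:
            self.chunks.append(" ")
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif dict(attrs).get("id") == self.content_id:
                self.depth = 1
        elif self.depth and tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.chunks.append(data)


def content_text(html, content_id="mw-content-text"):
    """Return the whitespace-normalised text under the content div."""
    parser = ContentText(content_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Tables, navigation boxes, and edit links inside the content div are kept as text by this sketch, so some post-filtering is usually still needed.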

太酷不给撩

#6 · 2019-01-10 09:17

Wiki pages without any formatting symbols wouldn't really make much sense in many cases.

You can strip out the formatting yourself, if you want, but you'll break some stuff in the process.

(Unless you are creating something like a search engine, in which case you'll only need the text parts and can ignore formatting symbols completely)

smile是对你的礼貌

#7 · 2019-01-10 09:20

The TextExtracts extension of the API does about what you're asking. Use prop=extracts to get a cleaned up response. For example, this link will give you cleaned up text for the Stack Overflow article. What's also nice is that it still includes section tags, so you can identify individual sections of the article.

Just to include a visible link in my answer, the above link looks like:

/api.php?format=xml&action=query&prop=extracts&titles=Stack%20Overflow&redirects=true

Edit: As Amr mentioned, TextExtracts is an extension to MediaWiki, so it won't necessarily be available for every MediaWiki site.
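
One way to find out whether a given wiki has the extension is to ask its API for the list of installed extensions (meta=siteinfo with siprop=extensions). The following is a hedged Python sketch of that check; the function names are invented for illustration, and the example endpoint https://example.org/w/api.php is a placeholder.

```python
from urllib.parse import urlencode


def siteinfo_extensions_url(api_url):
    """Build the URL that asks a wiki which extensions it has installed."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "extensions",
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def has_extension(siteinfo, name="TextExtracts"):
    """True if the decoded siteinfo response lists an extension called name."""
    extensions = siteinfo.get("query", {}).get("extensions", [])
    return any(ext.get("name") == name for ext in extensions)


# Placeholder endpoint; substitute the api.php URL of the wiki being queried.
print(siteinfo_extensions_url("https://example.org/w/api.php"))
```

has_extension expects the JSON response already decoded into a dictionary, for example with json.load on the HTTP response.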
