How to get plain text out of Wikipedia

Published 2019-03-08 11:30

I've been searching for about 2 months now to find a script that gets the Wikipedia description section only. (It's for a bot I'm building, not for IRC.) That is, when I say

/wiki bla bla bla

it will go to the Wikipedia page for bla bla bla, get the following, and return it to the chatroom:

"Bla Bla Bla" is the name of a song made by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"

Here is the closest I've found, but it only gets the URL:

import json
import urllib.request, urllib.parse

def google(searchfor):
  # Build the query string for the Google AJAX Search API.
  query = urllib.parse.urlencode({'q': searchfor})
  url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query

  # Fetch and decode the JSON response.
  search_response = urllib.request.urlopen(url)
  search_results = search_response.read().decode("utf8")
  results = json.loads(search_results)
  data = results['responseData']
  hits = data['results']

  # Return only the URL of the top hit -- not the page text.
  if len(hits) > 0:
    return hits[0]['url']
  else:
    return "No results found."

(Python 3.1)

11 Answers
Animai°情兽 · #2 · 2019-03-08 11:47

You can try the BeautifulSoup HTML parsing library for Python, but you'll have to write a simple parser yourself.

来，给爷笑一个 · #3 · 2019-03-08 11:51

DBpedia is a good fit for this problem. Look at http://dbpedia.org/page/Metallica to see Wikipedia's data organised as RDF. You can query it at http://dbpedia.org/sparql using SPARQL, the query language for RDF. You can always resolve the page ID if you need the descriptive text directly, but this should do for the most part.

There is a learning curve for RDF and SPARQL before you can write anything useful, but the structured data is worth it.

For example, a query run for Metallica returns an HTML table with the abstract in several different languages:

<table class="sparql" border="1">
  <tr>
    <th>abstract</th>
  </tr>
  <tr>
    <td><pre>"Metallica is an American heavy metal band formed..."@en</pre></td>
  </tr>
  <tr>
    <td><pre>"Metallica es una banda de thrash metal estadounidense..."@es</pre></td>
... 

The SPARQL query:

PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbres: <http://dbpedia.org/resource/>

SELECT ?abstract WHERE {
 dbres:Metallica dbpedia-owl:abstract ?abstract.
}

Change "Metallica" to any resource name (the name as it appears in wikipedia.org/wiki/ResourceName) to query for that page's abstract.

聊天终结者 · #4 · 2019-03-08 11:53

Use the MediaWiki API, which runs on Wikipedia. You will have to do some parsing of the data yourself.

For instance:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Bla%20Bla%20Bla

means

fetch (action=query) the content (rvprop=content) of the most recent revision of the page Bla Bla Bla (titles=Bla%20Bla%20Bla) in JSON format (format=json).

You will probably want to search for the query and use the first result, to handle spelling errors and the like.

疯言疯语 · #5 · 2019-03-08 11:53

You can fetch just the first section using the API:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvsection=0&titles=Bla%20Bla%20Bla&rvprop=content

This will give you raw wikitext; you'll have to deal with templates and markup yourself.

Or you can fetch the whole page rendered as HTML, which has its own pros and cons for parsing:

http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Bla_Bla_Bla

I can't see an easy way to get the parsed HTML of the first section in a single call, but you can do it with two calls: pass the wikitext you receive from the first URL back to the second URL with text= in place of page=.

UPDATE

Sorry, I neglected the "plain text" part of your question. Get the part of the article you want as HTML; it's much easier to strip HTML than to strip wikitext!

We Are One · #6 · 2019-03-08 11:55

You can try WikiExtractor: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor

It's for Python 2.7 and 3.3+.

太酷不给撩 · #7 · 2019-03-08 11:56

You can also consume Wikipedia pages through a wrapper API such as JSONpedia. It works both live (asking for the current JSON representation of a wiki page) and storage-based (querying multiple pages previously ingested into Elasticsearch and MongoDB). The output JSON also includes the plain rendered page text.
