Get first lines of Wikipedia Article (answers, page 2)

I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words; it doesn't matter which) from it.

The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly for the print version), but how can I find the first lines that are actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.

For example: Albert Einstein on Wikipedia (print version). Look in the code: the first line of real text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wikitext source, which starts with the same infobox and so on.

So how would you accomplish this task? The programming language is Java, but this shouldn't matter.

One solution that came to mind was to use an XPath query, but such a query would be rather complicated if it had to handle all the edge cases. [update]It wasn't that complicated, see my solution below![/update]

Thanks!

Tags: parsing wikipedia wikipedia-api

9 answers

贪生不怕死

#2 · 2019-02-03 16:53

You need a parser that can read Wikipedia markup. Try WikiText or the parsers that come with XWiki.

That will allow you to ignore anything you don't want (headlines, tables).
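As a rough illustration of what such a parser has to do at a minimum, here is a hand-rolled Python sketch (not WikiText or XWiki, and not a real wikitext parser). It assumes templates are delimited by `{{` and `}}` and may nest, and that file links, tables and headings start at the beginning of a line. The function names `strip_templates` and `first_text_line` are made up for this example. Inline markup such as `'''bold'''` and `[[links]]` is left in the result, and triple-brace parameters are not handled.

```python
def strip_templates(wikitext: str) -> str:
    """Drop {{...}} templates, including nested ones, by tracking brace depth."""
    out = []
    depth = 0
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Keep characters only when we are outside every template.
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)


def first_text_line(wikitext: str) -> str:
    """Return the first non-empty line that is not a file link, table row or heading."""
    skipped_prefixes = ("[[File:", "[[Image:", "{|", "|", "!", "=")
    for line in strip_templates(wikitext).splitlines():
        line = line.strip()
        if line and not line.startswith(skipped_prefixes):
            return line
    return ""


if __name__ == "__main__":
    sample = (
        "{{Infobox scientist\n| name = Albert Einstein\n| image = {{nested|x}}\n}}\n"
        "[[File:Einstein.jpg|thumb]]\n"
        "'''Albert Einstein''' was a theoretical physicist.\n"
    )
    print(first_text_line(sample))
```

A dedicated parser library is still the safer choice for real articles, since wikitext has many more constructs than this sketch skips.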


再贱就再见

#3 · 2019-02-03 16:58

I opened the Albert Einstein article in Firefox and clicked on View source. It's pretty easy to parse using an HTML parser. You should focus on the <p> elements and strip the other HTML from within them.
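A minimal sketch of that idea in Python, using only the standard library's `html.parser`. It assumes that infobox text lives inside `<table>` elements and that footnote markers are `<sup>` elements, which is how the page looked to me but is not guaranteed. The class name `FirstParagraphs` and the helper `first_paragraphs` are made up for this example.

```python
from html.parser import HTMLParser


class FirstParagraphs(HTMLParser):
    """Collect the plain text of the first `limit` non-empty <p> elements."""

    SKIP = {"table", "sup", "style", "script"}  # content ignored entirely

    def __init__(self, limit: int = 1):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.paragraphs = []
        self._buf = None   # text chunks while inside a <p>, otherwise None
        self._skip = 0     # nesting depth inside skipped elements

    def _flush(self):
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())  # normalize whitespace
            if text:
                self.paragraphs.append(text)
            self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag == "p" and self._skip == 0:
            self._flush()  # an unclosed <p> ends where the next one starts
            if len(self.paragraphs) < self.limit:
                self._buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip > 0:
            self._skip -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None and self._skip == 0:
            self._buf.append(data)


def first_paragraphs(html: str, limit: int = 1) -> list:
    parser = FirstParagraphs(limit)
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Calling `first_paragraphs(page_html, limit=2)` on the downloaded page source would return up to two paragraphs as plain strings; empty `<p>` elements are skipped because their text is blank after whitespace normalization.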


何必那么认真

#4 · 2019-02-03 17:00

I had the same need and wrote some Python code to do it.

The script downloads the Wikipedia article with the given name, parses it using BeautifulSoup and returns the first few paragraphs.

The code is at http://github.com/anandology/sandbox/blob/master/wikisnip/wikisnip.py.

