I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words, it doesn't matter) from the article.
The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly on the print version), but how can I find the first lines displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.
For example: Albert Einstein on Wikipedia (print version). Look at the code: the first line of real text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wiki source; it starts with the same infobox and so on.
So how would you accomplish this task? The programming language is Java, but this shouldn't matter.
A solution that came to my mind was to use an XPath query, but this query would be rather complicated if it had to handle all the edge cases. [update]It wasn't that complicated, see my solution below![/update]
Thanks!
Well, when using the Wiki source itself you could just strip out all templates at the start. This might work well enough for most articles that have infoboxes or some messages at the top.
However, some articles might put the starting blurb into a template itself so that would be a little difficult there.
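A minimal sketch of the template-stripping idea, under some assumptions: it only handles `{{...}}` templates (including nested ones) at the very start of the wikitext, and it ignores other leading markup such as image links or `{{{parameter}}}` syntax. The class and method names are hypothetical.

```java
class LeadingTemplateStripper {
    // Removes every top-level {{...}} block (and surrounding whitespace)
    // that appears before the first piece of ordinary text.
    static String stripLeadingTemplates(String wikitext) {
        int pos = 0;
        while (true) {
            // Skip whitespace between templates.
            while (pos < wikitext.length()
                    && Character.isWhitespace(wikitext.charAt(pos))) {
                pos++;
            }
            if (!wikitext.startsWith("{{", pos)) {
                break; // ordinary text (or end of input) reached
            }
            // Walk to the matching "}}", counting nested templates.
            int depth = 0;
            int i = pos;
            while (i < wikitext.length()) {
                if (wikitext.startsWith("{{", i)) {
                    depth++;
                    i += 2;
                } else if (wikitext.startsWith("}}", i)) {
                    depth--;
                    i += 2;
                    if (depth == 0) {
                        break;
                    }
                } else {
                    i++;
                }
            }
            if (depth != 0) {
                break; // unbalanced braces: leave the rest untouched
            }
            pos = i;
        }
        return wikitext.substring(pos);
    }
}
```

As noted above, this does not help when the opening blurb itself is produced by a template.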
Another way, perhaps more reliable, would be to take the contents of the first <p> tag that appears directly in the article text (that is, not nested in a table or similar). This should strip out infoboxes and other stuff at the start, as those are probably (I'm not exactly sure) <table>s or <div>s.
Generally, Wikipedia is written for human consumption, with only very minimal support for anything semantic. That makes automatic extraction of specific information from the articles pretty painful.
For example, if you have the result in a string, you would find the text:
and after that index you would find the first
which would be the index of the first paragraph you mentioned.
Try this URL: Link to the content (just works in the browser)
You don't need to.
The API's exintro parameter returns only the first (zeroth) section of the article.
Example: api.php?action=query&prop=extracts&exintro&explaintext&titles=Albert%20Einstein
There are other parameters, too:
exchars: Length of extracts in characters.
exsentences: Number of sentences to return.
exintro: Return only zeroth section.
exsectionformat: What section heading format to use for plaintext extracts.
exlimit: Maximum number of extracts to return. Because excerpts generation can be slow, the limit is capped at 20 for intro-only extracts and 1 for whole-page extracts.
explaintext: Return plain-text extracts.
excontinue: When more results are available, use this parameter to continue.
Source: https://www.mediawiki.org/wiki/Extension:MobileFrontend#prop.3Dextracts
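Since the question mentions Java, here is a minimal sketch of calling this API from Java 11+. The host pattern `<lang>.wikipedia.org/w/api.php`, the added `format=json` parameter, the User-Agent string, and the class and method names are assumptions for illustration. The response is returned as a raw JSON string; extracting the text from it needs a JSON library of your choice.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class ExtractQuerySketch {
    // Builds a prop=extracts query for the intro section as plain text.
    // "lang" is the wiki language code, e.g. "en" or "de" (assumed host pattern).
    static String buildUrl(String lang, String title, int sentences) {
        return "https://" + lang + ".wikipedia.org/w/api.php"
                + "?action=query&prop=extracts&exintro&explaintext"
                + "&format=json"
                + "&exsentences=" + sentences
                + "&titles=" + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    // Sends the request and returns the raw JSON response body.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "intro-extract-sketch/0.1") // placeholder value
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

For example, `ExtractQuerySketch.fetch(ExtractQuerySketch.buildUrl("en", "Albert Einstein", 2))` should request the first two sentences of the introduction.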
I worked out the following solution: using an XPath query on the XHTML source code (I took the print version because it is shorter, but it also works on the normal version).
This works on both German and English Wikipedia, and I haven't found an article where it doesn't output the first paragraph. The solution is also quite fast. I also thought of taking only the first x characters of the XHTML, but this would render the XHTML invalid.
If someone is looking for the Java code, here it is:
Use it by calling
getPlainSummary("http://de.wikipedia.org/wiki/Uma_Thurman");
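A minimal sketch of the XPath idea, not the original listing. It assumes the page is well-formed XHTML and that the article body sits in a `div` with the id `bodyContent`; both the query and the class and method names are assumptions. It takes the markup as a string rather than a URL, so downloading the page is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FirstParagraphSketch {
    // Assumed query: the first <p> that is a direct child of the content
    // container, so paragraphs nested in infobox tables are skipped.
    static final String QUERY = "//div[@id='bodyContent']/p[1]";

    static String firstParagraph(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // With a String result, evaluate() returns the text content
            // of the first node matched by the query.
            return xpath.evaluate(QUERY, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the XHTML", e);
        }
    }
}
```

Because the text content of the matched node is returned, inline markup such as links and bold text inside the paragraph is dropped and only the plain text remains.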
Wikipedia offers an Abstracts download. While this is quite a large file (currently 2.5 GB), it offers exactly the info you want, for all articles.
As you expect, you will probably have to end up parsing the source, the compiled HTML, or both. However, the Wikipedia:Lead_section may give you some indication of what to expect in well-written articles.