If I have some XML containing MediaWiki markup like the following:
" ...collected in the 12th century, of which [[Alexander the Great]] was the hero, and in which he was represented, somewhat like the British [[King Arthur|Arthur]]"
what would be the appropriate arguments to something like:
re.findall([[__?__]], article_entry)
I am stumbling a bit on escaping the double square brackets, and getting the proper link for text like: [[Alexander of Paris|poet named Alexander]]
RegExp: \w+( \w+)+(?=]])
input
[[Alexander of Paris|poet named Alexander]]
output
poet named Alexander
input
[[Alexander of Paris]]
output
Alexander of Paris
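For what it's worth, here is a small Python sketch of how that pattern behaves on the two inputs above. Because the pattern contains a capturing group, `re.findall` would return only the last repetition of that group, so the sketch reads the whole match with `search(...).group(0)` instead. The helper name `link_text` is made up for illustration. Also note that `( \w+)+` requires at least two space-separated words, so a one-word link such as `[[Arthur]]` is not matched.

```python
import re

# Pattern from the answer: a run of space-separated words directly before "]]".
LINK_TEXT = re.compile(r"\w+( \w+)+(?=]])")


def link_text(markup):
    """Return the words matched just before the closing brackets, or None."""
    match = LINK_TEXT.search(markup)
    return match.group(0) if match else None


print(link_text("[[Alexander of Paris|poet named Alexander]]"))  # poet named Alexander
print(link_text("[[Alexander of Paris]]"))                       # Alexander of Paris
print(link_text("[[Arthur]]"))                                   # None (single word)
```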
If you are trying to get all the links from a page, of course it is much easier to use the MediaWiki API if at all possible, e.g. http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Stack_Overflow_(website).
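As a rough sketch, the same query URL can be assembled in Python with the standard library. The function name `links_query_url` is hypothetical, and the `format=json` parameter is an addition that is not part of the URL above; the sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def links_query_url(title):
    """Build a prop=links query URL for a single page title."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "format": "json",  # assumption: ask for JSON rather than the default output
    }
    # urlencode percent-escapes characters such as the parentheses in the title.
    return API_ENDPOINT + "?" + urlencode(params)


print(links_query_url("Stack_Overflow_(website)"))
```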
Note that both these methods miss links embedded in templates.
Here is an example
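A minimal sketch of one way to do it (not necessarily the code this answer originally showed): escape each square bracket with a backslash, capture everything between `[[` and `]]` non-greedily, and split off the part after `|` outside the regex. The sample text is the sentence from the question; the variable names are illustrative.

```python
import re

article_entry = (
    "...collected in the 12th century, of which [[Alexander the Great]] "
    "was the hero, and in which he was represented, somewhat like the "
    "British [[King Arthur|Arthur]]"
)

# Each literal bracket is escaped; the non-greedy group stops at the first "]]".
raw_links = re.findall(r"\[\[(.*?)\]\]", article_entry)
print(raw_links)  # ['Alexander the Great', 'King Arthur|Arthur']

# Split "target|label" outside the regex; a link without "|" is its own label.
links = []
for raw in raw_links:
    target, _, label = raw.partition("|")
    links.append((target, label or target))
print(links)
# [('Alexander the Great', 'Alexander the Great'), ('King Arthur', 'Arthur')]
```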
Version 2 puts more into the regex but, as a result, changes the output:
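A sketch of what such a regex might look like, assuming the goal is to let the pattern itself separate the link target from the optional text after `|`. With two capturing groups, `re.findall` returns tuples rather than plain strings, which is the change in output. The name `WIKILINK` is illustrative.

```python
import re

article_entry = (
    "...collected in the 12th century, of which [[Alexander the Great]] "
    "was the hero, and in which he was represented, somewhat like the "
    "British [[King Arthur|Arthur]]"
)

# Group 1: link target (no "|" or "]"); group 2: optional text after "|".
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]*))?\]\]")

# With two groups, findall yields (target, label) tuples; a missing label is "".
pairs = WIKILINK.findall(article_entry)
print(pairs)
# [('Alexander the Great', ''), ('King Arthur', 'Arthur')]
```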
Version 3, if you only want the link without the title.
Would give the output
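As a hedged sketch of a link-only variant (an assumption about the approach, not the answer's original code): keep a single capturing group for the target and make the `|...` part a non-capturing group, so `re.findall` returns plain strings again. The result for the question's sample sentence is shown in the closing comment.

```python
import re

article_entry = (
    "...collected in the 12th century, of which [[Alexander the Great]] "
    "was the hero, and in which he was represented, somewhat like the "
    "British [[King Arthur|Arthur]]"
)

# Only the target is captured; the optional "|label" part is matched but discarded.
WIKILINK_TARGET = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]")

targets = WIKILINK_TARGET.findall(article_entry)
print(targets)
# ['Alexander the Great', 'King Arthur']
```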