I'm trying to convert wikitext into plain text using Python regular expression substitution. There are two formatting rules for wiki links.
Here is some text that gives me a headache.
The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.
The text above should be changed into:
The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.
The conflict between the [[ ]] and [[ | ]] syntaxes is my main problem. I don't need a single complex regular expression; applying several (maybe two) substitutions in sequence is fine.
Please enlighten me on this problem.
import re
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
def strip_wikilinks(the_string):
    return wikilink_rx.sub(r'\1', the_string)
Example: http://ideone.com/7oxuz
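In case the link above goes stale, here is a minimal self-contained sketch of the same idea applied to the sentence from the question. The function name `strip_wikilinks` and the variable `sample` are just illustrative names, not part of any library.

```python
import re

# The optional "target|" prefix is matched but not captured;
# group 1 is the text that should remain visible.
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')

def strip_wikilinks(the_string):
    """Replace [[target]] and [[target|label]] with their visible text."""
    return wikilink_rx.sub(r'\1', the_string)

sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")
print(strip_wikilinks(sample))
```

The trailing "s" in `[[cover version]]s` sits outside the brackets, so it is simply left in place after the substitution.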
Note: you may also find some MediaWiki parsers in http://www.mediawiki.org/wiki/Alternative_parsers.
You're going down the wrong path. Wiki markup is notoriously hard to parse, and there are so many exceptions, edge cases and just plain busted markup that building your own regexps to do it is near-impossible. Since you're using Python, I'd suggest mwlib, which will do the hard work for you:
http://code.pediapress.com/wiki/wiki/mwlib
This should work:
import re
text = "The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally."
newText = re.sub(r'\[\[([^\|\]]+\|)?([^\]]+)\]\]',r'\2',text)
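Since the question says that several substitutions in sequence are acceptable, the same result can also be sketched as two simpler passes: one for piped links, then one for plain links. The function name `strip_wikilinks_two_pass` is made up for this example.

```python
import re

def strip_wikilinks_two_pass(text):
    # Pass 1: [[target|label]] -> label
    text = re.sub(r'\[\[[^|\]]*\|([^\]]+)\]\]', r'\1', text)
    # Pass 2: [[target]] -> target
    return re.sub(r'\[\[([^|\]]+)\]\]', r'\1', text)

text = ("The CD is composed almost entirely of [[cover version]]s of "
        "[[The Beatles]] songs which George Martin "
        "[[record producer|produced]] originally.")
print(strip_wikilinks_two_pass(text))
```

Because neither character class allows "]" or crosses a "|" in the target part, the first pass cannot accidentally span two separate links.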
I came up with a regex which should do the trick. Let me know if there's anything wrong with it:
r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"
(Ick, I will never get over how ugly these things are!)
Group 1 should give you the wiki link. Group 4 should give you the link text, or None if there is no pipe.
An explanation:
(([^\]|]|\](?=[^\]]))*)
finds all sequences of characters which are not "|" or "]]". It does this by finding all sequences of characters which are not "|" or "]" OR which are a "]" followed by a character which is not a "]".
(\|(([^\]]|\](?=[^\]]))*))?
optionally matches a "|" followed by the same regex as above, to get the link text part. The regex is slightly changed in that it allows "|" characters.
- Obviously the whole thing is surrounded in \[\[ ... \]\].
- The (?=...) notation matches a regex but doesn't consume its characters, so they can be matched subsequently. I use it so as not to consume a "|" character which may appear immediately after a "]".
Edit: I fixed the regex to allow a "]" immediately before the "|", as in [[abcd]|efgh]].
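To turn the groups described above into an actual substitution, one option is to pass a function to `re.sub` that returns group 4 when a pipe is present and group 1 otherwise. This is only a sketch; the names `link_rx`, `visible_text` and `replace_links` are invented for the example.

```python
import re

link_rx = re.compile(
    r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"
)

def visible_text(match):
    # Group 1 is the link target; group 4 is the link text,
    # or None when the link has no pipe.
    target, label = match.group(1), match.group(4)
    return target if label is None else label

def replace_links(text):
    return link_rx.sub(visible_text, text)

print(replace_links("[[cover version]]s of [[The Beatles]] songs"))
print(replace_links("George Martin [[record producer|produced]] originally"))
print(replace_links("[[abcd]|efgh]]"))
```

A replacement function is used instead of a template like `r'\4'` because group 4 does not take part in the match for links without a pipe.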