Python regular expression with wiki text

Posted 2019-07-07 11:13

Question:

I'm trying to change wikitext into normal text using Python regular expression substitution. There are two formatting rules for wiki links.

  • [[Name of page]]
  • [[Name of page | Text to display]]

    (http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet)

Here is some text that gives me a headache.

The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.

The text above should be changed into:

The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.

The conflict between [[ ]] and [[ | ]] grammar is my main problem. I don't need one complex regular expression. Applying multiple (maybe two) regular expression substitution(s) in sequence is ok.

Please enlighten me on this problem.

Answer 1:

import re
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
def strip_wikilinks(the_string):
    return wikilink_rx.sub(r'\1', the_string)

Example: http://ideone.com/7oxuz

Note: you may also find some MediaWiki parsers in http://www.mediawiki.org/wiki/Alternative_parsers.
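
To check the pattern from this answer against the sentence in the question, a minimal sketch follows. The second pattern, wikilink_ws_rx, is not part of the answer: it is a hypothetical variant that also trims spaces around the display text, assuming links may be written with spaces around the pipe as in the question's second rule.

```python
import re

# Pattern from the answer above: the optional "target|" prefix is matched
# but not captured, so group 1 is always the text to keep.
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')

# Hypothetical variant (not from the answer): also drops whitespace around
# the display text, as in "[[Name of page | Text to display]]".
wikilink_ws_rx = re.compile(r'\[\[(?:[^|\]]*\|)?\s*([^\]]+?)\s*\]\]')

sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")
plain = wikilink_rx.sub(r'\1', sample)
print(plain)

spaced = "See [[Name of page | Text to display]] here."
untrimmed = wikilink_rx.sub(r'\1', spaced)   # keeps the space after the pipe
trimmed = wikilink_ws_rx.sub(r'\1', spaced)  # removes it
print(repr(untrimmed))
print(repr(trimmed))
```

Both patterns give the same result on the question's sentence; they differ only when the display text is padded with spaces.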



Answer 2:

You're going down the wrong path. Wiki markup is notoriously hard to parse, and there are so many exceptions, edge cases and just plain busted markup that building your own regexps to do it is near-impossible. Since you're using Python, I'd suggest mwlib, which will do the hard work for you:

http://code.pediapress.com/wiki/wiki/mwlib



Answer 3:

This should work:

import re
text = "The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally."
newText = re.sub(r'\[\[([^\|\]]+\|)?([^\]]+)\]\]',r'\2',text)


Answer 4:

I came up with a regex which should do the trick. Let me know if there's anything wrong with it:

r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"

(Ick, I will never get over how ugly these things are!)

Group 1 should give you the wiki link. Group 4 should give you the link text, or None if there is no pipe.

An explanation:

  • (([^\]|]|\](?=[^\]]))*) finds all sequences of characters which are not "|" or "]]". It does this by finding all sequences of characters which are not "|" or "]" OR which are a "]" followed by a character which is not a "]".
  • (\|(([^\]]|\](?=[^\]]))*))? optionally matches a "|" followed by the same regex as above, to get the link text part. The regex is slightly-changed in that it allows "|" characters.
  • Obviously the whole thing is surrounded in \[\[ ... \]\].
  • The (?=...) notation matches a regex but doesn't consume its characters, so they can be matched subsequently. I use it so as not to consume a "|" character which may appear immediately after a "]".

Edit: I fixed the regex to allow a "]" immediately before the "|", as in [[abcd]|efgh]].
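
One way to use groups 1 and 4 for the substitution is to pass a function to re.sub, sketched below. The names link_rx and display_text are made up for this illustration, and falling back to the link target when there is no pipe is an assumption based on the output the question asks for.

```python
import re

# The pattern from this answer, unchanged.
link_rx = re.compile(
    r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"
)

def display_text(match):
    # Group 4 is the text after the pipe; it is None when the link has no
    # pipe, in which case the link target (group 1) is shown instead.
    label = match.group(4)
    return label if label is not None else match.group(1)

sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")
plain = link_rx.sub(display_text, sample)
print(plain)

# The case from the edit: a single "]" right before the pipe stays in group 1.
edge = link_rx.match("[[abcd]|efgh]]")
print(edge.group(1), edge.group(4))
```

A replacement function is used here instead of a template such as r'\4' because group 4 does not take part in the match when the pipe is absent.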