How can I find all Markdown links using regular ex

Posted 2019-04-07 13:16

Question:

In Markdown there are two ways to place a link. One is to type the raw link, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com].

I'm trying to write a regular expression that can match both of these and, for the second form, also capture the display string.

So far I have this:

(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])

But this doesn't seem to match either of my two test cases in Debuggex:

http://example.com
(Example)[http://example.com]

I'm really not sure why the first one, at the very least, isn't matched. Is it something to do with my use of the named group? If possible I'd like to keep using it: this is a simplified expression for matching the link, and the real one is too long for me to feel comfortable duplicating it in two places in the same pattern.

What am I doing wrong? Or is this not doable at all?

EDIT: I'm doing this in Python so will be using their regex engine.

Answer 1:

Your pattern fails because of this part: (?<=\((.*)\)\[). Python's re module does not allow variable-length lookbehind, and .* has no fixed width.
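As a quick check of that limitation, the sketch below tries to compile a lookbehind containing `.*` with the standard `re` module and compares it with a fixed-width one. The helper name `compiles` and the two shortened patterns are made up for illustration; they are not the pattern from the question.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re rejects lookbehinds whose width is not fixed
        return False

# Hypothetical, shortened patterns for illustration only.
variable_width = r"(?<=\((.*)\)\[)http"   # lookbehind contains .*
fixed_width = r"(?<=\[)http"              # lookbehind is exactly one character

print(compiles(variable_width), compiles(fixed_width))
```

The first pattern is rejected at compile time, so no input can ever be matched by it, which is consistent with neither test case matching.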

You can get what you want more conveniently with the third-party regex module for Python, since the re module has few features in comparison.

Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])

pattern details:

(?|                                       # open a branch reset group
    # first case there is only the url
    (?<txt>                               # in this case, the text and the url  
        (?<url>                           # are the same
            (?:ht|f)tps?://\S+(?<=\P{P})
        )
    )
  |                                       # OR
    # the (text)[url] format
    \( ([^)]+) \)                         # this group will be named "txt" too 
    \[ (\g<url>) \]                       # this one "url"
)

This pattern uses the branch reset feature (?|...|...|...), which preserves capturing group names (or numbers) across an alternation. Since the ?<txt> group is opened first in the first member of the alternation, the first group in the second member automatically gets the same name. The same goes for the ?<url> group.

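\g<url> is a reference to the named subpattern ?<url> (like an alias, so there is no need to rewrite it in the second member). This is the PCRE-style spelling; the regex module documents (?&url) for calling a named subpattern and treats \g<url> as a backreference, so the pattern may need (?&url) there.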
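For contrast, the (?P=href) construct used in the question is a backreference: it matches the same text the group already captured and does not re-run the group's subpattern. A minimal sketch with the standard `re` module, using a made-up group name `word`:

```python
import re

# (?P=word) must repeat the exact text captured by (?P<word>...);
# it does not mean "match [a-z]+ again".
backref = re.compile(r"(?P<word>[a-z]+) (?P=word)")

repeated = backref.fullmatch("abc abc") is not None   # same text twice
different = backref.fullmatch("abc xyz") is not None  # fits [a-z]+, but different text

print(repeated, different)
```

In the question's pattern the href group sits in the other branch of the alternation, so when the second branch is tried that group has captured nothing and the backreference has no text to repeat.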

(?<=\P{P}) checks if the last character of the url is not a punctuation character (useful to avoid the closing square bracket for example). (I'm not sure of the syntax, it may be \P{Punct})
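If only the standard `re` module is available, the same idea can be approximated without branch reset or subpattern calls. The sketch below is a swapped-in technique, not the pattern above. It keeps the URL subpattern in a Python string so it is written once, uses distinct group names per branch, and replaces \P{P} with an explicit, fixed-width list of trailing punctuation. The names `URL`, `LINK`, `find_links` and the group name `raw` are assumptions made for illustration.

```python
import re

# Assumption: a simplified URL subpattern, defined once and reused.
# The negative lookbehind drops a few trailing punctuation characters.
URL = r"(?:ht|f)tps?://[^\s\]]+(?<![.,;:!?)\]])"

LINK = re.compile(
    rf"\((?P<txt>[^)]+)\)\[(?P<url>{URL})\]"  # the (text)[url] format
    rf"|(?P<raw>{URL})"                       # OR only the url
)

def find_links(text):
    """Return (display_text, url) pairs; a raw link is its own text."""
    links = []
    for m in LINK.finditer(text):
        if m.group("raw") is not None:
            links.append((m.group("raw"), m.group("raw")))
        else:
            links.append((m.group("txt"), m.group("url")))
    return links

sample = "See (Stack Overflow)[http://example.com] or http://example.org."
for txt, url in find_links(sample):
    print(txt, "->", url)
```

Interpolating the string keeps a long URL expression in one place in the source, at the cost of compiling it twice inside the pattern.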