Question:

I'm trying to replace URLs contained in a block of HTML that users post into an old web app with proper anchors (<a>) for those URLs.

The problem is that some URLs may already be 'anchored', that is, contained in <a> elements. Those URLs should not be replaced.

Example:

  <a href="http://noreplace.com">http://noreplace.com</a>         <- do not replace
  <a href="http://noreplace.com"><u>http://noreplace.com</u></a>  <- do not replace
  <a href="...">...</a>http://replace.com                         <- replace

What would a regex that matches only 'non-anchored' URLs look like?

I use the following function to do regex replacements:

Function ReplaceRegExp(strString, strPattern, strReplace)

    Dim RE: Set RE = New RegExp

    With RE
        .Pattern = strPattern
        .IgnoreCase = True
        .Global = True
        ReplaceRegExp = .Replace(strString, strReplace)
    End With

End Function

The following non-greedy regex is used to format UBB URLs. Can it be adapted to match only the URLs I need?

' The doubled double quote in the brackets is there because
' doubling is how a double quote is escaped inside an ASP (VBScript) string
strString = ReplaceRegExp(strString, "\[URL=[""]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?[""]?\](.*?)\[/URL\]", "<a href=""$1$2$3$5"" target=""_blank"">$6</a>")
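
For anyone who wants to experiment with the UBB pattern outside ASP, here is a rough Python port. The name `ubb_to_anchor` and the sample input are made up for illustration. The pattern itself is the one above, with the doubled double quotes collapsed to a single `"` and `$n` written as `\n`.

```python
import re

# The UBB pattern from the question; "" in the VBScript string is a single " here.
UBB_URL = re.compile(
    r'\[URL=["]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)'
    r'([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?["]?\](.*?)\[/URL\]',
    re.IGNORECASE,  # mirrors .IgnoreCase = True
)

def ubb_to_anchor(text):
    # Groups 1, 2, 3 and 5 rebuild the URL; group 6 is the link text.
    return UBB_URL.sub(r'<a href="\1\2\3\5" target="_blank">\6</a>', text)

print(ubb_to_anchor('[URL=http://example.com/page]Example[/URL]'))
```

Like `.Global = True` in the VBScript helper, `sub` replaces every occurrence in the input.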

If this really cannot be done with a regex, what would the solution look like in ASP Classic? Some code or pseudocode would be appreciated. That said, I would much rather keep the code simple by adding one more regex line than add more functions to this old code.

Thanks for your effort!

Answer 1:

Regular expressions alone seemed too complex for this kind of job, so I dusted off my rusty VBScript skills and wrote a function that first swaps the existing anchors for placeholders, then replaces the remaining URLs, and finally puts the original anchors back.

Here it is in case somebody needs it:

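Function Linkify(ByVal Text)
    ' ByVal keeps the caller's variable intact; VBScript passes ByRef by default.

    Dim regEx, Match, Matches, patternURLs, patternAnchors, lCount, anchorCount, replacements, key

    patternURLs = "((http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)"
    ' \b stops tags such as <abbr> from matching; [\s\S] lets an anchor span line breaks.
    patternAnchors = "<a\b[^>]*>[\s\S]*?</a>"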

    Set replacements=Server.CreateObject("Scripting.Dictionary")

    ' Create the regular expression.
    Set regEx = New RegExp
    regEx.Pattern = patternAnchors
    regEx.IgnoreCase = True
    regEx.Global = True

    ' Do the search for anchors.
    Set Matches = regEx.Execute(Text)

    lCount = 0

    ' Iterate through the existing anchors and replace with a placeholder
    For Each Match in Matches
      key = "<#" & lCount & "#>"
      replacements.Add key, Match.Value
      Text = Replace(Text,Cstr(Match.Value),key)
      lCount = lCount+1
    Next

    anchorCount = lCount

    ' Now search for the remaining (bare) URLs
    regEx.Pattern = patternURLs

    ' create anchors from URLs
    Text = regEx.Replace(Text, "<a href=""$1"">$1</a>")

    ' put back the originally existing anchors
    For lCount = 0 To anchorCount-1
        key = "<#" & lCount & "#>"
        Text = Replace(Text,key, replacements.Item(key))
    Next

    Linkify = Text

End Function
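
The same placeholder technique can be sketched in Python for readers who want to try it outside ASP. This is only an illustration under a few assumptions: the names `linkify`, `stash` and `restore` are made up, and the anchor pattern uses `\b` and `re.DOTALL` so that tags such as `<abbr>` are skipped and anchors spanning several lines are caught.

```python
import re

# URL pattern from the answer, unchanged apart from Python string syntax.
URL = re.compile(
    r'((http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)'
    r'([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)',
    re.IGNORECASE,
)
# Assumption: \b and DOTALL are additions, see the lead-in above.
ANCHOR = re.compile(r'<a\b[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)
PLACEHOLDER = re.compile(r'<#(\d+)#>')

def linkify(text):
    saved = []

    def stash(match):
        # Swap each existing anchor for a numbered placeholder.
        saved.append(match.group(0))
        return '<#%d#>' % (len(saved) - 1)

    def restore(match):
        index = int(match.group(1))
        # Leave look-alike text such as "<#99#>" alone if nothing was stashed for it.
        return saved[index] if index < len(saved) else match.group(0)

    text = ANCHOR.sub(stash, text)
    # Only bare URLs are left at this point, so wrap each of them.
    text = URL.sub(r'<a href="\1">\1</a>', text)
    # Put the original anchors back.
    return PLACEHOLDER.sub(restore, text)

print(linkify('<a href="...">...</a>http://replace.com'))
```

As in the VBScript version, input that already contains text shaped like a placeholder (for example a literal `<#0#>`) would be mistaken for one, so a less likely marker may be worth choosing.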

Answer 2:

The answer you're looking for lies in negative and positive lookaheads and lookbehinds.

This article gives a pretty good overview: http://www.regular-expressions.info/lookaround.html

Here's the regular expression I came up with for your case:

(?<!"|>)(ht|f)tps?://.*?(?=\s|$)

Here's some sample data I matched against:

#Matches
http://www.website.com
https://www.website.com
This is a link http://www.website.com that is not linked
This is a long link http://www.website.com/index.htm?foo=bar
ftp://www.website.com

#No Matches
<u>http://www.website.com</u>
<a href="http://www.website.com">http://website.com</a>
<a href="https://www.website.com">http://website.com</a>
<a href="http://www.website.com"><u>http://www.website.com</u></a>
<a href="ftp://www.website.com">ftp://www.website.com</a>
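
The sample data can be checked with a short script. One caveat: the RegExp object available to VBScript does not support lookbehind, so this pattern needs a different engine. The sketch below uses Python's `re` module, and the names `UNANCHORED`, `should_match` and `should_not_match` are made up for illustration.

```python
import re

# Pattern from the answer, unchanged.
UNANCHORED = re.compile(r'(?<!"|>)(ht|f)tps?://.*?(?=\s|$)')

should_match = [
    'http://www.website.com',
    'https://www.website.com',
    'This is a link http://www.website.com that is not linked',
    'This is a long link http://www.website.com/index.htm?foo=bar',
    'ftp://www.website.com',
]
should_not_match = [
    '<u>http://www.website.com</u>',
    '<a href="http://www.website.com">http://website.com</a>',
    '<a href="https://www.website.com">http://website.com</a>',
    '<a href="http://www.website.com"><u>http://www.website.com</u></a>',
    '<a href="ftp://www.website.com">ftp://www.website.com</a>',
]

for line in should_match:
    # Every line in this list contains a bare URL, so search() finds one.
    print('match   :', UNANCHORED.search(line).group(0))
for line in should_not_match:
    # search() returns None when no bare URL is found.
    print('no match:', line, UNANCHORED.search(line))
```

Note that a URL placed directly after a closing tag, such as `<a href="...">...</a>http://replace.com` from the question, is rejected as well, because the character in front of it is `>`.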

Here's a breakdown of what the regular expression is doing:

(?<!"|>) A negative lookbehind that makes sure the match isn't preceded by a " or a >.

(ht|f)tps?://.*? This looks for http, https, or ftp and anything following it. It will also match ftps! If you want to avoid that, you could use (https?|ftp)://.*? instead.

(?=\s|$) A positive lookahead that requires the match to be followed by whitespace or the end of the line.

EXTRA CREDIT

(ht)?(?(1)tps?|ftp):// This matches http, https, and ftp but not ftps. It's a bit overkill when you can use (https?|ftp):// instead, but it's an awesome example of if/else in regex.
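
The conditional can be tried out in an engine that supports it. Python's `re` module is one such engine, while VBScript's RegExp is not. In this small sketch the name `SCHEME` and the sample URLs are made up for illustration.

```python
import re

# If group 1 ("ht") matched, continue with "tp" or "tps"; otherwise require "ftp".
SCHEME = re.compile(r'(ht)?(?(1)tps?|ftp)://')

for url in ['http://a', 'https://a', 'ftp://a', 'ftps://a']:
    # match() anchors at the start of the string.
    print(url, bool(SCHEME.match(url)))
```

The first three URLs match and `ftps://a` does not, because after `ftp` the pattern requires `://` immediately.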

Answer 3:

Some design issues you're going to have to work around:

1. Embedded URLs could be absolute or relative and may not include the protocol.
2. Your HTML may not have quotes around attribute values.
3. The character right after a URL may also be a valid URL character.
4. There are lots of valid URL characters these days.

If you can assume (1) absolute URLs with protocols and (2) quoted HTML attributes and (3) people will have whitespace after a URL and (4) you're sticking with supporting only basic URL characters, you can just look for URLs not preceded by a double-quote.

Here's an overly simple example to start with (untested):

(?<!")((http|https|ftp)://[^\s<>]+)(?=\s|$)  replaced with <a href="$1">$1</a>

The [^\s<>]+ part above is ridiculously greedy, so all of the fun will be in tweaking it to build a character set that fits the URLs your users are typing in. Your example shows a much more involved character class with \w plus a hodge-podge of other allowed characters, so you could start there if you want.
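
A minimal sketch of this idea in Python, assuming a `+` quantifier on the character class (without one, the class matches only a single character after `://`). The names `BARE_URL` and `linkify_simple` are made up for illustration, and Python is used because VBScript's RegExp does not support lookbehind.

```python
import re

# Not preceded by a double quote, followed by whitespace or the end of the input.
BARE_URL = re.compile(r'(?<!")((http|https|ftp)://[^\s<>]+)(?=\s|$)')

def linkify_simple(text):
    # \1 is the whole URL, the equivalent of $1 in the replacement above.
    return BARE_URL.sub(r'<a href="\1">\1</a>', text)

print(linkify_simple('see http://replace.com now'))
print(linkify_simple('<a href="http://noreplace.com">http://noreplace.com</a>'))
```

The anchored URL in the second call is left alone only because `</a>` follows it directly. If whitespace came before the closing tag, the link text would be wrapped a second time, which is one of the assumptions listed above showing through.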