RegEx: replace all URLs that are not anchored

Posted 2019-06-05 12:20

Question:

I'm trying to replace URLs inside an HTML code block that users post into an old web app with proper anchors (<a>) for those URLs.

The problem is that URLs may already be 'anchored', that is, contained in <a> elements. Those URLs should not be replaced.

Example:

  <a href="http://noreplace.com">http://noreplace.com</a>         <- do not replace
  <a href="http://noreplace.com"><u>http://noreplace.com</u></a>  <- do not replace
  <a href="...">...</a>http://replace.com                         <- replace

What would a regex that matches only 'non-anchored' URLs look like?

I use the following function to replace with RegEx:

Function ReplaceRegExp(strString, strPattern, strReplace)

    Dim RE: Set RE = New RegExp

    With RE
        .Pattern = strPattern
        .IgnoreCase = True
        .Global = True
        ReplaceRegExp = .Replace(strString, strReplace)
    End With

End Function

The following non-greedy regex is used to format UBB URLs. Can it be adapted to match only the URLs I need?

' the double doublequote in the brackets is because
' double doublequoting is ASP escaping for doublequotes
strString = ReplaceRegExp(strString, "\[URL=[""]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?[""]?\](.*?)\[/URL\]", "<a href=""$1$2$3$5"" target=""_blank"">$6</a>")

If this really cannot be done with a regex, what would the solution be in ASP Classic? Some code or pseudocode would be appreciated. However, I would much rather keep the code simple by adding one more regex line than add extra functions to this old code.

Thanks for your effort!

Answer 1:

It seems regular expressions alone are too complex for this kind of job, so I fell back on my rusty VBScript skills and wrote a function that first swaps the existing anchors for placeholders, then replaces the remaining URLs, and finally puts the original anchors back.

Here it is in case somebody needs it:

Function Linkify(Text)

    Dim regEx, Match, Matches, patternURLs, patternAnchors, lCount, anchorCount, replacements, key

    patternURLs = "((http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)"
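    ' \b keeps tags such as <abbr> from matching; [\s\S] also spans line breaks, which . does not
    patternAnchors = "<a\b[^>]*>[\s\S]*?</a>"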

    Set replacements=Server.CreateObject("Scripting.Dictionary")

    ' Create the regular expression.
    Set regEx = New RegExp
    regEx.Pattern = patternAnchors
    regEx.IgnoreCase = True
    regEx.Global = True

    ' Do the search for anchors.
    Set Matches = regEx.Execute(Text)

    lCount = 0

    ' Iterate through the existing anchors and replace with a placeholder
    For Each Match in Matches
        key = "<#" & lCount & "#>"
        replacements.Add key, Match.Value
        Text = Replace(Text,Cstr(Match.Value),key)
        lCount = lCount+1
    Next

    anchorCount = lCount

    ' we now search for URLs
    regEx.Pattern = patternURLs

    ' create anchors from URLs
    Text = regEx.Replace(Text, "<a href=""$1"">$1</a>")

    ' put back the originally existing anchors
    For lCount = 0 To anchorCount-1
        key = "<#" & lCount & "#>"
        Text = Replace(Text,key, replacements.Item(key))
    Next

    Linkify = Text

End Function
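
For comparison, here is a minimal sketch of a related single-pass idea. It is not the function above: instead of placeholders it uses one alternation that matches either a whole existing anchor or a bare URL, plus a replacement callback that leaves anchors untouched. It is written in Python rather than VBScript only so it can be run and checked easily. The names `linkify`, `ANCHOR` and `URL` are made up for illustration, and the URL pattern is deliberately simplified compared with the one in the question.

```python
import re

# An existing anchor: <a ...> up to the nearest </a>, across line breaks.
ANCHOR = r'<a\b[^>]*>.*?</a>'
# Simplified, hypothetical URL pattern; swap in a stricter one if needed.
URL = r'(?:https?|ftp)://[^\s<>"]+'

# The anchor alternative comes first, so an existing anchor is consumed
# whole before the URL alternative can match anything inside it.
PATTERN = re.compile(
    f'(?P<anchor>{ANCHOR})|(?P<url>{URL})',
    re.IGNORECASE | re.DOTALL,
)

def linkify(text):
    """Wrap bare URLs in <a> elements and leave existing anchors as they are."""
    def repl(match):
        if match.group('anchor') is not None:
            return match.group('anchor')  # already anchored: keep unchanged
        url = match.group('url')
        return f'<a href="{url}">{url}</a>'
    return PATTERN.sub(repl, text)

if __name__ == "__main__":
    print(linkify('<a href="...">...</a>http://replace.com'))
```

Like the VBScript version, this sketch only protects URLs inside <a> elements. A URL in another attribute, for example an <img src="...">, would still be wrapped, so treat it as a starting point rather than a complete solution.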


Answer 2:

The answer you're looking for lies in negative and positive lookaheads and lookbehinds.

This article gives a pretty good overview: http://www.regular-expressions.info/lookaround.html

Here's the Regular Expression I've formulated for your case:

(?<!"|>)(ht|f)tps?://.*?(?=\s|$)

Here's some sample data I matched against:

#Matches
http://www.website.com
https://www.website.com
This is a link http://www.website.com that is not linked
This is a long link http://www.website.com/index.htm?foo=bar
ftp://www.website.com

#No Matches
<u>http://www.website.com</u>
<a href="http://www.website.com">http://website.com</a>
<a href="https://www.website.com">http://website.com</a>
<a href="http://www.website.com"><u>http://www.website.com</u></a>
<a href="ftp://www.website.com">ftp://www.website.com</a>

Here's a breakdown of what the regular expression is doing:

(?<!"|>) A negative lookbehind, making sure that what matches next isn't preceded by a " or >

(ht|f)tps?://.*? This looks for http, https, or ftp and anything following it. It'll also match ftps! If you want to avoid this, you could use (https?|ftp)://.*? instead

(?=\s|$) A positive lookahead, which requires the match to be followed by whitespace or the end of the line.
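
To see what this pattern does and does not match, here is a small sketch in Python. Python is used only because its `re` module supports lookbehind. As far as I know, the VBScript RegExp object from the question supports lookahead but not lookbehind, so the pattern cannot be dropped into the ASP Classic code as-is. The helper name `find_urls` is made up for illustration.

```python
import re

# Pattern copied from the answer. re.MULTILINE makes $ match at the end of
# every line, so multi-line sample data is handled line by line.
PATTERN = re.compile(r'(?<!"|>)(ht|f)tps?://.*?(?=\s|$)', re.MULTILINE)

def find_urls(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in PATTERN.finditer(text)]

if __name__ == "__main__":
    samples = [
        'This is a link http://www.website.com that is not linked',
        '<a href="http://www.website.com"><u>http://www.website.com</u></a>',
        '<a href="...">...</a>http://replace.com',
    ]
    for sample in samples:
        print(find_urls(sample))
```

One thing worth checking against your own data: the third sample is the "replace" case from the question, and the pattern finds nothing in it, because the URL directly follows the > of the closing </a> tag.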

EXTRA CREDIT

(ht)?(?(1)tps?|ftp):// This matches http/https/ftp but not ftps. It's overkill when you can use (https?|ftp):// instead, but it's an awesome example of if/else in regex.
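
Here is a quick sketch of how the conditional behaves, in Python, whose `re` module supports the (?(1)yes|no) syntax. Conditionals are not available in every regex engine, and as far as I know the ECMAScript-style engine behind VBScript's RegExp does not have them. The helper name `has_allowed_scheme` is made up for illustration.

```python
import re

# (ht)? optionally captures "ht". (?(1)tps?|ftp) then branches:
# if group 1 took part in the match, expect "tp" or "tps";
# otherwise expect "ftp".
SCHEME = re.compile(r'(ht)?(?(1)tps?|ftp)://')

def has_allowed_scheme(url):
    """True if url starts with http://, https:// or ftp://."""
    return SCHEME.match(url) is not None

if __name__ == "__main__":
    for url in ('http://a.com', 'https://a.com', 'ftp://a.com', 'ftps://a.com'):
        print(url, has_allowed_scheme(url))
```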



Answer 3:

Some design issues you're going to have to work around:

  • Embedded URLs could be absolute or relative and may not include the protocol.
  • Your HTML may not have quotes around attribute values.
  • The character right after a URL may also be a valid URL character.
  • There are lots of valid URL characters these days.

If you can assume (1) absolute URLs with protocols and (2) quoted HTML attributes and (3) people will have whitespace after a URL and (4) you're sticking with supporting only basic URL characters, you can just look for URLs not preceded by a double-quote.

Here's an overly simple example to start with (untested):

(?<!")((http|https|ftp)://[^\s<>]+)(?=\s|$)  replaced with <a href="$1">$1</a>

The [^\s<>]+ part above is ridiculously greedy, so all of the fun will be in tweaking that to build a character set that fits the URLs your users are typing in. Your example shows a much more involved character class with \w plus a hodge-podge of other allowed characters, so you could start there if you want.
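
As a rough illustration of that tweaking, here is a sketch in Python (chosen because its `re` module supports the lookbehind). `SIMPLE` uses [^\s<>]+ (one or more characters) for the URL body. `TRIMMED` is a hypothetical variation that refuses to end a URL on common trailing punctuation. The names `SIMPLE`, `TRIMMED`, `linkify_simple` and `linkify_trimmed` are made up, and the punctuation set is an assumption to adjust for your own data.

```python
import re

# Starting point: scheme, then one or more characters that are not
# whitespace or angle brackets, followed by whitespace or end of text.
SIMPLE = re.compile(r'(?<!")((http|https|ftp)://[^\s<>]+)(?=\s|$)')

# Hypothetical tweak: the last URL character may not be sentence
# punctuation, and such punctuation may sit between the URL and the
# whitespace (or end of text) that follows it.
TRIMMED = re.compile(
    r'(?<!")((http|https|ftp)://[^\s<>]*[^\s<>.,;:!?)])'
    r'(?=[.,;:!?)]*(?:\s|$))'
)

REPLACEMENT = r'<a href="\1">\1</a>'

def linkify_simple(text):
    return SIMPLE.sub(REPLACEMENT, text)

def linkify_trimmed(text):
    return TRIMMED.sub(REPLACEMENT, text)

if __name__ == "__main__":
    print(linkify_simple('see http://replace.com.'))
    print(linkify_trimmed('see http://replace.com.'))
```

With the simple pattern the trailing full stop ends up inside the link. The trimmed variant leaves it outside. Both still rely on assumption (3) above, so a URL followed directly by a tag such as <br> is not matched.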