I'm trying to replace Urls contained inside a HTML code block the users post into an old web-app with proper anchors (<A>
) for those Urls.
The problem is that Urls can be already 'anchored', that is contained in <A>
elements. Those Url should not be replaced.
Example:
<a href="http://noreplace.com">http://noreplace.com</a> <- do not replace
<a href="http://noreplace.com"><u>http://noreplace.com</u></a> <- do not replace
<a href="...">...</a>http://replace.com <- replace
What would the regex to match only 'not anchored Urls' look like?
I use the following function to replace with RegEx:
Function ReplaceRegExp(strString, strPattern, strReplace)
Dim RE: Set RE = New RegExp
With RE
.Pattern = strPattern
.IgnoreCase = True
.Global = True
ReplaceRegExp = .Replace(strString, strReplace)
End With
End Function
The following non greedy regex is used to format UBB URLs. Can this regex be adapted to match only the ones I need?
' the double doublequote in the brackets is because
' double doublequoting is ASP escaping for doublequotes
strString = ReplaceRegExp(strString, "\[URL=[""]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?[""]?\](.*?)\[/URL\]", "<a href=""$1$2$3$5"" target=""_blank"">$6</a>")
If this really cannot be done with RegEx, what would be the solution in ASP Classic, with some code or pseudocode please? However I would really like to keep code simple with an additional regex line than add additional functions to this old code.
Thanks for your effort!
Some design issues you're going to have to work around:
If you can assume (1) absolute URLs with protocols and (2) quoted HTML attributes and (3) people will have whitespace after a URL and (4) you're sticking with supporting only basic URL characters, you can just look for URLs not preceded by a double-quote.
Here's an overly-simple example to start with (untested):
The
[^\s<>]
part above is ridiculously greedy, so all of the fun will be in tweaking that to build a character set that fits the URLs your users are typing in. Your example shows a much more involved character class with\w
plus a hodge-podge of other allowed characters, so you could start there if you want.Seems like regular expressions are too complex to use for this kind of job so I went to my rusty VBScript skills and coded a function that first removes anchors and then replaces the URLs.
Here it is if somebody may need it:
The answer you're looking for is in negative and positive look aheads and look behinds
This article gives a pretty good overview: http://www.regular-expressions.info/lookaround.html
Here's the Regular Expression I've formulated for your case:
Here's some sample data I matched against:
Here's a breakdown of what the regular expression is doing:
(?<!"|>)
A negative look behind, making sure what matches next isn't preceded by a " or >(ht|f)tps?://.*?
This looks for http, https, or ftp and anything following it. It'll also match ftps! If you want to avoid this, you could use(https?|ftp)://.*?
instead(?=\s|$)
This is a positive look ahead, which matches a space or end of line.EXTRA CREDIT
(ht)?(?(1)tps?|ftp)://
This will match http/https/ftp but not ftps, this may be a bit overkill when you can use(https?|ftp)://
but it's an awesome example of if/else in regex.