Regex string issue in making plain-text URLs clickable

Posted 2019-02-13 23:05

Question:

I need working Regex code in C# that detects plain-text URLs (http/https/ftp/ftps) in a string and makes them clickable by wrapping each one in an anchor tag that points to the same URL. I have already written a Regex pattern; the code is shown below.

However, if the input string already contains a clickable URL, the code wraps another anchor tag around it. For example, the existing substring "ftp://www.abc.com'>ftp://www.abc.com" in sContent below gets a second anchor tag when the code runs. Is there any way to fix this?

        string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

        Regex regx = new Regex("(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

        MatchCollection matches = regx.Matches(sContent);

        foreach (Match match in matches)
        {
            sContent = sContent.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
        }

I also want Regex code that makes email addresses clickable with a "mailto:" link. I can write it myself, but the same double-anchor problem would appear there too.

Answer 1:

I noticed in your example test string that a duplicate link, e.g. ftp://www.abc.com, appears both already linked and as plain text, and the result is that the link gets a double anchor. The regular expression that you already have, and the one that @stema has supplied, will work, but you need to change how you replace the matches in the sContent variable.
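To see why the replacement step is the culprit, here is a minimal sketch. It assumes a simplified stand-in pattern rather than the one from the question, and DoubleWrapDemo, NaiveLinkify and CountAnchors are hypothetical names. String.Replace rewrites every occurrence of the matched text, and the loop calls it once per match, so a URL that appears twice gets wrapped again on the second pass.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DoubleWrapDemo
{
    // Simplified stand-in pattern (an assumption, not the pattern from the question).
    private static readonly Regex Url = new Regex(@"ftp://[\w.]+");

    // Mirrors the question's loop: String.Replace rewrites EVERY occurrence
    // of the matched text, not just the occurrence that produced the match.
    public static string NaiveLinkify(string content)
    {
        foreach (Match m in Url.Matches(content))
        {
            content = content.Replace(m.Value, "<a href='" + m.Value + "'>" + m.Value + "</a>");
        }
        return content;
    }

    // Counts opening anchor tags in the result.
    public static int CountAnchors(string html)
    {
        return Regex.Matches(html, "<a ").Count;
    }

    public static void Main()
    {
        string result = NaiveLinkify("x ftp://a.b y ftp://a.b");
        Console.WriteLine(CountAnchors(result)); // 6, although only two links were wanted
    }
}
```

Replacing by match position instead of by match text avoids this.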

The following code example should give you what you want:

string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";

Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

MatchCollection matches = regx.Matches(sContent);

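// Walk the matches from last to first so that the indexes of earlier
// matches stay valid after each insertion.
for (int i = matches.Count - 1; i >= 0; i--)
{
    string newURL = "<a href='" + matches[i].Value + "'>" + matches[i].Value + "</a>";

    // Replace by position, not by text, so only this occurrence is touched.
    sContent = sContent.Remove(matches[i].Index, matches[i].Length).Insert(matches[i].Index, newURL);
}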
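
As an alternative sketch: because the lookbehind already skips URLs that are linked, a single Regex.Replace call can do the substitution with no index bookkeeping. The pattern here is a simplified stand-in for the one above, and Linkifier and Linkify are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Simplified stand-in pattern (an assumption): the same lookbehind idea,
    // with a shorter URL body written as a verbatim string.
    private static readonly Regex UrlRegex = new Regex(
        @"(?<!href='|<a[^>]*>)(?:https?|ftps?)://[-\w~!@#$%^&*()=+\\/?.:;',]+",
        RegexOptions.IgnoreCase);

    // $0 in the replacement string stands for the whole match.
    public static string Linkify(string input)
    {
        return UrlRegex.Replace(input, "<a href='$0'>$0</a>");
    }

    public static void Main()
    {
        string sContent = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc ftp://www.abc.com abbbbb http://www.abc2.com";
        Console.WriteLine(Linkify(sContent));
    }
}
```

Regex.Replace substitutes each match at its own position, so the already-linked URL is left alone and the two plain-text URLs are wrapped once each.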


Answer 2:

Try this:

Regex regx = new Regex("(?<!(?:href='|>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

It should work for your example.

(?<!(?:href='|>)) is a negative lookbehind, which means the pattern matches only if it is not preceded by "href='" or ">".

See the lookaround page on regular-expressions.info, and especially the zero-width negative lookbehind assertion documentation on MSDN.

See something similar on Regexr. I had to remove the alternation from the lookbehind there, but .NET should be able to handle it.

Update

To make sure that cases like "<p>ftp://www.def.com</p>" are also handled correctly, I improved the regex:

Regex regx = new Regex("(?<!(?:href='|<a[^>]*>))(http|https|ftp|ftps)://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

The lookbehind (?<!(?:href='|<a[^>]*>)) now checks that the match is preceded neither by "href='" nor by a tag starting with "<a".

For the test string

ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p>ftp://www.def.com</p> abbbbb http://www.ghi.com

this expression produces

ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p><a href='ftp://www.def.com'>ftp://www.def.com</a></p> abbbbb <a href='http://www.ghi.com'>http://www.ghi.com</a>
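
A small sketch comparing the two lookbehinds on that test string. The URL body is shortened to a simplified stand-in (an assumption for readability), and LookbehindDemo and its method names are hypothetical; only the lookbehind differs between the two methods.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Simplified URL body (an assumption); the lookbehinds are the ones discussed above.
    private const string UrlBody = @"(?:https?|ftps?)://[\w.]+";
    private const string Anchor = "<a href='$0'>$0</a>";

    // First version: skips a URL preceded by href=' or by ANY '>'.
    public static string LinkifyFirst(string s)
    {
        return Regex.Replace(s, @"(?<!href='|>)" + UrlBody, Anchor, RegexOptions.IgnoreCase);
    }

    // Improved version: skips a URL preceded by href=' or by an opening <a ...> tag only.
    public static string LinkifyImproved(string s)
    {
        return Regex.Replace(s, @"(?<!href='|<a[^>]*>)" + UrlBody, Anchor, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string test = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p>ftp://www.def.com</p> abbbbb http://www.ghi.com";
        Console.WriteLine(LinkifyFirst(test));    // leaves <p>ftp://www.def.com</p> unlinked
        Console.WriteLine(LinkifyImproved(test)); // links the URL inside <p> as well
    }
}
```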


Answer 3:

I know I arrived late to this party, but there are several problems with the regex that the existing answers don't address. First and most annoying, there's that forest of backslashes. If you use C#'s verbatim strings, you don't have to do all that double escaping. And anyway, most of the backslashes weren't needed in the first place.

Second, there's this bit: ([\\w+?\\.\\w+])+. The square brackets form a character class, and everything inside them is treated either as a literal character or a class shorthand like \w. But getting rid of the square brackets isn't enough to make it work. I suspect this is what you were trying for: \w+(?:\.\w+)+.
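
The first two points can be checked with a short sketch. CharClassDemo and FullMatch are hypothetical names; the patterns are the fragments quoted above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Doubled escapes in a regular literal and a verbatim literal give the same string.
    public const string Escaped = "\\w+(?:\\.\\w+)+";
    public const string HostPattern = @"\w+(?:\.\w+)+";

    // The original fragment: a class matching ONE word character, '+', '?' or '.',
    // repeated one or more times.
    public const string OriginalFragment = @"([\w+?\.\w+])+";

    // True when the whole input matches the pattern.
    public static bool FullMatch(string input, string pattern)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    public static void Main()
    {
        Console.WriteLine(Escaped == HostPattern);                 // True
        Console.WriteLine(FullMatch("+?..+", OriginalFragment));   // True: punctuation alone is accepted
        Console.WriteLine(FullMatch("+?..+", HostPattern));        // False
        Console.WriteLine(FullMatch("www.abc.com", HostPattern));  // True
        Console.WriteLine(FullMatch("www", HostPattern));          // False: needs at least one dot
    }
}
```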

Third, the quantifiers at the end of the regex - ]*)? - are mismatched. * can match zero or more characters, so there's no point making the enclosing group optional. Also, that kind of arrangement can result in severe performance degradation. See this page for details.

There are other, minor problems, but I won't go into them right now. Here's the new and improved regex:

@"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))"

The negative lookahead, (?!(?>[^<>]*)(>|</a>)), is what prevents matches inside tags or in the content of an anchor element. It uses an atomic group, (?>...), because .NET does not support possessive quantifiers such as *+. It's still very crude, though. There are several areas, like inside <script> elements, where you don't want it to match but it does. But trying to cover all the possibilities would result in a mile-long regex.
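
A sketch of how that regex might be applied with Regex.Replace. The inner [^<>]* of the lookahead is wrapped in an atomic group, (?>[^<>]*), which is the construct .NET offers in place of possessive quantifiers. LookaheadLinkifier and Linkify are hypothetical names, and the test string is borrowed from the other answers.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadLinkifier
{
    // (?n) turns on explicit capture, so the plain parentheses do not capture.
    // The lookahead rejects a match when only non-angle-bracket characters
    // separate it from a '>' (inside a tag) or from "</a>" (inside anchor content).
    private static readonly Regex UrlRegex = new Regex(
        @"(?n)(https?|ftps?)://\w+(\.\w+)+([-a-zA-Z0-9~!@#$%^&*()_=+/?.:;',\\]*)(?!(?>[^<>]*)(>|</a>))");

    // $0 in the replacement string stands for the whole match.
    public static string Linkify(string input)
    {
        return UrlRegex.Replace(input, "<a href='$0'>$0</a>");
    }

    public static void Main()
    {
        string test = "ttt <a href='ftp://www.abc.com'>ftp://www.abc.com</a> abc <p>ftp://www.def.com</p> abbbbb http://www.ghi.com";
        Console.WriteLine(Linkify(test));
    }
}
```

On this input the existing anchor is left untouched, while the URL inside the <p> element and the trailing plain-text URL are each wrapped once.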



Answer 4:

Check out "Detect email in text using regex" and "Regex URL Replace, ignore Images and existing Links". Just swap in your own regex for links; the approach never replaces a link inside a tag, only in text content. It uses the HTML Agility Pack:

http://html-agility-pack.net/?z=codeplex

Something like:


string textToBeLinkified = "... your text here ...";
const string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
Regex urlExpression = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(textToBeLinkified);

// SelectNodes returns null when nothing matches, so guard before iterating.
var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        node.InnerHtml = urlExpression.Replace(node.InnerHtml, @"<a href=""$0"">$0</a>");
    }
}
string linkifiedText = doc.DocumentNode.OuterHtml;


Tags: c# .net regex url