I am trying to extract the URL from an <a> tag; however, instead of getting https://website.com/-id1, I am getting the tag's link text. Here is my code:
string text="<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a>";
string parsed = Regex.Replace(text, " <[^>] + href =\"([^\"]+)\"[^>]*>", "$1 " );
parsed = Regex.Replace(parsed, "<[^>]+>", "");
Console.WriteLine(parsed);
The result I got was MyLink, which is not what I want. I want something like:
https://website.com/-id1
Any help or a link will be highly appreciated.
Regular expressions can be used with HTML only in very specific, simple cases. For example, if the text contains only a single tag, you can use "href\\s*=\\s*\"(?<url>.*?)\"" to extract the URL, e.g.:
var url=Regex.Match(text,"href\\s*=\\s*\"(?<url>.*?)\"").Groups["url"].Value;
This returns:
https://website.com/-id1
This regex doesn't do anything fancy. It looks for href= with possible whitespace around the equals sign, then captures everything between the first double quote and the next one in a non-greedy manner (.*?). The captured text is available in the named group url.
Anything fancier gets complicated quickly. For example, supporting both single and double quotes requires special handling to avoid starting on a single quote and ending on a double quote. The string could also contain multiple <a> tags that use both types of quotes.
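As a rough sketch of the special handling mentioned above, a backreference can force the closing quote to be the same character as the opening one. The sample string and the group name q are illustrative assumptions, and this still does not cover unquoted attribute values or other HTML edge cases:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample: one tag uses double quotes, the other single quotes.
var html = "<a href=\"https://website.com/-id1\">One</a><a href='https://website.com/-id2'>Two</a>";

// (?<q>["']) remembers which quote opened the value;
// \k<q> requires that same quote character to close it.
var hrefPattern = new Regex(@"href\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>");

var urls = new List<string>();
foreach (Match m in hrefPattern.Matches(html))
{
    urls.Add(m.Groups["url"].Value);
}

// Prints one URL per line.
Console.WriteLine(string.Join(Environment.NewLine, urls));
```

Even with the backreference, the pattern knows nothing about tag boundaries, so it would also pick up an href that appears inside a comment or a script block.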
For complex parsing, it is better to use a library such as AngleSharp or HtmlAgilityPack.
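For comparison, here is a minimal sketch of the same extraction with HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is installed; the sample string is made up for illustration and mixes both quote styles on purpose:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

// Hypothetical sample input with two links.
var html = "<a style=\"font-weight: bold;\" href=\"https://website.com/-id1\">MyLink</a>" +
           "<a href='https://website.com/-id2'>MyLink2</a>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var links = new List<string>();

// SelectNodes returns null when nothing matches, so guard against that.
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (HtmlNode a in anchors)
    {
        // The second argument is the fallback used when the attribute is missing.
        links.Add(a.GetAttributeValue("href", string.Empty));
    }
}

Console.WriteLine(string.Join(Environment.NewLine, links));
```

The parser deals with quoting and attribute order itself, so the code only has to say which elements and which attribute it wants.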
Try this:
var input = "<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a><a style=\"font - weight: bold; \" href=\"https://website.com/-id2\">MyLink2</a>";
var r = new Regex("<a.*?href=\"(.*?)\".*?>");
var output = r.Matches(input);
var urls = new List<string>();
foreach (Match item in output) {
urls.Add(item.Groups[1].Value);
}
This finds all <a> tags, extracts their href values, and stores them in the urls list.
Explanation
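<a : matches the beginning of the <a> tag
.*?href= : matches anything, as little as possible, up to href=
"(.*?)" : matches and captures everything inside the double quotes
.*?> : matches the rest of the opening tag, up to its closing >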
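The breakdown above can also be kept next to the pattern itself. The sketch below assumes the same two-link input and writes the same pattern with RegexOptions.IgnorePatternWhitespace, which ignores layout whitespace and allows # comments inside the pattern. The variable name annotated is only illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var input = "<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a>" +
            "<a style=\"font - weight: bold; \" href=\"https://website.com/-id2\">MyLink2</a>";

// Same pattern as in the answer, spread over several lines with comments.
// In a verbatim string, "" stands for a single double-quote character.
var annotated = new Regex(@"
    <a           # beginning of the <a> tag
    .*?href=     # anything, lazily, up to href=
    ""(.*?)""    # capture whatever sits between the double quotes
    .*?>         # the rest of the opening tag, up to its >
", RegexOptions.IgnorePatternWhitespace);

// Cast<Match>() keeps this working on runtimes where MatchCollection is not generic.
List<string> urls = annotated.Matches(input)
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToList();

Console.WriteLine(string.Join(Environment.NewLine, urls));
```

Note that this option only works as a drop-in here because the original pattern contains no literal spaces; a pattern that must match a space would need it escaped.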