I'm mirroring some internal websites for backup purposes. Right now I use this C# code:
System.Net.WebClient client = new System.Net.WebClient();
byte[] dl = client.DownloadData(url);
This downloads the HTML into a byte array, which is what I want. The problem is that the links within the HTML are usually relative, not absolute.
I want to prepend the full http://domain to each relative link, converting it to an absolute link that points to the original content. I'm only concerned with href= and src=. Is there a regular expression that covers the basic cases?
Edit [My Attempt]:
public static string RelativeToAbsoluteURLS(string text, string absoluteUrl)
{
    if (String.IsNullOrEmpty(text))
    {
        return text;
    }
    String value = Regex.Replace(
        text,
        "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>",
        "<$1$2=\"" + absoluteUrl + "$3\"$4>",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);
    return value.Replace(absoluteUrl + "/", absoluteUrl);
}
Just use this function
The most robust solution would be to use the Html Agility Pack, as others have suggested. However, a reasonable solution using regular expressions is possible using the Replace overload that takes a MatchEvaluator delegate, as follows:
The above sample searches for attributes named src and href that contain double-quoted values starting with a forward slash. For each match, the static Uri.TryCreate method is used to determine whether the value is a valid relative URI.
Note that this solution doesn't handle single-quoted attribute values, and it certainly doesn't work on poorly formed HTML with unquoted values.
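A minimal sketch of the approach described above, assuming a base address supplied as a Uri. The class name HtmlLinkRewriter, the method name ToAbsoluteUrls, and the exact pattern are illustrative assumptions rather than the original listing. The pattern deliberately skips protocol-relative values such as //cdn.example/x.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlLinkRewriter
{
    // Matches src="/..." or href="/..." with a double-quoted value that starts
    // with exactly one forward slash (protocol-relative "//host" is skipped).
    private static readonly Regex RootRelativeAttribute = new Regex(
        "(?<name>src|href)=\"(?<value>/(?!/)[^\"]*)\"",
        RegexOptions.IgnoreCase);

    public static string ToAbsoluteUrls(string html, Uri baseUri)
    {
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }

        // The lambda is converted to a MatchEvaluator and runs once per match.
        return RootRelativeAttribute.Replace(html, match =>
        {
            string value = match.Groups["value"].Value;

            // Rewrite only values that parse as a relative URI and resolve
            // against the base address.
            if (Uri.TryCreate(value, UriKind.Relative, out Uri relative) &&
                Uri.TryCreate(baseUri, relative, out Uri absolute))
            {
                return match.Groups["name"].Value + "=\"" + absolute.AbsoluteUri + "\"";
            }

            // Anything else is returned unchanged.
            return match.Value;
        });
    }
}
```

Called as HtmlLinkRewriter.ToAbsoluteUrls(html, new Uri("http://intranet.example/")), where the host name is a placeholder, this should turn href="/docs/a.html" into href="http://intranet.example/docs/a.html" and leave already absolute links alone.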
You should use the Html Agility Pack to load the HTML, access all the hrefs with it, and then use the Uri class to convert from relative to absolute as necessary.
See for example http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/
This is what you are looking for. This code snippet converts all the relative URLs to absolute ones inside any HTML code:
I think url is of type string. Use Uri instead with a base uri pointing to your domain:
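A small sketch of resolving a link against a base Uri, assuming the base address is the URL of the page that was downloaded. The class name UrlResolver and the method name Resolve are hypothetical, and the host names in the usage note are placeholders.

```csharp
using System;

public static class UrlResolver
{
    // Resolves a link found in a page against the page's own address.
    // Links that are already absolute are returned as absolute URIs unchanged.
    public static string Resolve(string baseUrl, string link)
    {
        var baseUri = new Uri(baseUrl, UriKind.Absolute);

        // The Uri(Uri, string) constructor combines the base with the link.
        return new Uri(baseUri, link).AbsoluteUri;
    }
}
```

For example, with a base of http://intranet.example/docs/index.html, the link images/logo.png resolves to http://intranet.example/docs/images/logo.png. Letting Uri do the combining avoids the doubled or missing slashes that plain string concatenation tends to produce.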
Simple function