I wrote an XML grabber to fetch and decode XML files from a website. It mostly works fine, but it always returns the error:
"The remote server returned an error: (403) Forbidden."
for the site http://w1.weather.gov/xml/current_obs/KSRQ.xml
My code is:
XmlDocument xmldoc = new XmlDocument();
CookieContainer cookies = new CookieContainer();

// Path holds the URL of the XML file to fetch.
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(Path);
webRequest.Method = "GET";
webRequest.CookieContainer = cookies;

using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
{
    using (StreamReader streamReader = new StreamReader(webResponse.GetResponseStream()))
    {
        // Read the whole response body and load it into the XML document.
        string xml = streamReader.ReadToEnd();
        xmldoc.LoadXml(xml);
    }
}
And the exception is thrown in the GetResponse method. How can I find out what happened?
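One way to find out is to catch the WebException that GetResponse throws and inspect the response attached to it; a 403 still carries a status line, headers, and usually a body. A minimal sketch, reusing the webRequest and xmldoc variables from the snippet above:

try
{
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (StreamReader streamReader = new StreamReader(webResponse.GetResponseStream()))
    {
        xmldoc.LoadXml(streamReader.ReadToEnd());
    }
}
catch (WebException ex)
{
    // For protocol errors such as 403, the server's reply is attached to the exception.
    HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
    if (errorResponse != null)
    {
        Console.WriteLine("Status: " + (int)errorResponse.StatusCode + " " + errorResponse.StatusDescription);
        Console.WriteLine(errorResponse.Headers);
        using (StreamReader reader = new StreamReader(errorResponse.GetResponseStream()))
        {
            // The body often contains the server's explanation for the rejection.
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}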
It could be that your request is missing a header that is required by the server. I requested the page in a browser, recorded the exact request using Fiddler, then removed the User-Agent header and reissued the request. This resulted in a 403 response. That check is often used by servers in an attempt to prevent scripting of their sites, just like you are doing ;o)
In this case, the Server header in the 403 response is "AkamaiGHost", which indicates an edge node from one of Akamai's cloud security products. Perhaps a WAF rule intended to block bots is triggering the 403.
It seems like adding any value to the User-Agent header will work for this site. For example, I set it to "definitely-not-a-screen-scraper" and that seems to work fine. In general, when you have this kind of problem it very often helps to look at the actual HTTP requests and responses using browser tools or a proxy like Fiddler. As Scott Hanselman says: http://www.hanselman.com/blog/TheInternetIsNotABlackBoxLookInside.aspx
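A minimal sketch of what that looks like on the request from the question (the value is arbitrary; the server only appears to check that some User-Agent is present):

// Any non-empty User-Agent appeared to satisfy this server; the value below is arbitrary.
webRequest.UserAgent = "definitely-not-a-screen-scraper";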
In my particular case, it was not the UserAgent header, but the Accept header that the server didn't like.
You can use the network tab of the browser's dev tools to see what the correct headers should be.
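For completeness, a sketch of setting that header on the request from the question (the value shown is a typical browser Accept string, not one taken from the original answer; adjust it to whatever the network tab shows):

// Mirror the Accept header a real browser sends.
webRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";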
Most likely you don't have permission to access the resource you are trying to reach, so you need to acquire the necessary credentials to complete the request.
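If that is the case, a minimal sketch of attaching credentials to the request (the user name and password are placeholders, not values from the original post):

// Placeholder credentials; substitute whatever the server actually requires.
webRequest.Credentials = new NetworkCredential("user", "password");
// Or, for a server that accepts the current Windows identity:
// webRequest.Credentials = CredentialCache.DefaultCredentials;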
Clearly, the URL works from a browser. It just doesn't work from the code. It would appear that the server is accepting/rejecting requests based on the user agent, probably as a very basic way of trying to prevent crawlers.
To get through, just set the UserAgent property to something the server will recognize, for instance a browser-style string like the one in the sketch below. That does seem to work.
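For example (the string below is just a typical browser user-agent used as an illustration; the exact value matters less than having one at all):

// A browser-like User-Agent string.
webRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36";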
Is your request going through a proxy server? If so, add the following line before your GetResponse() call.
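The answer's exact line is an assumption here; a common option, if the proxy requires authentication, is to route the request through the system proxy and forward the current user's credentials, roughly:

// Sketch only: use the system proxy and pass along the current user's credentials.
webRequest.Proxy = WebRequest.GetSystemWebProxy();
webRequest.Proxy.Credentials = CredentialCache.DefaultCredentials;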