Use HttpWebRequest to download web pages without key sensitive issues

Posted 2019-06-02 22:43

Question:

Use HttpWebRequest to download web pages without key sensitive issues

Answer 1:

[update: I don't know why, but both examples below now work fine! Originally I was also seeing a 403 on the page2 example. Maybe it was a server issue?]

First, WebClient is easier. Actually, I've seen this before. It turned out to be case sensitivity in the URL when accessing Wikipedia; try ensuring that you have used the same case in your request to Wikipedia.

[updated] As Bruno Conde and gimel observe, using %27 should help make it consistent (the intermittent behaviour suggests that some Wikipedia servers may be configured differently from others).

I've just checked, and in this case the case issue doesn't seem to be the problem... however, if it worked (it doesn't), this would be the easiest way to request the page:

        using (WebClient wc = new WebClient())
        {
            string page1 = wc.DownloadString("http://en.wikipedia.org/wiki/Algeria");

            string page2 = wc.DownloadString("http://en.wikipedia.org/wiki/%27Abadilah");
        }

I'm afraid I can't think what to do about the leading apostrophe that is breaking things...



Answer 2:

I also got strange results ... First, the

http://en.wikipedia.org/wiki/'Abadilah

didn't work and after some failed tries it started working.

The second url,

http://en.wikipedia.org/wiki/'t_Zand_(Alphen-Chaam)

always failed for me...

The apostrophe seems to be responsible for these problems. If you replace it with %27, all URLs work fine.



Answer 3:

Try escaping the special characters using Percent-Encoding (RFC 3986, section 2.1). For example, a single quote is represented by %27 in the URL (IRI).
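
As a minimal sketch of that suggestion, the apostrophe can be replaced by hand before the request is made. The class and method names here (`WikiUrl`, `EscapeApostrophes`) are made up for illustration and are not part of any library. A plain string replacement is used because, as far as I know, whether `Uri.EscapeDataString` escapes an apostrophe depends on the .NET version.

```csharp
using System;

static class WikiUrl
{
    // Hypothetical helper: percent-encodes only the apostrophe,
    // leaving the rest of the URL untouched.
    public static string EscapeApostrophes(string url)
    {
        return url.Replace("'", "%27");
    }
}

static class Program
{
    static void Main()
    {
        // Prints http://en.wikipedia.org/wiki/%27Abadilah
        Console.WriteLine(WikiUrl.EscapeApostrophes("http://en.wikipedia.org/wiki/'Abadilah"));
    }
}
```

The same call turns `'t_Zand_(Alphen-Chaam)` into `%27t_Zand_(Alphen-Chaam)`; the parentheses are left alone, since only the apostrophe was reported as a problem in this thread.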



Answer 4:

I'm sure the OP has this sorted by now, but I've just run across the same kind of problem: intermittent 403s when downloading from Wikipedia via a web client. Setting a User-Agent header sorts it out:

client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
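
Since the question asks about HttpWebRequest rather than WebClient, here is a rough sketch of the same idea with that class, where the header is set through the `UserAgent` property. The names `PageDownloader`, `CreateRequest` and `Download` are hypothetical, and the code assumes the page is served as UTF-8 (the `StreamReader` default).

```csharp
using System;
using System.IO;
using System.Net;

static class PageDownloader
{
    // Hypothetical helper: builds the request and sets an explicit User-Agent.
    public static HttpWebRequest CreateRequest(string url, string userAgent)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = userAgent;
        return request;
    }

    // Sends the request and returns the response body as a string.
    // Assumes a UTF-8 response, which is what StreamReader decodes by default.
    public static string Download(string url, string userAgent)
    {
        HttpWebRequest request = CreateRequest(url, userAgent);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call might look like `PageDownloader.Download("http://en.wikipedia.org/wiki/%27Abadilah", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)")`, reusing the user agent string from above. A 403 would surface as a `WebException` thrown by `GetResponse`.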