WebRequest: How to find a postal code using a WebR

Posted 2020-06-23 07:56

Question:

I first posted this: HttpWebRequest: How to find a postal code at Canada Post through a WebRequest with x-www-form-enclosed?.

Following AnthonyWJones's suggestions, I changed my code.

Continuing my investigation, I have noticed over time that the content type returned by Canada Post is most likely "application/xhtml+xml, text/xml, text/html; charset=utf-8".

My questions are:

  1. How do we issue a WebRequest against a website that uses such a content type?
  2. Do we have to keep using the NameValueCollection object?
  3. According to Scott Lance, who generously provided valuable information in my previous question, the WebRequest should return the information whatever the content type may be. Am I missing something here?
  4. Do I have to change my code because of the content-type change?

Here is my code so that it might be easier to understand my progress.

internal class PostalServicesFactory {
    /// <summary>
    /// Initializes an instance of GI.BusinessSolutions.Services.PostalServices.Types.PostalServicesFactory class.
    /// </summary>
    internal PostalServicesFactory() {
    }

    /// <summary>
    /// Finds a Canadian postal code for the provided Canadian address.
    /// </summary>
    /// <param name="address">The instance of GI.BusinessSolutions.Services.PostalServices.ICanadianCityAddress for which to find the postal code.</param>
    /// <returns>The postal code found, otherwise null.</returns>
    internal string FindPostalCode(ICanadianCityAddress address) {
        if (address == null)
            throw new InvalidOperationException("No valid address specified.");

        using (ServicesWebClient swc = new ServicesWebClient()) {
            var values = new System.Collections.Specialized.NameValueCollection();

            values.Add("streetNumber", address.StreetNumber.ToString());
            values.Add("numberSuffix", address.NumberSuffix);
            values.Add("suite", address.Suite);
            values.Add("streetName", address.StreetName);
            values.Add("streetDirection", address.StreetDirection);
            values.Add("city", address.City);
            values.Add("province", address.Province);

            byte[] resultData = swc.UploadValues(@"http://www.canadapost.ca/cpotools/apps/fpc/personal/findByCity", "POST", values);

            return Encoding.UTF8.GetString(resultData);
        }
    }

    private class ServicesWebClient : WebClient {
        public ServicesWebClient()
            : base() {
        }

        protected override WebRequest GetWebRequest(Uri address) {
            var request = (HttpWebRequest)base.GetWebRequest(address);
            request.CookieContainer = new CookieContainer();
            return request;
        }
    }
}

This code actually returns the HTML source of the form that must be filled in with the required information in order to proceed with the postal code search. What I want is the HTML source (or whatever the response may be) that contains the postal code that was found.

EDIT: Here's the WebException I get now: "Unable to send a content body with this type of verb." (This is a translation from the French exception "Impossible d'envoyer un corps de contenu avec ce type de verbe.")

Here's my code:

    internal string FindPostalCode(string url, ICanadianAddress address) {
        string htmlResult = null;

        using (var swc = new ServiceWebClient()) {
            var values = new System.Collections.Specialized.NameValueCollection();

            values.Add("streetNumber", address.StreetNumber.ToString());
            values.Add("numberSuffix", address.NumberSuffix);
            values.Add("suite", address.Suite);
            values.Add("streetName", address.StreetName);
            values.Add("streetDirection", address.StreetDirection);
            values.Add("city", address.City);
            values.Add("province", address.Province);

            swc.UploadValues(url, @"POST", values);
            string redirectUrl = swc.ResponseHeaders.GetValues(@"Location")[0];
            => swc.UploadValues(redirectUrl, @"GET", values);
        }

        return htmlResult;
    }

The line that causes the exception is marked with "=>". It seems that I can't use GET as the method, yet this is what I was told to do...

Any idea what I'm missing here? I am trying to do what Justin (see answer) recommended.

Thanks in advance for any help! :-)

Answer 1:

As an introduction to the world of screen-scraping, you've picked a very hard case! Canada Post's lookup page works like this:

  1. The first page is a form which accepts the address values.
  2. This page POSTs to a second URL.
  3. That second URL in turn redirects (using an HTTP 302 redirect) to a third URL, which actually shows you the HTML response containing the postal code.

Making matters worse, the page in step #3 needs the cookie set in step #1, so you need to use the same CookieContainer for all three requests (although it may be sufficient to send the same CookieContainer to #2 and #3 only).
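A minimal sketch of sharing one CookieContainer across requests, assuming a WebClient subclass along the lines of the question's ServicesWebClient. The class name CookieAwareWebClient and its Cookies property are hypothetical. Note that the ServicesWebClient in the question creates a new CookieContainer inside GetWebRequest, so every request starts with an empty cookie jar; keeping the container in a field avoids that.

```csharp
using System;
using System.Net;

// Hypothetical WebClient subclass that reuses one cookie jar for every request it issues.
class CookieAwareWebClient : WebClient
{
    // Created once per client, not once per request.
    private readonly CookieContainer cookies = new CookieContainer();

    public CookieContainer Cookies
    {
        get { return cookies; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests carry cookies; leave other request types untouched.
        var http = request as HttpWebRequest;
        if (http != null)
            http.CookieContainer = cookies;

        return request;
    }
}
```

Any cookie the server sets on one response is then sent automatically on later requests made through the same client instance, subject to the usual domain and path matching.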

Furthermore, you may need to send additional HTTP headers in these requests as well, such as Accept. I suspect you're running into problems because HttpWebRequest by default handles redirects transparently for you, and when it does so it may not add the HTTP headers needed to impersonate a browser.
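As a rough illustration of setting browser-like headers, here is a sketch of a small factory for HttpWebRequest objects. The helper name BrowserLikeRequest and every header value below are assumptions made for illustration; the values the site actually requires would have to be confirmed by inspecting real browser traffic.

```csharp
using System;
using System.Net;

static class BrowserLikeRequest
{
    // All header values here are illustrative assumptions, not values known to be required.
    public static HttpWebRequest Create(Uri uri, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.CookieContainer = cookies;

        // Accept and User-Agent are restricted headers, so they are set through properties.
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.UserAgent = "Mozilla/5.0 (compatible; PostalCodeLookup/1.0)";

        // Unrestricted headers can be set through the Headers collection.
        request.Headers["Accept-Language"] = "en-CA,en;q=0.8";

        return request;
    }
}
```

Creating the request does not send anything; the headers go out only when the request stream or the response is requested.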

The solution is to set the HttpWebRequest's AllowAutoRedirect property to false and handle the redirect yourself. In other words, once the first request returns a redirection, pull the URL out of the Location header of the HttpWebResponse. Then create a new HttpWebRequest (this time a regular GET request, not a POST) for that URL. Remember to send the same cookie! (The CookieContainer class makes this very easy.)
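The manual redirect handling described above could look roughly like the following sketch. The names ManualRedirectSketch, ResolveLocation and PostThenFollow are hypothetical, and the caller is assumed to pass an already URL-encoded form body (for example "city=...&province=...") plus the CookieContainer shared with any earlier request.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class ManualRedirectSketch
{
    // A Location header may be absolute or relative; resolve it against the requested URL.
    public static Uri ResolveLocation(Uri requestUri, string location)
    {
        return new Uri(requestUri, location);
    }

    public static string PostThenFollow(Uri postUri, string formBody, CookieContainer cookies)
    {
        // First request: POST the form with automatic redirects switched off.
        var post = (HttpWebRequest)WebRequest.Create(postUri);
        post.Method = "POST";
        post.ContentType = "application/x-www-form-urlencoded";
        post.AllowAutoRedirect = false;
        post.CookieContainer = cookies;

        byte[] payload = Encoding.UTF8.GetBytes(formBody);
        post.ContentLength = payload.Length;
        using (Stream body = post.GetRequestStream())
            body.Write(payload, 0, payload.Length);

        Uri next;
        using (var response = (HttpWebResponse)post.GetResponse())
        {
            string location = response.Headers["Location"];
            if (location == null)
                throw new InvalidOperationException("Expected a redirect, but no Location header was returned.");
            next = ResolveLocation(postUri, location);
        }

        // Second request: a plain GET (no body) to the redirect target, with the same cookies.
        var get = (HttpWebRequest)WebRequest.Create(next);
        get.Method = "GET";
        get.CookieContainer = cookies;
        using (var response = (HttpWebResponse)get.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            return reader.ReadToEnd();
    }
}
```

Because the second request is a GET, no request stream is opened for it. Sending a body with GET is what triggers the "Unable to send a content body with this type of verb" exception quoted in the question.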

You also may need to make an additional request (#1 in my list above) in order to set up the session cookie. If I were you, I'd assume that this is required, simply to eliminate it as a problem, and try removing that step later and see if your solution still works.
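Putting the optional priming request together with the POST and the follow-up GET, a WebClient-based sketch in the style of the question's code might look like this. SessionWebClient and ThreeStepLookup are hypothetical names, and formUri and postUri stand for the form page and its POST target, which the caller must supply.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

// Hypothetical client: one shared cookie jar, and redirects are left to the caller.
class SessionWebClient : WebClient
{
    private readonly CookieContainer cookies = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        var http = request as HttpWebRequest;
        if (http != null)
        {
            http.CookieContainer = cookies;
            http.AllowAutoRedirect = false;
        }
        return request;
    }
}

static class ThreeStepLookup
{
    public static string Run(Uri formUri, Uri postUri, NameValueCollection values)
    {
        using (var client = new SessionWebClient())
        {
            client.Encoding = Encoding.UTF8;

            // Step 1: GET the form page so the server can set its session cookie.
            client.DownloadString(formUri);

            // Step 2: POST the form fields; the response is expected to be a redirect.
            client.UploadValues(postUri, "POST", values);
            string location = client.ResponseHeaders["Location"];
            if (location == null)
                throw new InvalidOperationException("No Location header in the POST response.");

            // Step 3: GET the redirect target. DownloadString sends no request body.
            return client.DownloadString(new Uri(postUri, location));
        }
    }
}
```

To check whether the priming request is really needed, comment out step 1 and see whether the final page still contains the postal code.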

You'll want to download and use Fiddler (www.fiddlertool.com) to help you with all this. Fiddler lets you watch the HTTP requests going over the wire and, via its request builder feature, create HTTP requests of your own so you can see which headers are actually required.