
Using C# HttpClient to login on a website and scrape information

Published 2019-03-14 05:27

Question:

I am trying to use C# and the Chrome Web Inspector to log in to http://www.morningstar.com and retrieve some information from the page http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US.

I do not quite understand the mental process one must use to interpret the information from the Web Inspector in order to simulate a login, keep the session alive, and navigate to the next page to collect information.

Can someone explain this or point me to a resource?

For now, I have only some code to get the content of the home page and the login page:

using System;
using System.IO;
using System.IO.Compression;
using System.Net.Http;
using System.Threading.Tasks;

public class Morningstar
{
    // async Task (not async void) so callers can await it and observe exceptions
    public static async Task Run()
    {
        var url = "http://www.morningstar.com/";
        var httpClient = new HttpClient();

        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");

        // We asked for gzip above and nothing decompresses it for us,
        // so the response stream must be run through GZipStream by hand.
        var response = await httpClient.GetAsync(new Uri(url));
        response.EnsureSuccessStatusCode();
        using (var responseStream = await response.Content.ReadAsStreamAsync())
        using (var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress))
        using (var streamReader = new StreamReader(decompressedStream))
        {
            //Console.WriteLine(streamReader.ReadToEnd());
        }

        var loginURL = "https://members.morningstar.com/memberservice/login.aspx";
        response = await httpClient.GetAsync(new Uri(loginURL));
        response.EnsureSuccessStatusCode();
        // Note: if the server also gzips this response, it needs the same
        // GZipStream treatment as the request above.
        using (var responseStream = await response.Content.ReadAsStreamAsync())
        using (var streamReader = new StreamReader(responseStream))
        {
            Console.WriteLine(streamReader.ReadToEnd());
        }
    }
}
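
As an aside, HttpClientHandler can take care of the gzip/deflate decompression and the cookie storage automatically, which removes the manual GZipStream step. A minimal sketch using only standard System.Net/System.Net.Http APIs:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public class AutoDecompressExample
{
    public static async Task RunAsync()
    {
        var handler = new HttpClientHandler
        {
            // decompress gzip/deflate responses automatically
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
            // keep any cookies the server sets across requests
            CookieContainer = new CookieContainer(),
            UseCookies = true
        };
        var client = new HttpClient(handler);
        string html = await client.GetStringAsync("http://www.morningstar.com/");
        Console.WriteLine(html.Length);
    }
}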

EDIT: In the end, on the advice of Muhammed, I used the following piece of code:

        ScrapingBrowser browser = new ScrapingBrowser();

        //set UseDefaultCookiesParser to false if a website returns cookies in an invalid format
        //browser.UseDefaultCookiesParser = false;

        WebPage homePage = browser.NavigateToPage(new Uri("https://members.morningstar.com/memberservice/login.aspx"));

        PageWebForm form = homePage.FindFormById("memberLoginForm");
        form["email_textbox"] = "example@example.com";
        form["pwd_textbox"] = "password";
        form["go_button.x"] = "57";
        form["go_button.y"] = "22";
        form.Method = HttpVerb.Post;
        WebPage resultsPage = form.Submit();
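
From there, the same ScrapingBrowser keeps the session cookies, so reaching the target page is just another navigation. A hedged sketch of that follow-up (it needs using ScrapySharp.Extensions; and using HtmlAgilityPack;, and the CSS selector span.value is a placeholder, not Morningstar's real markup):

        // the authenticated session lives in `browser`, so this request carries the login cookies
        WebPage dataPage = browser.NavigateToPage(new Uri(
            "http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US"));

        // query the parsed HTML with CSS selectors (ScrapySharp extension over HtmlAgilityPack)
        foreach (HtmlNode node in dataPage.Html.CssSelect("span.value"))  // placeholder selector
        {
            Console.WriteLine(node.InnerText);
        }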

Answer 1:

You should simulate the website's login process. The easiest way to do this is to inspect the traffic with a web debugging proxy (for example, Fiddler).

Here is the login request of the web site:

POST https://members.morningstar.com/memberservice/login.aspx?CustId=&CType=&CName=&RememberMe=true&CookieTime= HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: https://members.morningstar.com/memberservice/login.aspx
** omitted **
Cookie: cookies=true; TestCookieExist=Exist; fp=001140581745182496; __utma=172984700.91600904.1405817457.1405817457.1405817457.1; __utmb=172984700.8.10.1405817457; __utmz=172984700.1405817457.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=172984700; ASP.NET_SessionId=b5bpepm3pftgoz55to3ql4me

email_textbox=test@email.com&pwd_textbox=password&remember=on&email_textbox2=&go_button.x=36&go_button.y=16&__LASTFOCUS=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=omitted&__EVENTVALIDATION=omited

When you inspect this, you'll see some cookies and form fields like "__VIEWSTATE". You'll need the actual values of these fields to log in. You can use the following steps:

  1. Make a request and scrape fields like "__LASTFOCUS", "__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE", "__EVENTVALIDATION", as well as the cookies.
  2. Create a new POST request to the same page; use the CookieContainer from the previous one, and build a post string from the scraped fields, username, and password. Post it with MIME type application/x-www-form-urlencoded.
  3. If successful, use the cookies for further requests to stay logged in.

Note: You can use HtmlAgilityPack or ScrapySharp to scrape the HTML. ScrapySharp provides easy-to-use tools for posting forms and browsing websites.
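
Putting those three steps together, here is a minimal sketch using HttpClient plus HtmlAgilityPack. The form field names come from the captured request above; the XPath lookup and the credentials are illustrative, and the live page's markup may differ:

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class LoginSketch
{
    public static async Task RunAsync()
    {
        var loginUrl = "https://members.morningstar.com/memberservice/login.aspx";
        var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
        var client = new HttpClient(handler);

        // Step 1: GET the login page; the handler stores the session cookies,
        // and we scrape the hidden ASP.NET fields out of the form.
        var doc = new HtmlDocument();
        doc.LoadHtml(await client.GetStringAsync(loginUrl));
        string Hidden(string name) =>
            doc.DocumentNode.SelectSingleNode("//input[@name='" + name + "']")
               ?.GetAttributeValue("value", "") ?? "";

        // Step 2: POST the scraped fields back along with the credentials,
        // encoded as application/x-www-form-urlencoded.
        var form = new Dictionary<string, string>
        {
            ["__LASTFOCUS"]       = Hidden("__LASTFOCUS"),
            ["__EVENTTARGET"]     = Hidden("__EVENTTARGET"),
            ["__EVENTARGUMENT"]   = Hidden("__EVENTARGUMENT"),
            ["__VIEWSTATE"]       = Hidden("__VIEWSTATE"),
            ["__EVENTVALIDATION"] = Hidden("__EVENTVALIDATION"),
            ["email_textbox"]     = "test@email.com",
            ["pwd_textbox"]       = "password",
            ["remember"]          = "on",
            ["go_button.x"]       = "36",
            ["go_button.y"]       = "16",
        };
        var response = await client.PostAsync(loginUrl, new FormUrlEncodedContent(form));
        response.EnsureSuccessStatusCode();

        // Step 3: the CookieContainer now holds the authenticated session,
        // so further requests through `client` stay logged in.
        var page = await client.GetStringAsync(
            "http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US");
        Console.WriteLine(page.Length);
    }
}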



Answer 2:

The mental process is to simulate a person logging in to the website. Some logins are made with AJAX, others with a traditional POST request, so the first thing you need to do is make that request the way the browser does. From the server's response you will get cookies, headers, and other information, and you need to use that info to build the next request; these are the scraping requests.

Steps are:

1) Build a request, like the browser does, to authenticate yourself to the app.
2) Inspect the response, and save the headers, cookies, or other useful info you need to persist your session with the server.
3) Make another request to the server, using the info you gathered in the previous step.
4) Inspect the response, and use a data-analysis algorithm or something else to extract the data.
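
A small sketch of steps 1–3 with the plain HttpClient stack: one handler and one cookie jar, reused for every request so the session survives between calls. The login GET here is a stand-in; a real login would POST the credentials as shown in the other answer:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public class SessionSketch
{
    public static async Task RunAsync()
    {
        var jar = new CookieContainer();
        var handler = new HttpClientHandler { CookieContainer = jar };
        var client = new HttpClient(handler);

        // step 1: the authentication request (GET here for brevity)
        var uri = new Uri("https://members.morningstar.com/memberservice/login.aspx");
        await client.GetAsync(uri);

        // step 2: inspect what the server stored in the jar
        foreach (Cookie c in jar.GetCookies(uri))
            Console.WriteLine(c.Name + " = " + c.Value);

        // step 3: any further request through `client` carries those cookies
    }
}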

Tips:

You are not using a JavaScript engine here. Some websites use JavaScript to render graphs or perform interactions in the DOM; in those cases you may need a WebKit library wrapper, or some other way to run a real browser engine.
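
For example, one common substitute (my assumption, not something this answer prescribes) is Selenium WebDriver driving a real Chrome instance, so the site's JavaScript actually runs before you read the DOM:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public class JsPageSketch
{
    public static void Main()
    {
        // drives a real Chrome instance, so the site's JavaScript actually runs
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(
                "http://financials.morningstar.com/income-statement/is.html?t=BTDPF");
            // once the scripts have rendered the page, read the DOM as usual
            Console.WriteLine(driver.PageSource.Length);
        }
    }
}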