Folks,
I need to accomplish some sophisticated web crawling.
The goal in simple words: Login to a page, enter some values in some text fields, click Submit, then extract some values from the retrieved page.
What is the best approach?
- Some Unit testing 3rd party lib?
- Manual crawling in C#?
- Maybe there is a ready lib for that specifically?
- Any other approach?
This needs to be done within a web app.
Your help is highly appreciated.
Not sure how well it would work within a web application, but did you consider giving HtmlUnit a try? I think it should work fine since it's basically a headless web browser.
Steven Sanderson has a blog post about using HtmlUnit in .NET code.
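To give a feel for it, here is a rough sketch of driving HtmlUnit from C# once it has been compiled to a .NET assembly with IKVM (the approach from Sanderson's post). The method names follow HtmlUnit's Java API, and the URL and field names are placeholders — treat this as an outline to adapt, not a verified snippet:

```csharp
using System;
using com.gargoylesoftware.htmlunit;
using com.gargoylesoftware.htmlunit.html;

class HtmlUnitSketch
{
    static void Main()
    {
        // a headless browser session; URL below is a placeholder
        var client = new WebClient();
        var page = (HtmlPage)client.getPage("http://www.mywebsite.com/login");

        // fill in and submit the first form on the page
        // (field names "username" and "submit" are assumptions)
        HtmlForm form = (HtmlForm)page.getForms().get(0);
        ((HtmlTextInput)form.getInputByName("username")).setValueAttribute("me");
        var result = (HtmlPage)((HtmlSubmitInput)form.getInputByName("submit")).click();

        // extract values from the retrieved page
        Console.WriteLine(result.asText());
    }
}
```

The upside is that HtmlUnit executes JavaScript, which plain HTTP scraping does not.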
WatiN.
http://watin.sourceforge.net/
var browser = new IE();
browser.GoTo("http://www.mywebsite.com");
browser.TextField("username").TypeText("username goes here"); // alternatively, use .Value = if you don't need to simulate keystrokes.
browser.Button(Find.ById("submitButton")).Click();
and in your asserts on the return page:
Assert.AreEqual("You are logged in as Username.", browser.TextField("username").Value); // you can essentially check any HTML tag, I just used TextField for brevity.
Edit -
After reading the edit about doing this from within a web app, you might consider using WebRequest and the HTML Agility Pack to validate what you get back:
WebRequest:
http://msdn.microsoft.com/en-us/library/debx8sh9.aspx
HTML Agility Pack:
How to use HTML Agility pack
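Combining the two looks roughly like this — POST the login form with HttpWebRequest, then feed the response into the HTML Agility Pack to pull values out. The URL, field names, and the `status` div are all placeholders; substitute whatever your target page actually uses:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class LoginScrape
{
    static void Main()
    {
        // URL and form field names below are placeholders
        var request = (HttpWebRequest)WebRequest.Create("http://www.mywebsite.com/login");
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = new CookieContainer(); // keeps the session cookie

        byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret");
        using (Stream s = request.GetRequestStream())
        {
            s.Write(body, 0, body.Length);
        }

        // load the returned page straight into the Agility Pack
        var doc = new HtmlDocument();
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            doc.Load(response.GetResponseStream());
        }

        // pull a value out of the retrieved page with XPath
        HtmlNode status = doc.DocumentNode.SelectSingleNode("//div[@id='status']");
        Console.WriteLine(status != null ? status.InnerText : "status div not found");
    }
}
```

This works fine from inside a web app since there's no browser involved, but note it won't run any JavaScript on the page.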
I was going to say Selenium, but if you are going to do it internally I would probably use something like NUnit to write the tests and then run them from the web-app.
http://www.nunit.org/
Why within the web-app though? That's like crash testing a car within the car.
If you're looking for something more lightweight, try SimpleBrowser for .NET - open-sourced on GitHub.
https://github.com/axefrog/SimpleBrowser
Surprised HTMLAgilityPack wasn't mentioned. It's by far the simplest to use.
Crawling Websites with C# and Xpath
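In the spirit of that article, a minimal crawl with the Agility Pack and XPath can look like this. `HtmlWeb` fetches and parses in one step; the URL is a placeholder:

```csharp
using System;
using HtmlAgilityPack;

class Crawl
{
    static void Main()
    {
        // fetch and parse the page in one call (URL is a placeholder)
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.mywebsite.com/");

        // grab every link's href via XPath
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", ""));
            }
        }
    }
}
```

The null check matters: `SelectNodes` returns null rather than an empty collection when the XPath matches nothing.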
If you know what the form post values are supposed to be, going in and coming out, you could create an app in C# that uses HttpWebRequest to post to the page and parse the results. This code is highly specialized for my own use, but you should be able to tweak it to do what you want. It's actually part of a bigger class that lets you add post/get items to it and then submits an HTTP request for you.
// requires: using System; using System.IO; using System.Net; using System.Text;
// (_type, _accept, _contentType, _cookieContainer, etc. are fields of the containing class)
// build the query string for GET requests; POST sends postData in the body instead
Uri uri = _type == PostType.Post ? new Uri( url ) : new Uri( ( url + "?" + postData ).TrimEnd( '?' ) );
// create the request
HttpWebRequest request = (HttpWebRequest)WebRequest.Create( uri );
request.Accept = _accept;
request.ContentType = _contentType;
request.Method = _type == PostType.Post ? "POST" : "GET";
request.CookieContainer = _cookieContainer;
request.Referer = _referer;
request.AllowAutoRedirect = _allowRedirect;
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3";
// set the timeout to a big value like 2 minutes
request.Timeout = 120000;
// set our credentials
request.Credentials = CredentialCache.DefaultCredentials;
// if we have a proxy set its creds as well
if( request.Proxy != null )
{
request.Proxy.Credentials = CredentialCache.DefaultCredentials;
}
// write the raw request body if one was supplied
if( !String.IsNullOrEmpty( _body ) )
{
using( StreamWriter sw = new StreamWriter( request.GetRequestStream(), Encoding.ASCII ) )
{
sw.Write( _body );
}
}
if( _type == PostType.Post &&
String.IsNullOrEmpty( _body ) )
{
using( Stream writeStream = request.GetRequestStream() )
{
UTF8Encoding encoding = new UTF8Encoding();
byte[] bytes = encoding.GetBytes( postData );
writeStream.Write( bytes, 0, bytes.Length );
}
}
if( _headers.Count > 0 )
{
request.Headers.Add( _headers );
}
// we want to keep this open for a bit
using( HttpWebResponse response = (HttpWebResponse)request.GetResponse() )
{
// TODO: do something with the response
}
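For the TODO in that last block, the usual follow-up is to read the body into a string and hand it to whatever parser you prefer. A sketch of what goes inside the `using`:

```csharp
// inside the response using-block above (sketch)
using( HttpWebResponse response = (HttpWebResponse)request.GetResponse() )
{
    string html;
    using( StreamReader reader = new StreamReader( response.GetResponseStream() ) )
    {
        html = reader.ReadToEnd();
    }
    // hand the markup off to e.g. the HTML Agility Pack, or check the status
    Console.WriteLine( (int)response.StatusCode );
}
```

Checking `response.StatusCode` before parsing is cheap insurance — a login failure often comes back as a redirect or error page rather than an exception.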