I am writing an application in C# to measure how fast I can download web pages. I supply a list of unique domain names, spawn X threads, and issue HttpWebRequests until the list of domains has been consumed. The problem is that no matter how many threads I use, I only get about 3 pages per second.
I discovered that System.Net.ServicePointManager.DefaultConnectionLimit is 2, but I was under the impression that this limit applies to the number of connections per domain. Since each domain in the list is unique, this should not be an issue.
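For reference, this is the setting I mean. A minimal sketch of raising it, where the helper name ConnectionLimitSetup and the value 100 are arbitrary choices for illustration, not something taken from my application:

```csharp
using System;
using System.Net;

public static class ConnectionLimitSetup
{
    // Hypothetical helper: call once at startup, before the first request is created.
    public static int Raise(int limit)
    {
        // Only ServicePoint objects created after this assignment pick up the new value;
        // the limit counts connections per host, not across the whole process.
        ServicePointManager.DefaultConnectionLimit = limit;
        return ServicePointManager.DefaultConnectionLimit;
    }
}
```

Calling ConnectionLimitSetup.Raise(100) at the top of Main would be enough to rule this setting in or out as the bottleneck.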
Then I found that the GetResponse() method blocks access from all other processes until the WebResponse is closed: http://www.codeproject.com/KB/IP/Crawler.aspx#WebRequest. I have not found any other information on the web to back this claim up. However, I implemented an HTTP request using raw sockets and noticed a significant speedup (4x to 6x).
So my questions are: does anyone know exactly how HttpWebRequest objects work? Is there a workaround besides the one mentioned above? And are there any examples of high-speed web crawlers written in C# anywhere?
Have you tried using the async methods, such as BeginGetResponse()?
If you're using .NET 4.0 you may want to try this code. Essentially, I use Tasks to make 1000 requests to a specific site. I use this to load test an app on my dev machine, and I see no such limit: my app receives these requests in rapid succession.
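using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;
using System.Windows.Forms;

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void button1_Click(object sender, EventArgs e)
    {
        // Start 1000 requests to the URL in the text box without waiting on any of them.
        for (int i = 0; i < 1000; i++)
        {
            var webRequest = WebRequest.Create(textBox1.Text);
            webRequest.GetResponseAsync().ContinueWith(t =>
            {
                if (t.Exception == null)
                {
                    // Dispose the response along with the reader once the body has been read.
                    using (var response = t.Result)
                    using (var sr = new StreamReader(response.GetResponseStream()))
                    {
                        string str = sr.ReadToEnd();
                    }
                }
                else
                {
                    System.Diagnostics.Debug.WriteLine(t.Exception.InnerException.Message);
                }
            });
        }
    }
}

public static class WebRequestExtensions
{
    // Wraps the BeginGetResponse/EndGetResponse pair in a Task.
    public static Task<WebResponse> GetResponseAsync(this WebRequest request)
    {
        return Task.Factory.FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null);
    }
}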
Since the workload here is I/O bound, spawning threads to get the job done is not required and could in fact hurt performance. The Async methods on the WebClient class use I/O completion ports, so they will be much more performant and less resource hungry.
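As a rough illustration of the WebClient route, here is a minimal sketch. The WebClientFetcher class and the onDone callback are names I made up for the example; DownloadStringAsync and DownloadStringCompleted are the actual WebClient members. A WebClient instance does not support concurrent operations, so the sketch creates one per request:

```csharp
using System;
using System.Net;

public static class WebClientFetcher
{
    // Hypothetical helper: starts the download and returns immediately.
    // onDone receives (address, body, error); body is null when the download failed.
    public static void Fetch(Uri address, Action<Uri, string, Exception> onDone)
    {
        var client = new WebClient();
        client.DownloadStringCompleted += (sender, e) =>
        {
            // e.Result throws if the operation failed or was cancelled, so check first.
            string body = (e.Error == null && !e.Cancelled) ? e.Result : null;
            client.Dispose();
            onDone(address, body, e.Error);
        };
        client.DownloadStringAsync(address);
    }
}
```

No thread sits waiting per download here; the completion handler only runs once the response has arrived.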
You should be using the BeginGetResponse method which doesn't block and is asynchronous.
When dealing with I/O-bound asynchrony, spawning a thread to do the I/O work doesn't help: that thread will still be blocked waiting for the hardware (in this case the network card) to respond. If you use the built-in BeginGetResponse, the thread just queues the request up on the network card and is then available to do more work. When the hardware is done, it notifies you, at which point your callback is called.
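To make the callback flow concrete, here is a minimal sketch. The CallbackFetcher class and the onDone delegate are hypothetical names for illustration; BeginGetResponse and EndGetResponse are the real WebRequest methods. Note that the body is still read synchronously inside the callback, which keeps the example short but is not fully asynchronous:

```csharp
using System;
using System.IO;
using System.Net;

public static class CallbackFetcher
{
    // Hypothetical helper: queues the request and returns without waiting for the response.
    // onDone receives (url, body, error) on a thread-pool thread; body is null on failure.
    public static void Fetch(string url, Action<string, string, Exception> onDone)
    {
        WebRequest request = WebRequest.Create(url);
        request.BeginGetResponse(asyncResult =>
        {
            string body = null;
            Exception error = null;
            try
            {
                // EndGetResponse rethrows any failure that happened during the request.
                using (WebResponse response = request.EndGetResponse(asyncResult))
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    body = reader.ReadToEnd();
                }
            }
            catch (Exception ex)
            {
                error = ex;
            }
            onDone(url, body, error);
        }, null);
    }
}
```

A crawler could call CallbackFetcher.Fetch once per domain in a plain loop and count completed pages inside onDone, with no dedicated thread per request.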
I would like to note that the BeginGetResponse method isn't completely asynchronous (from MSDN):
The BeginGetResponse method requires some synchronous setup tasks to complete (DNS resolution, proxy detection, and TCP socket connection, for example) before this method becomes asynchronous. As a result, this method should never be called on a user interface (UI) thread because it might take some time, typically several seconds.
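One way to keep that synchronous setup off the calling thread is to start the request from a thread-pool task. This is only a sketch under my own assumptions: the OffThreadRequests class and GetResponseOffThread method are made-up names, and I have not measured how much this helps in practice.

```csharp
using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

public static class OffThreadRequests
{
    // Hypothetical helper: runs BeginGetResponse (including its synchronous setup)
    // on a thread-pool thread instead of on the caller's thread.
    public static Task<WebResponse> GetResponseOffThread(WebRequest request)
    {
        return Task.Factory.StartNew(
            () => Task.Factory.FromAsync<WebResponse>(
                request.BeginGetResponse, request.EndGetResponse, null),
            CancellationToken.None,
            TaskCreationOptions.None,
            TaskScheduler.Default) // force the thread pool even if called from a UI-scheduled task
            .Unwrap();
    }
}
```

The returned task completes when the response headers have arrived; the caller is still responsible for disposing the WebResponse.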