I am doing a webcrawler and using threads to download pages.
The first limiting factor to the performance of my program is the bandwidth, I can never download more pages that it can get.
The second thing is what I interested. I am using threads to download many pages at same time, but as I create more threads, more sharing of processor occurs. Is there some metric/way/class of tests to determine what is the ideal number of threads or if after certain number, the performance doesn't change or decrease?
I say use something like Akka manage the threads for u. Use Jersey http client lib with non blocking IO which works with callback if i remember correctly. It's possibly the ideal setting for that type of tasks.
we've developped a multithreaded parrallel web crawler. Benchmarking troughput is the best way to get ideas on how the beast will handle his job. For a dedicated java server, one thread per core is a base to start, then the I/O comes into play and change.
Performances do decrease after certain number of threads. But it depends on the site you crawl too, on the OS you use, etc. Try to find a site with a merely constant response time to do your first benchmarks (like Google, but take differents services)
With slow websites, higher number of threads tends to compensate i/o blocking
Have a look at my answer in this thread
How to find out the optimal amount of threads?
Your example will likely be CPU bound, so you need a way to work out the contention to be able to work out the right number of threads on your box to use and be able to keep them all busy. Profiling will help there but remember it'll depend on the number of cores (as well as the network latency already mentioned etc) so use the runtime to get the number of cores when wiring up your thread pool size.
No quick answer I'm afraid, there will be an element of test, measure, adjust, repeat I'm afraid!
The ideal number of thread should be close to the number of cores (virtual cores) your hardware provides. This is to avoid thread context switching and thread scheduling. If you're doing heavy IO operations with many blocking reads (your thread blocks on a socket read) I suggest you redesign your code to use non-blocking IO APIs. Typically this will involve one "selector" thread that will monitor the activity of thousands of sockets and a small number of worker threads that will do the processing. If you code is in Java, the APIs are NIO. The only blocking call will be when you call
selector.select()
and it will only block if there is nothing to be processed on any of the thousands of sockets. Event-driven frameworks such as netty.io use this model and have proven to be very scalable and to best use the hardware resources of the system.