I need to index a whole lot of webpages. What good web crawler utilities are there? I'm preferably after something that .NET can talk to, but that's not a showstopper.
What I really need is something that I can give a site URL to, and it will follow every link and store the content for indexing.
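(To make it concrete, the behaviour I'm after is roughly the toy loop sketched below. It's written in Java using jsoup, but that's purely for illustration; neither jsoup nor the `saveForIndexing` helper is something I've settled on.)

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Start from one seed URL, follow every link that stays on the same site,
// and hand each page's content to some store for later indexing.
public class TinyCrawler {
    public static void main(String[] args) throws Exception {
        String seed = "http://www.example.com/";   // the site URL I'd supply
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        queue.add(seed);
        seen.add(seed);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            Document doc = Jsoup.connect(url).get();   // fetch and parse the page
            saveForIndexing(url, doc.text());           // store content for the indexer

            for (Element link : doc.select("a[href]")) { // follow every link
                String next = link.absUrl("href");
                if (next.startsWith(seed) && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
    }

    // Placeholder for whatever indexing store ends up being used.
    static void saveForIndexing(String url, String text) {
        System.out.println("Stored " + url + " (" + text.length() + " chars)");
    }
}
```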
HTTrack -- http://www.httrack.com/ -- is a very good website copier. It works pretty well; I have been using it for a long time.
Nutch is a web crawler (a crawler is the type of program you're looking for) -- http://lucene.apache.org/nutch/ -- and it is built on Lucene, a top-notch search library.
Crawler4j is an open-source Java crawler which provides a simple interface for crawling the web. You can set up a multi-threaded web crawler in five minutes.
You can set your own filter to decide which pages (URLs) to visit, and define an operation to run on each crawled page according to your own logic (a rough sketch follows the list below).
Some reasons to choose crawler4j:
- Multi-threaded structure
- You can set the depth to be crawled
- It is Java-based and open source
- Control for redundant links (URLs)
- You can set the number of pages to be crawled
- You can set the maximum page size to be downloaded
- Adequate documentation
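As an illustration, here's roughly what a crawler4j setup looks like: one class extending `WebCrawler` for the filter and the per-page operation, and a `main` that configures the `CrawlController`. It's based on the crawler4j API as documented, but exact method signatures vary between versions and the domain filter and storage hook here are just placeholders, so treat it as an outline rather than copy-paste code.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

// A crawler that stays on one site and hands each page's text to your indexer.
public class SiteCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Your filter: only follow links that stay on the seed domain (placeholder).
        return url.getURL().toLowerCase().startsWith("http://www.example.com/");
    }

    @Override
    public void visit(Page page) {
        // Your per-page operation: extract the text and store it for indexing.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            String text = html.getText();               // visible text of the page
            String url = page.getWebURL().getURL();
            // store(url, text);  // hypothetical hook into your own index
            System.out.println("Crawled: " + url + " (" + text.length() + " chars)");
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");   // intermediate crawl data
        config.setMaxDepthOfCrawling(5);              // depth limit
        config.setMaxPagesToFetch(1000);              // page-count limit

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("http://www.example.com/");
        controller.start(SiteCrawler.class, 4);       // 4 crawler threads
    }
}
```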
Searcharoo.NET contains a spider that crawls and indexes content, and a search engine to query it. You should be able to find your way around the Searcharoo.Indexer.EXE code to trap the content as it's downloaded, and add your own custom code from there...
It's very basic (all the source code is included and is explained in six CodeProject articles, the most recent of which is Searcharoo v6): the spider follows links, imagemaps, and images, obeys ROBOTS directives, and parses some non-HTML file types. It is intended for single websites (not the entire web).
Nutch/Lucene is almost certainly a more robust/commercial-grade solution - but I have not looked at their code. I'm not sure what you're trying to accomplish, but have you also seen Microsoft Search Server Express?
Disclaimer: I am the author of Searcharoo; just offering it here as an option.
Sphider is pretty good. It's PHP, but it might be of some help.
I use Mozenda's web scraping software. You could easily have it crawl all of the links and grab all of the information you need, and it's great software for the money.
I haven't used this yet, but it looks interesting. The author wrote it from scratch and posted how he did it. The code is available for download as well.