How do web spiders differ from Wget's spider?

Posted 2019-01-23 19:19

Question:

The following sentence in Wget's manual caught my eye:

wget --spider --force-html -i bookmarks.html

This feature needs much more work for Wget to get close to the functionality of real web spiders.

I found the following lines of code relevant to the spider option in wget:

src/ftp.c
780:      /* If we're in spider mode, don't really retrieve anything.  The
784:      if (opt.spider)
889:  if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227:      if (!opt.spider)
1239:      if (!opt.spider)
1268:      else if (!opt.spider)
1827:          if (opt.htmlify && !opt.spider)

src/http.c
64:#include "spider.h"
2405:  /* Skip preliminary HEAD request if we're not in spider mode AND
2407:  if (!opt.spider
2428:      if (opt.spider && !got_head)
2456:      /* Default document type is empty.  However, if spider mode is
2570:           * spider mode.  */
2571:          else if (opt.spider)
2661:              if (opt.spider)

src/res.c
543:  int saved_sp_val = opt.spider;
548:  opt.spider       = false;
551:  opt.spider       = saved_sp_val;  

src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)

src/spider.h
1:/* Declarations for spider.c

src/recur.c
52:#include "spider.h"
279:      if (opt.spider)
366:              || opt.spider /* opt.recursive is implicitely true */
370:             (otherwise unneeded because of --spider or rejected by -R) 
375:                   (opt.spider ? "--spider" : 
378:                     (opt.delete_after || opt.spider
440:      if (opt.spider) 

src/options.h
62:  bool spider;           /* Is Wget in spider mode? */

src/init.c
238:  { "spider",           &opt.spider,            cmd_boolean },

src/main.c
56:#include "spider.h"
238:    { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435:       --spider                  don't download anything.\n"),
1045:  if (opt.recursive && opt.spider)

I would like to see the differences in code, not abstractly. I love code examples.

How do web spiders differ from Wget's spider in code?

Answer 1:

A real spider is a lot of work

Writing a spider for the whole WWW is quite a task --- you have to take care of many "little details" such as:

  • Each spider computer should receive data from a few thousand servers in parallel in order to make efficient use of the connection bandwidth (asynchronous socket I/O).
  • You need several computers that spider in parallel in order to cover the vast amount of information on the WWW (clustering; partitioning the work).
  • You need to be polite to the spidered web sites:
    • Respect the robots.txt files (a minimal check is sketched after this list).
    • Don't fetch a lot of information too quickly: this overloads the servers.
    • Don't fetch files that you really don't need (e.g. iso disk images; tgz packages for software download...).
  • You have to deal with cookies/session ids: Many sites attach unique session ids to URLs to identify client sessions. Each time you arrive at the site, you get a new session id and a new virtual world of pages (with the same content). Because of such problems, early search engines ignored dynamic content. Modern search engines have learned what the problems are and how to deal with them.
  • You have to detect and ignore troublesome data: connections that provide a seemingly infinite amount of data or connections that are too slow to finish.
  • Besides following links, you may want to parse sitemaps to get URLs of pages.
  • You may want to evaluate which information is important to you and changes frequently, so that it can be refreshed more often than other pages. Note: A spider for the whole WWW receives a lot of data --- you pay for that bandwidth. You may want to use HTTP HEAD requests to guess whether a page has changed or not.
  • Besides receiving, you want to process the information and store it. Google builds indices that list, for each word, the pages that contain it. You may need separate storage computers and an infrastructure to connect them. Traditional relational databases don't keep up with the data volume and performance requirements of storing/indexing the whole WWW.
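
To make the politeness bullet concrete, here is a minimal sketch (Python, standard library only) of checking robots.txt before fetching a URL. The user-agent name "MySpider" and the example URL are placeholders, and a real spider would cache the parsed robots.txt per host instead of re-fetching it for every URL:

from urllib import robotparser
from urllib.parse import urlsplit, urlunsplit

def allowed_by_robots(url, user_agent="MySpider"):
    """Return True if the host's robots.txt permits fetching this URL."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                                   # fetch and parse robots.txt
    return rp.can_fetch(user_agent, url)

print(allowed_by_robots("https://www.example.com/some/page.html"))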

This is a lot of work. But if your target is more modest than reading the whole WWW, you may skip some of the parts. If you just want to download a copy of a wiki, etc., you get down to the specs of wget.

Note: If you don't believe that it's so much work, you may want to read up on how Google re-invented most of the computing wheels (on top of the basic Linux kernel) to build good spiders. Even if you cut a lot of corners, it's a lot of work.

Let me add a few more technical remarks on three points:

Parallel connections / asynchronous socket communication

You could run several spider programs in parallel processes or threads. But you need about 5000-10000 parallel connections in order to make good use of your network connection, and that many parallel processes/threads produce too much overhead.

A better solution is asynchronous input/output: process about 1000 parallel connections in a single thread by opening the sockets in non-blocking mode and using epoll or select to process just those connections that have received data. Since kernel 2.4, Linux has excellent support for this kind of scalability, continuously improved in later versions (I also recommend that you study memory-mapped files).
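
As a rough illustration of that approach, here is a toy single-threaded fetcher in Python using the standard selectors module (which uses epoll on Linux). The host names are placeholders, error handling is omitted, and a real spider would also resolve DNS asynchronously and juggle thousands of sockets rather than three:

import selectors
import socket

HOSTS = ["example.com", "example.org", "example.net"]      # placeholder targets
sel = selectors.DefaultSelector()

for host in HOSTS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)
    s.connect_ex((host, 80))        # non-blocking connect returns immediately
                                    # (the DNS lookup here still blocks, though)
    request = ("GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host).encode()
    # Wait for writability first (connection established), then switch to reading.
    sel.register(s, selectors.EVENT_WRITE,
                 {"host": host, "request": request, "data": b""})

open_conns = len(HOSTS)
while open_conns:
    ready = sel.select(timeout=10)
    if not ready:
        break                                   # nothing ready for 10 s: give up
    for key, events in ready:
        sock, state = key.fileobj, key.data
        if events & selectors.EVENT_WRITE:
            sock.send(state["request"])         # tiny request: ignore partial sends
            sel.modify(sock, selectors.EVENT_READ, state)
        elif events & selectors.EVENT_READ:
            chunk = sock.recv(4096)
            if chunk:
                state["data"] += chunk
            else:                               # server closed: response complete
                print(state["host"], len(state["data"]), "bytes")
                sel.unregister(sock)
                sock.close()
                open_conns -= 1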

Note: Using asynchronous I/O helps much more than using a "fast language": it's better to run one epoll-driven process handling 1000 connections written in Perl than to run 1000 processes written in C. If you do it right, you can saturate a 100 Mbit connection with processes written in Perl.

From the original answer: The downside of this approach is that you will have to implement the HTTP specification yourself in an asynchronous form (I am not aware of a reusable library that does this). It's much easier to do this with the simpler HTTP/1.0 protocol than with the modern HTTP/1.1 protocol. You probably would not benefit from the advantages of HTTP/1.1 for normal browsers anyhow, so this may be a good place to save some development costs.
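
To illustrate why HTTP/1.0 is the easier target, here is a sketch of a complete HTTP/1.0 GET over a plain Python socket: one request per connection, and the body simply ends when the server closes the socket, so there is no chunked transfer coding or persistent-connection bookkeeping to implement. The host name is a placeholder, and redirects and error responses are ignored:

import socket

def http10_get(host, path="/"):
    with socket.create_connection((host, 80), timeout=10) as s:
        s.sendall(("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode())
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:                       # EOF marks the end of the response
                break
            data += chunk
    headers, _, body = data.partition(b"\r\n\r\n")
    return headers.decode("iso-8859-1"), body

headers, body = http10_get("example.com")
print(headers.splitlines()[0], "-", len(body), "body bytes")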

Edit five years later: Today, there is a lot of free/open source technology available to help you with this work. I personally like the asynchronous http implementation of node.js --- it saves you all the work mentioned in the above original paragraph. Of course, today there are also a lot of modules readily available for the other components that you need in your spider. Note, however, that the quality of third-party modules may vary considerably. You have to check out whatever you use. [Aging info:] Recently, I wrote a spider using node.js and I found the reliability of npm modules for HTML processing (link and data extraction) insufficient. For this job, I "outsourced" the processing to a process written in another programming language. But things are changing quickly, and by the time you read this comment, this problem may already be a thing of the past...

Partitioning the work over several servers

One computer can't keep up with spidering the whole WWW. You need to distribute your work over several servers and exchange information between them. I suggest assigning certain "ranges of domain names" to each server: keep a central database of domain names with a reference to a spider computer.

Extract URLs from received web pages in batches: sort them according to their domain names, remove duplicates, and send them to the responsible spider computer. On that computer, keep an index of URLs that have already been fetched, and fetch the remaining URLs.
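
A minimal sketch of that routing step in Python: hash the host name to pick the responsible spider machine and drop URLs that have already been seen. The cluster size of 8 and the choice of SHA-1 are arbitrary assumptions, and in a real system the "already fetched" index lives on the responsible machine itself:

import hashlib
from collections import defaultdict
from urllib.parse import urlsplit

NUM_SPIDER_MACHINES = 8                 # assumed cluster size

def responsible_machine(url):
    """Map a URL's host name to the index of the spider machine that owns it."""
    host = urlsplit(url).netloc.lower()
    return hashlib.sha1(host.encode()).digest()[0] % NUM_SPIDER_MACHINES

def route_batch(urls, already_fetched):
    """Group new, de-duplicated URLs by the machine that should fetch them."""
    batches = defaultdict(set)
    for url in urls:
        if url not in already_fetched:
            batches[responsible_machine(url)].add(url)
    return batches

print(route_batch(["http://example.com/a", "http://example.org/b", "http://example.com/a"],
                  already_fetched={"http://example.org/b"}))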

If you keep a queue of URLs waiting to be fetched on each spider computer, you will have no performance bottlenecks. But it's quite a lot of programming to implement this.

Read the standards

I mentioned several standards (HTTP/1.x, Robots.txt, Cookies). Take your time to read them and implement them. If you just follow examples of sites that you know, you will make mistakes (forgetting parts of the standard that are not relevant to your samples) and cause trouble for those sites that use these additional features.

It's a pain to read the HTTP/1.1 standard document. But all the little details got added to it because somebody really needs that little detail and now uses it.



Answer 2:

I am not sure exactly what the original author of the comment was referring to, but I can guess that wget is slow as a spider, since it appears to use only a single thread of execution (at least judging by what you have shown).

"Real" spiders such as heritrix use a lot of parallelism and tricks to optimize their crawling speed, while simultaneously being nice to the website they are crawling. This typically means limiting hits to one site at a rate of 1 per second (or so), and crawling multiple websites at the same time.

Again this is all just a guess based on what I know of spiders in general, and what you posted here.



Answer 3:

Unfortunately, many of the better-known 'real' web spiders are closed-source, and indeed closed-binary. However, there are a number of basic techniques wget is missing:

  • Parallelism; you're never going to be able to keep up with the entire web without retrieving multiple pages at a time
  • Prioritization; some pages are more important to spider than others
  • Rate limiting; you'll be banned quickly if you keep pulling down pages as quickly as you can
  • Saving to something other than a local filesystem; the Web is big enough that it's not going to fit in a single directory tree
  • Rechecking pages periodically without restarting the entire process; in practice, with a real spider you'll want to recheck 'important' pages for updates frequently, while less interesting pages can go for months (a minimal scheduling sketch follows this list).
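
Here is a minimal sketch, in Python, of the recheck-scheduling idea from the last bullet: a heap of (due time, URL) entries where the recheck interval depends on how important or volatile we judge the page to be. The interval values are made-up placeholders:

import heapq
import time

RECHECK_INTERVAL = {"high": 3600, "normal": 86400, "low": 30 * 86400}  # seconds (assumed)

class RecrawlScheduler:
    def __init__(self):
        self.heap = []                  # entries are (due_time, url, importance)

    def schedule(self, url, importance="normal", delay=0.0):
        heapq.heappush(self.heap, (time.time() + delay, url, importance))

    def due_urls(self):
        """Yield every URL whose recheck time has arrived and put it back on
        the heap with a delay that depends on its importance."""
        now = time.time()
        while self.heap and self.heap[0][0] <= now:
            _, url, importance = heapq.heappop(self.heap)
            self.schedule(url, importance, RECHECK_INTERVAL[importance])
            yield url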

There are also various other inputs that can be used, such as sitemaps and the like. The point is, wget isn't designed to spider the entire web, and it's not really something that can be captured in a small code sample, because the problem lies in the whole overall technique being used rather than in any single small subroutine being wrong for the task.
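
As an example of such an extra input, here is a sketch of pulling page URLs out of an XML sitemap with the Python standard library. The sitemap URL is a placeholder, and gzip-compressed sitemaps and sitemap index files are not handled:

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    """Return the <loc> entries of a plain XML sitemap."""
    with urllib.request.urlopen(sitemap_url, timeout=10) as response:
        tree = ET.parse(response)
    return [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc")]

# urls = sitemap_urls("https://www.example.com/sitemap.xml")    # placeholder URL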



Answer 4:

I'm not going to go into the details of how to spider the whole internet; I think the wget comment refers to spidering a single website, which is still a serious challenge.

  • As a spider you need to figure out when to stop, and not go into recursive crawls just because a URL changed (e.g. date=1/1/1900 to 1/2/1900 and so on).
  • An even bigger challenge is sorting out URL rewriting (I have no clue whatsoever how Google or anyone else handles this). It's a pretty big challenge to crawl enough but not too much. And how can one automatically recognize URL rewriting with random parameters and random changes in the content?
  • You need to parse Flash / JavaScript, at least to some level.
  • You need to consider some crazy HTTP issues like the base tag. Even parsing the HTML is not easy, considering that most websites are not valid XHTML and browsers are very flexible about the syntax (see the sketch after this list).
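
To show the base-tag point concretely, here is a sketch of link extraction with Python's deliberately lenient html.parser, resolving every link against the <base href> if the page declares one. The URLs are placeholders, and JavaScript-generated links are ignored entirely:

from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url        # default base is the page's own URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base_url = attrs["href"]       # <base href> overrides the page URL
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))

parser = LinkExtractor("http://example.com/dir/page.html")
parser.feed('<base href="http://example.com/other/"><a href="x.html">link</a>')
print(parser.links)                     # ['http://example.com/other/x.html']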

I don't know how many of these are implemented or considered in wget, but you might want to take a look at httrack to understand the challenges of this task.

I'd love to give you some code examples, but this is a big task and a decent spider will be about 5000 LOC without third-party libraries.

Some of these have already been explained by @yaakov-belch, so I'm not going to type them again.