How to crawl billions of pages? [closed]

Posted 2019-03-07 23:24

Question:

Is it possible to crawl billions of pages on a single server?

Answer 1:

Not if you want the data to be up to date.

Even a small player in the search game would number the pages crawled in the multiple billions.

" In 2006, Google has indexed over 25 billion web pages,[32] 400 million queries per day,[32] 1.3 billion images, and over one billion Usenet messages. " - Wikipedia

And remember that the quote mentions numbers from 2006. That is ancient history; the state of the art is well beyond that.

Freshness of content:

  1. New content is constantly being added at a very high rate (that's just reality)
  2. Existing pages often change - you'll need to recrawl for two reasons: a) to determine whether the page is dead, and b) to determine whether its content has changed.

Politeness of crawler:

  1. You can't overwhelm any one given site. If you are hitting any major site repeatedly from the same IP, you're likely to trigger either a CAPTCHA prompt or an IP block. Sites will do this based on bandwidth, frequency of requests, # of "bad" page requests, and all sorts of other things.
  2. There is a robots.txt protocol that sites expose to crawlers; obey it.
  3. There is a sitemap standard that sites expose to crawlers; use it to help you explore. You can also (if you choose) weight the relative importance of pages on the site, and use the time-to-live for your cache if the sitemap indicates one. (A minimal robots.txt check is sketched after this list.)
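A minimal sketch of the robots.txt point, using only Python's standard library. The user-agent string and URLs are placeholders, and a real crawler would cache the parsed robots.txt per host rather than refetch it for every URL:

```python
# Politeness check: consult a site's robots.txt before fetching a URL.
# Standard library only; the crawler name and URLs are hypothetical.
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler/0.1"  # placeholder user-agent string

def allowed_to_fetch(site_root: str, url: str) -> bool:
    rp = RobotFileParser()
    rp.set_url(urljoin(site_root, "/robots.txt"))
    rp.read()  # fetch and parse robots.txt (a real crawler caches this per host)
    if not rp.can_fetch(USER_AGENT, url):
        return False
    delay = rp.crawl_delay(USER_AGENT)  # honor Crawl-delay if the site declares one
    if delay:
        print(f"site requests {delay}s between hits")
    return True

print(allowed_to_fetch("https://example.com", "https://example.com/some/page"))
```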

Reduce the work you need to do:

  1. Often sites expose themselves under multiple names - you'll want to detect pages that are identical. This can happen at the same URL or at separate URLs. Consider a hash of the page contents (minus headers with dates/times that constantly change). Keep track of these page equivalencies and skip them next time, or work out whether there is a well-known mapping between the given sites so that you don't have to crawl both (a content-hash sketch follows this list).
  2. SPAM. Tons of people out there make tons of pages that are just pass-throughs to Google, but they "seed" them all over the web to get themselves crawled.
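For the duplicate-content point, a sketch of the hashing idea: strip the obviously volatile bits, hash what's left, and treat matching fingerprints as the same page regardless of URL. The normalization here is deliberately crude; a real crawler normalizes far more aggressively (session IDs, ads, boilerplate).

```python
# Detect duplicate pages by fingerprinting normalized content.
# Volatile parts (here: obvious date/time strings) are stripped before hashing.
import hashlib
import re

_seen_fingerprints: dict[str, str] = {}  # fingerprint -> first URL it was seen at

def fingerprint(html: str) -> str:
    # Crude normalization: drop date-like tokens, collapse whitespace, lowercase.
    normalized = re.sub(r"\d{1,4}[-/:]\d{1,2}[-/:]\d{1,4}", "", html)
    normalized = re.sub(r"\s+", " ", normalized).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_duplicate(url: str, html: str) -> bool:
    fp = fingerprint(html)
    if fp in _seen_fingerprints:
        return True  # same content already stored under another URL
    _seen_fingerprints[fp] = url
    return False

# Two "mirrors" of the same page that differ only by a timestamp:
print(is_duplicate("http://a.example/page", "<p>hello</p> 2019-03-07"))  # False
print(is_duplicate("http://b.example/page", "<p>hello</p> 2019-03-08"))  # True
```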

So - you're always in a cycle of crawling. Always. You'll almost certainly be running on several (many, many, many) machines, to ensure you can comply with politeness but still rock out on the freshness of data.

If you want to press the fast-forward button and just get to processing pages with your own unique algorithm, you could likely tap into a pre-built crawler if you need it quickly - think "80 legs" as highlighted in Programmable Web. They do it using client-side computing power.

80 legs uses machine cycles from kids playing games on web sites. Think of a background process on a web page that calls out and does work while you're using that page/site, without you knowing it, because the site is using the Plura technology stack.

“Plura Processing has developed a new and innovative technology for distributed computing. Our patent-pending technology can be embedded in any webpage. Visitors to these webpages become nodes and perform very small computations for the application running on our distributed computing network.” - Plura Demo Page

So they issue the "crawl" through thousands of nodes at thousands of IPs, staying polite to sites and crawling fast as a result. Now, I personally don't know that I care for that style of using the end user's browser unless it were called out VERY clearly on all of the sites using their technology - but it's an out-of-the-box approach if nothing else.

There are other crawlers written as community-driven projects that you could likely use as well.

As pointed out by other respondents: do the math. You'll need roughly 2,300 pages crawled per second to keep up with recrawling 1B pages every 5 days (a quick check follows). If you're willing to wait longer, the number goes down; if you're planning to crawl more than 1B pages, it goes up. Simple math.
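The arithmetic behind that figure, assuming the 1B-page / 5-day refresh cycle mentioned above:

```python
# Back-of-the-envelope crawl-rate math for the figures quoted above.
pages = 1_000_000_000          # 1B pages
refresh_days = 5               # desired recrawl window
seconds = refresh_days * 86_400
print(f"{pages / seconds:,.0f} pages/second")  # ~2,315 pages/second
```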

Good luck!



Answer 2:

Large scale spidering (a billion pages) is a difficult problem. Here are some of the issues:

  • Network bandwidth. Assuming that each page is 10 KB, you are talking about a total of 10 terabytes to fetch for a billion pages.

  • Network latency / slow servers / congestion mean that you are not going to achieve anything like the theoretical bandwidth of your network connection. Multi-threading your crawler only helps so much (a thread-pool sketch follows this list).

  • I assume that you need to store the information you have extracted from the billions of pages.

  • Your HTML parser needs to deal with web pages that are broken in all sorts of strange ways.

  • To avoid getting stuck in loops, you need to detect that you've "done this page already".

  • Pages change so you need to revisit them.

  • You need to deal with 'robots.txt' and other conventions that govern the behavior of (well-behaved) crawlers.
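For the latency point above, a minimal thread-pool fetcher using only the standard library: concurrency hides per-request latency, but slow or dead servers still dominate, which is why raw thread counts only help so much. The seed URLs, worker count, and user-agent string are placeholders.

```python
# Minimal concurrent fetcher: a thread pool hides per-request latency,
# but timeouts and slow servers still cap effective throughput.
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request

URLS = ["https://example.com/", "https://example.org/"]  # placeholder seed list

def fetch(url: str, timeout: float = 10.0) -> tuple[str, int]:
    req = urllib.request.Request(url, headers={"User-Agent": "MyCrawler/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = resp.read()
        return url, len(body)

with ThreadPoolExecutor(max_workers=50) as pool:  # far below real crawler scale
    futures = [pool.submit(fetch, u) for u in URLS]
    for fut in as_completed(futures):
        try:
            url, size = fut.result()
            print(f"{url}: {size} bytes")
        except Exception as exc:  # dead hosts, timeouts, bad TLS, broken HTML servers ...
            print(f"failed: {exc}")
```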



Answer 3:

The original paper by Page and Brin (Google, 1998) described crawling 25 million pages on 4 machines in 10 days. They kept 300 connections open at a time per machine. I think this is still pretty good. In my own experiments with off-the-shelf machines running Linux, I could reliably open 100-200 simultaneous connections.

There are three main things you need to do while crawling: (1) choose what to crawl next, (2) get those pages, and (3) store those pages. For (1) you need to implement some kind of priority queue (e.g., to do breadth-first search or OPIC), and you also need to keep track of where you have been. This can be done using a Bloom filter. Bloom filters (look them up on Wikipedia) can also be used to record whether a page had a robots.txt file and whether a prefix of a given URL is excluded.
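A toy Bloom filter to show the "have I seen this URL?" idea. The bit-array size and hash count below are arbitrary; a real crawler sizes them from the expected number of URLs and a target false-positive rate.

```python
# Toy Bloom filter for "have I seen this URL?" checks.
# False positives are possible (a URL may look seen when it isn't), but false
# negatives are not: the frontier may occasionally skip a genuinely new URL,
# yet it never re-adds one it has already processed.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 8 * 1024 * 1024, num_hashes: int = 5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("http://example.com/a")
print("http://example.com/a" in seen)  # True
print("http://example.com/b" in seen)  # almost certainly False
```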

(2) Getting the pages is a fixed cost and you can't do much about it; however, since on one machine you are limited by the number of open connections, if you have a cable connection you probably won't come close to eating all of the available bandwidth. You might have to worry about bandwidth caps, though.

(3) Storing the pages is typically done in a web archive file, like what the Internet Archive does. With compression, you can probably store a billion pages in 7 terabytes, so storage-wise it would be affordable to have a billion pages. As an estimate of what one machine can do, suppose you get a cheap $200 machine with 1 GB of RAM and a 160 GB hard drive. At 20 KB a page (use Range requests to avoid swallowing big pages whole), 10 million pages would take 200 GB, but compressed that is about 70 GB.
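A sketch of the two tricks in that estimate: capping each download with an HTTP Range header (servers are free to ignore it, so the cap is also enforced client-side) and gzip-compressing the stored bodies. The archive format below is a simple stand-in, not a real WARC writer.

```python
# Fetch at most ~20 KB of a page via a Range request, then append it
# gzip-compressed to a flat archive file (a stand-in for a real WARC writer).
import gzip
import urllib.request

MAX_BYTES = 20 * 1024  # ~20 KB cap per page, as estimated above

def fetch_capped(url: str) -> bytes:
    req = urllib.request.Request(url, headers={
        "User-Agent": "MyCrawler/0.1",
        "Range": f"bytes=0-{MAX_BYTES - 1}",  # servers may ignore this and send everything
    })
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read(MAX_BYTES)  # enforce the cap client-side as well

def archive(url: str, body: bytes, path: str = "crawl-archive.gz") -> None:
    record = f"URL: {url}\nLength: {len(body)}\n\n".encode() + body + b"\n"
    with gzip.open(path, "ab") as out:  # append as a new gzip member
        out.write(record)

if __name__ == "__main__":
    body = fetch_capped("https://example.com/")
    archive("https://example.com/", body)
```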

If you keep an archive that your search engine runs off of (on which you have already calculated, say, PageRank and BM25) plus an active crawl archive, then you've consumed 140 GB. That leaves you about 20 GB for the other random stuff you need to handle. If you work out the memory needed to keep as much of your priority queue and the Bloom filters in RAM as possible, you are also right at the edge of what is possible. If you crawl 300,000 pages/day, it will take you slightly over a month per 10-million-page crawl.



Answer 4:

Five years after the question was asked, I can answer yes.

And our crawling machine is not even very expensive anymore: it can be bought on eBay for about 3,000 Euro and contains 24 x 1 TB 2.5" disks (running as single disks), two 6-core Intel Xeons (making it 12 cores / 24 threads), and 96 GB of RAM, using a 10 Gbit line (at just the 33rd percentile) in a Luxembourg datacenter.

It uses 100,000 concurrent HTTP connections, which results in about 30,000 pages crawled per second.

Yes, computers are pretty fast today. And by the way, the main problem is URL handling and the detection of page duplicates (the same page reachable in various ways), not the network connection.



Answer 5:

Researchers at Texas A&M have created IRLbot, which is highly scalable and capable of crawling billions of web pages in a "short" amount of time (~7 days for 1 billion?) with few resources (i.e., a small number of PCs). The Texas A&M researchers provide the following statistics for their crawler:

We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.

You can read about the design and architecture of their crawler in their published paper, IRLbot: Scaling to 6 Billion Pages and Beyond, or in their full paper (very detailed).

However, the crawl rate is highly dependent on your bandwidth and the amount of data you're processing. From the statistics above, we can see that Texas A&M had about a 319 Mbps connection (about 100 times faster than your average US home connection), was processing about 22 kB of data per page, and was downloading 1,789 pages/second. If you were to run their crawler on your home connection, you could expect the following performance:

  • @3.9 Mbps (average speed for US residents) / 22 kB per page = ~22 pages per second: it would take about 526 days (~1.5 years) to download 1 billion pages.
  • @20 Mbps (upper end of home bandwidth) / 22 kB per page = ~116 pages per second: it would take about 100 days (~3 months) to download 1 billion pages (a quick check of these estimates follows).
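A quick check of those two estimates, reusing the 22 kB/page figure from the IRLbot numbers; small differences from the quoted values are just rounding:

```python
# Reproduce the per-connection estimates above: bandwidth -> pages/sec -> days per billion.
PAGE_KB = 22                    # average data per page, from the IRLbot figures
PAGES = 1_000_000_000

for mbps in (3.9, 20):
    pages_per_sec = (mbps * 1000 / 8) / PAGE_KB  # Mbps -> kB/s -> pages/s
    days = PAGES / pages_per_sec / 86_400
    print(f"{mbps:>5} Mbps: ~{pages_per_sec:.0f} pages/s, ~{days:.0f} days for 1B pages")
```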


Answer 6:

Hmm... if you can "crawl" 1 page per second, that totals 86,400 pages per day (about 11,574 days to finish your first billion; use this to calculate the time needed at your own pages-per-second speed). Patience is required... and of course the storage space.



Tags: web-crawler