How does a web crawler work?

Posted 2019-09-09 11:45

Using some basic website scraping, I am trying to build a price-comparison database that will make searching easier for users. I have a couple of questions:

Should I use file_get_contents() or curl to get the contents of the required web page?

$link = "http://xyz.com";
$res55 = curl_init($link);
curl_setopt($res55, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
curl_setopt($res55, CURLOPT_FOLLOWLOCATION, true); // follow any HTTP redirects
$result = curl_exec($res55);
curl_close($res55);
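For comparison, this is roughly what the same fetch would look like with file_get_contents() and a stream context; the context options below are just one way to mirror the redirect-following behaviour, and the 30-second timeout is an arbitrary value:

$link = "http://xyz.com";
$context = stream_context_create([
    'http' => [
        'follow_location' => 1,  // follow redirects, like CURLOPT_FOLLOWLOCATION
        'timeout'         => 30, // give up after 30 seconds
    ],
]);
$result = file_get_contents($link, false, $context);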

Further, every time I crawl a web page, I fetch a lot of links to visit next. This may take a long time (days, if you crawl big websites like eBay), and in that case my PHP code will time out. What is the right way to automate this? Is there a way to prevent PHP from timing out by changing something on the server, or is there another solution?

5 Answers
Bombasti · 2019-09-09 12:17

I recommend curl for reading website contents.

To keep the PHP script from timing out, you can use set_time_limit(). The advantage is that you can call it before every server connection, since each call resets the previous timeout countdown; the script is only terminated if a single connection exceeds the limit. Passing 0 removes the time limit entirely.
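A minimal sketch of that pattern, assuming a hypothetical $urls queue and a 30-second budget per page:

foreach ($urls as $url) {
    set_time_limit(30); // restart the countdown: this page gets its own 30 seconds

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... parse $html and queue any newly discovered links ...
}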

Alternatively, you can set the timeout through the max_execution_time directive in the PHP configuration, but note that this will apply to all PHP scripts rather than just the crawler.

http://php.net/manual/en/function.set-time-limit.php

Rolldiameter · 2019-09-09 12:21

"So, in that case my PHP code will time-out and it won't continue that long."

Are you doing this in the code that's driving your web page? That is, when someone makes a request, are you crawling right then and there to build the response? If so, then yes there is definitely a better way.

If you have a list of the sites you need to crawl, you can set up a scheduled job (using cron for example) to run a command-line application (not a web page) to crawl the sites. At that point you should parse out the data you're looking for and store it in a database. Your site would then just need to point to that database.
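For illustration only, the scheduled job could be a crontab entry pointing at a command-line script like the sketch below; the path, schedule, seed list, table and PDO credentials are all placeholders:

<?php
// crawl.php -- run from cron rather than from a web request, e.g. once an hour:
//   0 * * * * /usr/bin/php /var/www/crawler/crawl.php

$pdo   = new PDO('mysql:host=localhost;dbname=prices', 'user', 'secret');
$sites = ['http://xyz.com/products']; // hypothetical list of sites to crawl

foreach ($sites as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... extract the product names and prices from $html, then store them ...
    // $stmt = $pdo->prepare('INSERT INTO prices (name, price) VALUES (?, ?)');
    // $stmt->execute([$name, $price]);
}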

This is an improvement for two reasons:

  1. Performance
  2. Code Design

Performance: In a request/response system like a website, you want to minimize I/O bottlenecks. The response should take as little time as possible, so you want to avoid in-line work wherever you can. By offloading this process to something outside the context of the website and using a local database, you turn a series of external service calls (slow) into a single local database call (much faster).

Code Design: Separation of concerns. This setup modularizes your code a little bit more. You have one module which is in charge of fetching the data and another which is in charge of displaying the data. Neither of them should ever need to know or care about how the other accomplishes its tasks. So if you ever need to replace one (such as finding a better scraping method) you won't also need to change the other.

闹够了就滚 · 2019-09-09 12:27

I'd opt for cURL, since it gives you much more flexibility and lets you enable compression and HTTP keep-alive.
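A rough sketch of both options (the URLs are placeholders): an empty CURLOPT_ENCODING asks the server for any compression cURL can decode, and reusing a single handle lets cURL keep the connection open between requests.

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, '');                           // accept gzip/deflate; cURL decompresses for you
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Connection: keep-alive']);

foreach (['http://xyz.com/a', 'http://xyz.com/b'] as $url) {
    curl_setopt($ch, CURLOPT_URL, $url); // same handle, so the TCP connection can be reused
    $html = curl_exec($ch);
    // ... process $html ...
}
curl_close($ch);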

But why reinvent the wheel? Check out PHPCrawl. It downloads URLs over sockets (fsockopen), supports running multiple crawler processes at once (on Linux), and has a lot of crawling options that probably cover all of your needs. It takes care of timeouts for you as well, and good examples for basic crawlers are available.

成全新的幸福 · 2019-09-09 12:32

You could reinvent the wheel here, but why not look at a framework like PHPCrawl or Sphider? (although the latter may not be exactly what you're looking for)

Per the documentation, file_get_contents works best for reading files on the server, so I strongly suggest you use curl instead. As for fixing any timeout issues, set_time_limit is the option you want. set_time_limit(0) should prevent your script from timing out.

You'll want to set the timeout in Apache as well, however. Edit your httpd.conf and change the line that reads Timeout to Timeout 0 for an infinite timeout.

三岁会撩人 · 2019-09-09 12:34

cURL is the better option; file_get_contents() is meant for reading files on your own server.

You can set cURL's timeout to 0 to get an unlimited timeout. You will have to raise the timeout in Apache as well.
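A minimal sketch of the cURL side (the URL is a placeholder); a value of 0 means no limit for both options:

$ch = curl_init("http://xyz.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0); // wait indefinitely while connecting
curl_setopt($ch, CURLOPT_TIMEOUT, 0);        // no limit on the whole transfer
$html = curl_exec($ch);
curl_close($ch);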
