Using some basic website scraping, I am trying to prepare a database for price comparison which will ease users' search experiences. Now, I have several questions:
Should I use file_get_contents() or curl to get the contents of the required web page?
$link = "http://xyz.com";
$res55 = curl_init($link);
curl_setopt($res55, CURLOPT_RETURNTRANSFER, true); // return the response body instead of printing it
curl_setopt($res55, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
$result = curl_exec($res55);
curl_close($res55);
Further, every time I crawl a web page, I fetch a lot of links to visit next. This may take a long time (days, if you crawl large websites like eBay). In that case, my PHP code will time out. What would be an automated way to do this? Is there a way to prevent PHP from timing out by changing settings on the server, or is there another solution?
I recommend curl for reading website contents.
To avoid the PHP script timing out, you can use set_time_limit. The advantage is that you can set a fresh timeout for every server connection, since calling the function resets the previous countdown, so the script is only terminated if a single connection exceeds the limit. No time limit is applied if 0 is passed as the parameter.

Alternatively, you can set the timeout via the max_execution_time directive in your PHP configuration, but note that this applies to all PHP scripts rather than just the crawler.
http://php.net/manual/en/function.set-time-limit.php
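A minimal sketch of how that could look in a crawl loop; the $urls list and the 30-second per-request budget are placeholders, not part of the original question:

// Hypothetical list of pages to crawl; replace with your own queue.
$urls = ["http://xyz.com/page1", "http://xyz.com/page2"];

foreach ($urls as $url) {
    // Restart the countdown before each request so the whole run never
    // trips the global limit; pass 0 to remove the limit entirely.
    set_time_limit(30);

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... parse $html and queue newly discovered links here ...
}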
Are you doing this in the code that's driving your web page? That is, when someone makes a request, are you crawling right then and there to build the response? If so, then yes, there is definitely a better way.
If you have a list of the sites you need to crawl, you can set up a scheduled job (using cron for example) to run a command-line application (not a web page) to crawl the sites. At that point you should parse out the data you're looking for and store it in a database. Your site would then just need to point to that database.
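A rough sketch of that setup, assuming a crawler.php script run nightly by cron; the database name, table layout, and the regex used to pull out prices are all placeholders you would replace with your own parsing:

<?php
// crawler.php -- run from the command line, e.g. nightly via cron:
//   0 2 * * * /usr/bin/php /path/to/crawler.php
// Database credentials, table name, and parsing logic are placeholders.

set_time_limit(0); // CLI runs have no limit by default, but be explicit

$pdo = new PDO('mysql:host=localhost;dbname=pricedb', 'user', 'pass');

$sites = ["http://xyz.com/products"]; // your list of sites to crawl

foreach ($sites as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        continue; // skip sites that failed to download
    }

    // Extract product/price pairs however suits the markup;
    // this regex is only an illustration.
    if (preg_match_all('/data-name="([^"]+)"\s+data-price="([^"]+)"/', $html, $m, PREG_SET_ORDER)) {
        $stmt = $pdo->prepare('INSERT INTO prices (product, price, source) VALUES (?, ?, ?)');
        foreach ($m as $match) {
            $stmt->execute([$match[1], $match[2], $url]);
        }
    }
}

Your web page then only queries the prices table, never the remote sites.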
This is an improvement for two reasons:
Performance: In a request/response system like a website, you want to minimize I/O bottlenecks. The response should take as little time as possible, so you want to avoid doing work in-line wherever you can. By offloading this process to something outside the context of the website and using a local database, you turn a series of external service calls (slow) into a single local database call (much faster).
Code Design: Separation of concerns. This setup modularizes your code a little bit more. You have one module which is in charge of fetching the data and another which is in charge of displaying the data. Neither of them should ever need to know or care about how the other accomplishes its tasks. So if you ever need to replace one (such as finding a better scraping method) you won't also need to change the other.
I'd opt for cURL, since it gives you much more flexibility and lets you enable compression and HTTP keep-alive.
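As a sketch of what that flexibility looks like (the URLs are placeholders): reusing one handle lets cURL keep connections to the same host alive between requests, and CURLOPT_ENCODING asks the server for compressed responses.

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_ENCODING, ''); // accept any encoding cURL supports (gzip, deflate)

foreach (["http://xyz.com/a", "http://xyz.com/b"] as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $html = curl_exec($ch); // connections to the same host are reused where possible
    // ... process $html ...
}
curl_close($ch);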
But why reinvent the wheel? Check out PHPCrawl. It uses sockets (fsockopen) to download URLs, supports multiple crawler processes at once (on Linux), and has a lot of crawling options that probably meet all of your needs. It takes care of timeouts for you as well and has good examples available for basic crawlers.

You could reinvent the wheel here, but why not look at a framework like PHPCrawl or Sphider? (Although the latter may not be exactly what you're looking for.)
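To give an idea of the shape of a PHPCrawl-based crawler, here is a sketch along the lines of the examples in its documentation; the class and method names are from the PHPCrawl 0.8 series as I remember them, so verify them against the version you actually install:

class MyCrawler extends PHPCrawler
{
    // Called once for every document the crawler downloads.
    function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . " (" . $DocInfo->http_status_code . ")\n";
        // ... parse $DocInfo->source and store what you need ...
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://xyz.com");                    // placeholder start URL
$crawler->addContentTypeReceiveRule("#text/html#");    // only fetch HTML pages
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i"); // skip images
$crawler->go();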
Per the documentation, file_get_contents() works best for reading files on the server, so I strongly suggest you use cURL instead. As for fixing any timeout issues, set_time_limit() is the option you want; set_time_limit(0) should prevent your script from timing out.

You'll want to set the timeout in Apache as well, however. Edit your httpd.conf and change the line that reads Timeout so that it says Timeout 0 for an infinite timeout.
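For reference, the change described above is a one-line edit in httpd.conf; whether a value of 0 is honoured as "no timeout" depends on your Apache version, so check the documentation for the Timeout directive before relying on it:

# httpd.conf -- the default is typically "Timeout 300" (seconds)
Timeout 0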
cURL is the better option here; file_get_contents() is for reading files on your own server.

You can set the timeout in cURL to 0 in order to have an unlimited timeout. You have to set the timeout in Apache too.
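A sketch of the cURL side of this suggestion; 0 is also libcurl's default, meaning no limit on the transfer, and the URL is a placeholder:

$ch = curl_init("http://xyz.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 0);        // no limit on the whole transfer (the default)
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0); // no limit while connecting
$result = curl_exec($ch);
curl_close($ch);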