Can't seem to get a web page's contents via cURL

Posted 2019-02-25 08:43

Question:

For some reason I can't seem to get this particular web page's contents via cURL. I've managed to fetch the top-level page's contents fine with cURL, but the same quick self-built cURL function doesn't work for one of the linked sub-pages.

Top level page: http://www.deindeal.ch/

A sub page: http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/

My cURL function (in functions.php)

function curl_get($url) {
    $ch = curl_init();
    // Browser-like request headers
    $header = array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Accept-Language: en-us;q=0.8,en;q=0.6'
    );
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => 0, // don't include response headers in the output
        CURLOPT_RETURNTRANSFER => 1, // return the body instead of echoing it
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
        CURLOPT_HTTPHEADER     => $header
    );
    curl_setopt_array($ch, $options);
    $return = curl_exec($ch);
    curl_close($ch);

    return $return;
}

PHP file to get the contents (using echo for testing)

require "functions.php";
require "phpQuery.php";

echo curl_get('http://www.deindeal.ch/deals/hotel-walliserhof-zermatt-2-naechte-30/');

So far I've attempted the following to get this to work:

  • Ran the file both locally (XAMPP) and remotely (LAMP).
  • Added the user-agent and HTTP headers as recommended in "file_get_contents and CURL can't open a specific website". Before that, curl_get() contained all the options shown above except CURLOPT_USERAGENT and CURLOPT_HTTPHEADER.

Is it possible for a website to completely block requests via cURL or other remote file-opening mechanisms, regardless of how much data is supplied to mimic a real browser request?

Also, is it possible to diagnose why my requests are turning up with nothing?

Any help answering the above two questions, or suggestions for getting the page's contents (even through a method other than cURL), would be greatly appreciated ;).

Answer 1:

Try adding:

CURLOPT_FOLLOWLOCATION => TRUE

to your options.
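
As a sketch, the $options array from the question's curl_get() would then look like this:

$options = array(
    CURLOPT_URL            => $url,
    CURLOPT_HEADER         => 0,
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_FOLLOWLOCATION => true, // follow redirects like the 302 shown below
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
    CURLOPT_HTTPHEADER     => $header
);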

If you run a simple curl request from the command line (including -i to see the response headers), it is pretty easy to see what is going on:

$ curl -i 'http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/'
HTTP/1.1 302 FOUND
Date: Fri, 30 Dec 2011 02:42:54 GMT
Server: Apache/2.2.16 (Debian)
Vary: Accept-Language,Cookie,Accept-Encoding
Content-Language: de
Set-Cookie: csrftoken=d127d2de73fb3bd72e8986daeca86711; Domain=www.deindeal.ch; Max-Age=31449600; Path=/
Set-Cookie: generic_cookie=1; Path=/
Set-Cookie: sessionid=987b1a11224ecd0e009175470cf7317b; expires=Fri, 27-Jan-2012 02:42:54 GMT; Max-Age=2419200; Path=/
Location: http://www.deindeal.ch/welcome/?deal_slug=hotel-cristal-in-nuernberg-30
Content-Length: 0
Connection: close
Content-Type: text/html; charset=utf-8

As you can see, it returns a 302 with a Location header. If you hit that location directly, you will get the content you are looking for.

And to answer your two questions:

  1. No, it is not possible to block requests from something like curl. If the consumer can speak HTTP, then it can get to anything the browser can get to.
  2. Diagnosing with an HTTP proxy could have been helpful for you. Wireshark, Fiddler, Charles, et al. should help you out in the future. Or do as I did and make a request from the command line; from PHP, curl_error() and curl_getinfo() do the same job (see the sketch below).
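
For the PHP side, a minimal sketch of that kind of diagnosis (the URL is the one from the question; curl_error() reports transport failures, curl_getinfo() reports the HTTP status and final URL):

$ch = curl_init('http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);

if ($body === false) {
    // Transport-level failure: DNS, timeout, SSL, etc.
    echo 'cURL error: ' . curl_error($ch) . "\n";
} else {
    // An "empty" result with a 3xx status means an unfollowed redirect
    echo 'HTTP status: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
    echo 'Final URL:   ' . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . "\n";
}
curl_close($ch);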

EDIT
Ah, I see what you are talking about now. So, when you go to that link for the first time you are redirected and a cookie (or cookies) is set. Once you have those cookies, your request goes through as intended.

So, you need to use a cookiejar, like in this example: http://icfun.blogspot.com/2009/04/php-how-to-use-cookie-jar-with-curl.html

So you will need to make an initial request, save the cookies, and then include those cookies in your subsequent requests.
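
A minimal sketch of that flow, assuming PHP can write a cookies.txt file next to the script (the file name is arbitrary):

$cookieFile = __DIR__ . '/cookies.txt'; // must be writable by PHP

$ch = curl_init('http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,        // follow the 302 to /welcome/...
    CURLOPT_COOKIEJAR      => $cookieFile, // save cookies here when the handle closes
    CURLOPT_COOKIEFILE     => $cookieFile, // send previously saved cookies
));
$html = curl_exec($ch);
curl_close($ch);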



Tags: php http curl