PHP - `get_headers` returns “400 Bad Request” and “403 Forbidden”

Published 2019-05-01 17:10

Question:

Working solution at bottom of description!

I am running PHP 5.4, and trying to get the headers of a list of URLs.

For the most part, everything is working fine, but there are three URLs that are causing issues (and likely more, with more extensive testing).

'http://www.alealimay.com'
'http://www.thelovelist.net'
'http://www.bleedingcool.com'

All three sites work fine in a browser, and produce the following header responses (screenshot from Safari omitted):

Note that all three header responses are Code = 200

But retrieving the headers via PHP, using get_headers...

// Temporarily make HEAD the default request method for the http wrapper
stream_context_set_default(array('http' => array('method' => "HEAD")));
$headers = get_headers($url, 1);
// Restore the default GET method afterwards
stream_context_set_default(array('http' => array('method' => "GET")));

... returns the following:

url  ......  "http://www.alealimay.com"

headers
|    0  ............................  "HTTP/1.0 400 Bad Request"
|    content-length  ...............  "378"
|    X-Synthetic  ..................  "true"
|    expires  ......................  "Thu, 01 Jan 1970 00:00:00 UTC"
|    pragma  .......................  "no-cache"
|    cache-control  ................  "no-cache, must-revalidate"
|    content-type  .................  "text/html; charset=UTF-8"
|    connection  ...................  "close"
|    date  .........................  "Wed, 24 Aug 2016 01:26:21 UTC"
|    X-ContextId  ..................  "QIFB0I8V/xsTFMREg"
|    X-Via  ........................  "1.0 echo109"



url  ......  "http://www.thelovelist.net"

headers
|    0  ............................  "HTTP/1.0 400 Bad Request"
|    content-length  ...............  "378"
|    X-Synthetic  ..................  "true"
|    expires  ......................  "Thu, 01 Jan 1970 00:00:00 UTC"
|    pragma  .......................  "no-cache"
|    cache-control  ................  "no-cache, must-revalidate"
|    content-type  .................  "text/html; charset=UTF-8"
|    connection  ...................  "close"
|    date  .........................  "Wed, 24 Aug 2016 01:26:22 UTC"
|    X-ContextId  ..................  "aNKvf2RB/bIMjWyjW"
|    X-Via  ........................  "1.0 echo103"



url  ......  "http://www.bleedingcool.com"

headers
|    0  ............................  "HTTP/1.1 403 Forbidden"
|    Server  .......................  "Sucuri/Cloudproxy"
|    Date  .........................  "Wed, 24 Aug 2016 01:26:22 GMT"
|    Content-Type  .................  "text/html"
|    Content-Length  ...............  "5311"
|    Connection  ...................  "close"
|    Vary  .........................  "Accept-Encoding"
|    ETag  .........................  "\"57b7f28e-14bf\""
|    X-XSS-Protection  .............  "1; mode=block"
|    X-Frame-Options  ..............  "SAMEORIGIN"
|    X-Content-Type-Options  .......  "nosniff"
|    X-Sucuri-ID  ..................  "11005"

This is the case regardless of changing the stream context:

//stream_context_set_default(array('http' => array('method' => "HEAD")));
$headers = get_headers($url, 1);
//stream_context_set_default(array('http' => array('method' => "GET")));

Produces the same result.

No warnings or errors are thrown for any of these (I normally have errors suppressed with @get_headers, but it makes no difference either way).

I have checked my php.ini, and have allow_url_fopen set to On.

I am heading towards stream_get_meta_data, and am not interested in cURL solutions. stream_get_meta_data (and its accompanying fopen) fails at the same spot as get_headers, so fixing one will fix both in this case.
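For reference, a minimal sketch of that route (my own illustration, not from the original post); the fopen call fails for these URLs at the same point as get_headers:

// Sketch: fetching response headers via fopen/stream_get_meta_data
$handle = @fopen($url, 'r');
if ($handle !== false) {
    $meta = stream_get_meta_data($handle);
    var_dump($meta['wrapper_data']); // the raw response header lines
    fclose($handle);
}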

Usually, if there are redirects, the output looks like:

url  ......  "http://www.startingURL.com/"

headers
|    0  ............................  "HTTP/1.1 301 Moved Permanently"
|    1  ............................  "HTTP/1.1 200 OK"
|    Date
|    |    "Wed, 24 Aug 2016 02:02:29 GMT"
|    |    "Wed, 24 Aug 2016 02:02:32 GMT"
|    
|    Server
|    |    "Apache"
|    |    "Apache"
|    
|    Location  .....................  "http://finishingURL.com/"
|    Connection
|    |    "close"
|    |    "close"
|    
|    Content-Type
|    |    "text/html; charset=UTF-8"
|    |    "text/html; charset=UTF-8"
|    
|    Link  .........................  "; rel=\"https://api.w.org/\", ; rel=shortlink"

How come the sites work in browsers, but fail when using get_headers?

There are various SO posts discussing the same thing, but none of their solutions pertain to this case:

POST requires Content-Length (I'm sending a HEAD request, no content is returned)

URL contains UTF-8 data (The only chars in these URLs are all from the Latin alphabet)

Cannot send a URL with spaces in it (These URLs are all space-free, and very ordinary in every way)

Solution!

(Thanks to Max in the answers below for pointing me on the right track.)

The issue is that there is no pre-defined user_agent; one must either be set in php.ini or declared in code.

So, I change the user_agent to mimic a browser, do the deed, and then revert it back to its starting value (likely blank).

// Remember the current user_agent (likely blank)
$OriginalUserAgent = ini_get('user_agent');
// Mimic a browser for the request
ini_set('user_agent', 'Mozilla/5.0');

$headers = @get_headers($url, 1);

// Restore the original value
ini_set('user_agent', $OriginalUserAgent);

User agent change found here.
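For repeated lookups, the same swap can be wrapped in a small helper; a sketch (the name getHeadersWithUA is hypothetical, not from any library):

// Hypothetical helper: wraps the user_agent swap around get_headers
function getHeadersWithUA($url, $ua = 'Mozilla/5.0')
{
    $original = ini_get('user_agent');  // remember the current value
    ini_set('user_agent', $ua);         // mimic a browser
    $headers = @get_headers($url, 1);   // associative array, or false
    ini_set('user_agent', $original);   // restore the original value
    return $headers;
}

$headers = getHeadersWithUA('http://www.alealimay.com');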

Answer 1:

It happens because all three of these sites check the User-Agent header of the request, and respond with an error if it cannot be matched. The get_headers function does not send this header. You may try cURL and this code snippet to get the contents of the sites:

$url = 'http://www.alealimay.com';
$c = curl_init($url);
// Send a User-Agent header so the server does not reject the request
curl_setopt($c, CURLOPT_USERAGENT, 'curl/7.48.0');
curl_exec($c); // outputs the response body
var_dump(curl_getinfo($c));
curl_close($c);
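If only the status code is wanted, a variant along these lines should work (a sketch using the standard CURLOPT_NOBODY and CURLOPT_RETURNTRANSFER options):

// Variant sketch: HEAD request, nothing echoed, just the status code
$c = curl_init($url);
curl_setopt($c, CURLOPT_USERAGENT, 'curl/7.48.0');
curl_setopt($c, CURLOPT_NOBODY, true);         // send a HEAD request
curl_setopt($c, CURLOPT_RETURNTRANSFER, true); // don't print the response
curl_exec($c);
$status = curl_getinfo($c, CURLINFO_HTTP_CODE); // e.g. 200 on success
curl_close($c);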

UPD: It's not necessary to use cURL to set the user agent header. It can also be done with ini_set('user_agent', 'Mozilla/5.0'), and then the get_headers function will use the configured value.
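The user agent can likewise be supplied through the default stream context, which get_headers honors; a sketch combining it with the HEAD method from the question:

// Alternative sketch: set the user agent via the default stream context
stream_context_set_default(array(
    'http' => array(
        'method'     => 'HEAD',
        'user_agent' => 'Mozilla/5.0',
    ),
));
$headers = get_headers('http://www.alealimay.com', 1);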