Working solution at bottom of description!
I am running PHP 5.4, and trying to get the headers of a list of URLs.
For the most part, everything is working fine, but there are three URLs that are causing issues (and likely more, with more extensive testing).
'http://www.alealimay.com'
'http://www.thelovelist.net'
'http://www.bleedingcool.com'
All three sites work fine in a browser, and produce the following header responses:
(From Safari)
Note that all three header responses are Code = 200
But retrieving the headers via PHP, using get_headers
...
stream_context_set_default(array('http' => array('method' => "HEAD")));
$headers = get_headers($url, 1);
stream_context_set_default(array('http' => array('method' => "GET")));
... returns the following:
url ...... "http://www.alealimay.com"
headers
| 0 ............................ "HTTP/1.0 400 Bad Request"
| content-length ............... "378"
| X-Synthetic .................. "true"
| expires ...................... "Thu, 01 Jan 1970 00:00:00 UTC"
| pragma ....................... "no-cache"
| cache-control ................ "no-cache, must-revalidate"
| content-type ................. "text/html; charset=UTF-8"
| connection ................... "close"
| date ......................... "Wed, 24 Aug 2016 01:26:21 UTC"
| X-ContextId .................. "QIFB0I8V/xsTFMREg"
| X-Via ........................ "1.0 echo109"
url ...... "http://www.thelovelist.net"
headers
| 0 ............................ "HTTP/1.0 400 Bad Request"
| content-length ............... "378"
| X-Synthetic .................. "true"
| expires ...................... "Thu, 01 Jan 1970 00:00:00 UTC"
| pragma ....................... "no-cache"
| cache-control ................ "no-cache, must-revalidate"
| content-type ................. "text/html; charset=UTF-8"
| connection ................... "close"
| date ......................... "Wed, 24 Aug 2016 01:26:22 UTC"
| X-ContextId .................. "aNKvf2RB/bIMjWyjW"
| X-Via ........................ "1.0 echo103"
url ...... "http://www.bleedingcool.com"
headers
| 0 ............................ "HTTP/1.1 403 Forbidden"
| Server ....................... "Sucuri/Cloudproxy"
| Date ......................... "Wed, 24 Aug 2016 01:26:22 GMT"
| Content-Type ................. "text/html"
| Content-Length ............... "5311"
| Connection ................... "close"
| Vary ......................... "Accept-Encoding"
| ETag ......................... "\"57b7f28e-14bf\""
| X-XSS-Protection ............. "1; mode=block"
| X-Frame-Options .............. "SAMEORIGIN"
| X-Content-Type-Options ....... "nosniff"
| X-Sucuri-ID .................. "11005"
This is the case whether or not the default stream context is changed:
//stream_context_set_default(array('http' => array('method' => "HEAD")));
$headers = get_headers($url, 1);
//stream_context_set_default(array('http' => array('method' => "GET")));
Produces the same result.
No warnings or errors are thrown for any of these (I normally suppress errors with @get_headers, but it makes no difference either way).
I have checked my php.ini, and have allow_url_fopen set to On.
I am headed towards stream_get_meta_data, and am not interested in cURL solutions. stream_get_meta_data (and its accompanying fopen) will fail in the same spot as get_headers, so fixing one will fix both in this case.
Usually, if there are redirects, the output looks like:
url ...... "http://www.startingURL.com/"
headers
| 0 ............................ "HTTP/1.1 301 Moved Permanently"
| 1 ............................ "HTTP/1.1 200 OK"
| Date
| | "Wed, 24 Aug 2016 02:02:29 GMT"
| | "Wed, 24 Aug 2016 02:02:32 GMT"
|
| Server
| | "Apache"
| | "Apache"
|
| Location ..................... "http://finishingURL.com/"
| Connection
| | "close"
| | "close"
|
| Content-Type
| | "text/html; charset=UTF-8"
| | "text/html; charset=UTF-8"
|
| Link ......................... "; rel=\"https://api.w.org/\", ; rel=shortlink"
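The redirect output above suggests how the array from get_headers($url, 1) is laid out: each status line sits under an integer key (one per hop), and repeated headers are collected into arrays. A minimal sketch of reading the final status code from such an array follows. The function name final_status_code and the sample array are hypothetical, made up for illustration and shaped like the output above.

```php
<?php
// Hypothetical helper: returns the status code of the last hop found in an
// array shaped like the result of get_headers($url, 1), or null if none.
function final_status_code(array $headers)
{
    $code = null;
    foreach ($headers as $key => $value) {
        // Status lines live under integer keys; named headers are skipped.
        if (is_int($key) && is_string($value)
            && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $value, $m)) {
            $code = (int) $m[1]; // later hops overwrite earlier ones
        }
    }
    return $code;
}

// Sample data modelled on the redirect output above (abbreviated).
$sample = array(
    0          => 'HTTP/1.1 301 Moved Permanently',
    1          => 'HTTP/1.1 200 OK',
    'Server'   => array('Apache', 'Apache'),
    'Location' => 'http://finishingURL.com/',
);

var_dump(final_status_code($sample));
```

Checking only index 0 would report the 301 here, so taking the last integer-keyed entry seems the safer way to compare against what a browser ends up showing.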
How come the sites work in browsers, but fail when using get_headers?
There are various SO posts discussing the same thing, but none of their solutions pertain to this case:
- POST requires Content-Length (I'm sending a HEAD request, so no content is returned)
- URL contains UTF-8 data (the only characters in these URLs are from the Latin alphabet)
- Cannot send a URL with spaces in it (these URLs are all space-free, and very ordinary in every way)
Solution!
(Thanks to Max in the answers below for pointing me on the right track.)
The issue is that there is no pre-defined user_agent unless one is set in php.ini or declared in code.
So, I change the user_agent to mimic a browser, do the deed, and then revert it to its starting value (likely blank).
$OriginalUserAgent = ini_get('user_agent');
ini_set('user_agent', 'Mozilla/5.0');
$headers = @get_headers($url, 1);
ini_set('user_agent', $OriginalUserAgent);
User agent change found here.
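Since the question mentions moving to fopen and stream_get_meta_data, the same User-Agent fix can probably be applied per request through a stream context, which leaves the ini value alone. On PHP 5.4 get_headers takes no context argument, so this sketch goes through fopen instead. It is an untested assumption that these three sites accept it; the helper names build_head_context and fetch_header_lines are hypothetical, and ignore_errors is added so that headers are still returned for 4xx/5xx replies.

```php
<?php
// Hypothetical helper: builds an HTTP stream context that sends a HEAD
// request with an explicit User-Agent, without touching php.ini settings.
function build_head_context($userAgent)
{
    return stream_context_create(array(
        'http' => array(
            'method'        => 'HEAD',
            'user_agent'    => $userAgent,
            'ignore_errors' => true, // keep the stream open on 4xx/5xx replies
        ),
    ));
}

// Hypothetical helper: returns the raw response header lines for $url,
// or false if the stream could not be opened.
function fetch_header_lines($url, $userAgent = 'Mozilla/5.0')
{
    $fp = @fopen($url, 'r', false, build_head_context($userAgent));
    if ($fp === false) {
        return false;
    }
    $meta = stream_get_meta_data($fp);
    fclose($fp);
    // For the http wrapper, wrapper_data holds the header lines.
    return isset($meta['wrapper_data']) ? $meta['wrapper_data'] : false;
}
```

Usage would look like $lines = fetch_header_lines($url); the result is a flat list of header lines rather than the keyed array that get_headers($url, 1) returns, so any code reading the headers would need adjusting.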
This happens because all three of these sites check the User-Agent header of the request and respond with an error if it is missing or cannot be matched.
The get_headers function does not send this header by default. You may try cURL and this code snippet for getting the content of the sites:
UPD: It's not necessary to use cURL to set the User-Agent header. It can also be done with
ini_set('user_agent', 'Mozilla/5.0');
and then the get_headers function will use the configured value.