I am trying out Goutte, the PHP web crawler based on Symfony2 components. I've successfully retrieved Google in both plaintext and SSL forms. However, I've come across an ASP/SSL page that won't load.
Here's my code:
// Load a crawler/browser system
require_once 'vendor/goutte/goutte.phar';
// Here's a demo of a page we want to parse
$uri = '(removed)';
use Goutte\Client;
$client = new Client();
$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";
For this one site, however, the echo at the end of the above code gives me this instead of the page text:
Bad Request (Invalid Header Name)
I can see the site fine in Firefox, and the HTML for it can be retrieved fine using wget --no-check-certificate with no other options (setting the header or user agent, for example).
I suspect I need to set some HTTP headers in Goutte. Has anyone any ideas which ones I should try?
I had this problem too.
Adding a User-Agent header was not enough. I added HTTP_USER_AGENT using the setServerParameter function and it worked like a charm.
Here's the complete code:
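A minimal sketch of that approach, assuming the same Goutte phar as in the question. The URL and the user agent string below are placeholders rather than values from the original post, and the comment about HTTP_-prefixed parameters reflects how Goutte is understood to build its request headers:

```php
<?php
// Load Goutte from the phar, as in the question
require_once 'vendor/goutte/goutte.phar';

use Goutte\Client;

// Placeholder URL (assumption); substitute the page that returns "Bad Request"
$uri = 'https://example.com/';

$client = new Client();

// setServerParameter() is inherited from Symfony's BrowserKit client.
// Goutte turns HTTP_-prefixed server parameters into request headers,
// so this should set User-Agent on every request made by this client.
$client->setServerParameter('HTTP_USER_AGENT', 'MyCrawler/1.0'); // hypothetical UA string

$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";
```

Setting the value on the client rather than per request means it applies to any later requests made with the same object, such as followed links or submitted forms.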
I discovered that my browser and wget both add a non-empty user agent field in the header, so I am assuming Goutte sets nothing here. Adding this header to the browser object prior to the fetch fixes the problem:
Here I've copied in my browser agent string, but in this case I think anything would work, as long as it is set.
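The header-setting step described in this answer can be sketched as follows, assuming a Goutte version whose Client exposes setHeader(). The URL is a placeholder, and the user agent is an example browser-style string, not the one actually used by the author:

```php
<?php
// Load Goutte from the phar, as in the question
require_once 'vendor/goutte/goutte.phar';

use Goutte\Client;

// Placeholder URL (assumption); substitute the page that returns "Bad Request"
$uri = 'https://example.com/';

$client = new Client();

// Set a non-empty User-Agent before making the request.
// Example browser-style value (assumption); any non-empty string may be enough.
$client->setHeader(
    'User-Agent',
    'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0'
);

$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";
```

If this has no effect in a given setup, as the other answer reports, setting HTTP_USER_AGENT through setServerParameter() is the alternative to try.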
Incidentally, I used a browser UA here as I was trying to accurately replicate the browser environment for debugging this particular problem. Once it worked I switched to a custom UA, so target sites can detect it as a bot if they wish to (for this project I don't think anyone has).