Does anyone know how to tell the 'facebookexternalhit' bot to spread its traffic?
Our website gets hammered every 45 to 60 minutes with spikes of approximately 400 requests per second, from 20 to 30 different IP addresses in the Facebook netblocks. Between the spikes the traffic does not disappear, but the load is acceptable. Of course we do not want to block the bot, but these spikes are risky. We'd prefer to see the bot spread its load evenly over time and behave like Googlebot and friends.
I've seen related bug reports (First Bug, Second Bug and Third Bug (#385275384858817)), but could not find any suggestions on how to manage the load.
Per other answers, the semi-official word from Facebook is "suck it". It boggles my mind that they cannot follow Crawl-delay (yes, I know it's not a "crawler", however GET'ing 100 pages in a few seconds is a crawl, whatever you want to call it).
Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution.
In PHP, execute the following code as quickly as possible for every request.
define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        // read the timestamp of the last Facebook hit; cast so an empty file becomes 0.0
        $lastTime  = (float) fread( $fh, 100 );
        $microTime = microtime( TRUE );
        // compare the current microtime with the microtime of the last access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // bail if requests are coming too quickly, with HTTP 503 Service Unavailable
            header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
            die;
        } else {
            // write out the microsecond time of this access
            rewind( $fh );
            fwrite( $fh, $microTime );
        }
        fclose( $fh );
    } else {
        // could not open the throttle file; refuse the request rather than risk another spike
        header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
        die;
    }
}
You can test this from a command line with something like:
$ rm index.html*; wget -U "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/; less index.html
Improvement suggestions are welcome... I would guess there might be some concurrency issues with a huge blast.
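If concurrency does turn out to be an issue, here is a minimal sketch (untested, assuming the same throttle file and constant as above) of how an exclusive flock() could serialize the check-and-update, with a Retry-After hint added to the 503:

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 );

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    $fh = fopen( $fbTmpFile, 'c+' );
    if( $fh && flock( $fh, LOCK_EX ) ) {
        // exclusive lock: parallel Facebook requests queue up here instead of racing
        $lastTime  = (float) stream_get_contents( $fh );
        $microTime = microtime( TRUE );
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            flock( $fh, LOCK_UN );
            fclose( $fh );
            header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
            header( 'Retry-After: '.FACEBOOK_REQUEST_THROTTLE ); // hint at when to try again
            die;
        }
        // record this hit's timestamp, then release the lock
        rewind( $fh );
        ftruncate( $fh, 0 );
        fwrite( $fh, $microTime );
        flock( $fh, LOCK_UN );
        fclose( $fh );
    }
}

If the file cannot be opened or locked, this version simply lets the request through, which errs on the side of availability rather than a hard 503.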
I know it's an old, but unanswered, question. I hope this answer helps someone.
There's an Open Graph tag named og:ttl that allows you to slow down the requests made by the Facebook crawler (reference):
Crawler rate limiting
You can label pages and objects to change how long Facebook's crawler will wait to check them for new content. Use the og:ttl object property to limit crawler access if our crawler is being too aggressive.
Checking the object properties for og:ttl states that the default TTL is 30 days for each canonical URL shared. So setting this TTL meta tag will only slow requests down if you have a very large number of shared objects accumulating over time.
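For illustration, a minimal sketch of how the tag might be emitted from a PHP template; the 2419200-second value (roughly 28 days) is just an arbitrary example, not a recommended setting:

<?php
// Hypothetical example: choose a TTL in seconds (here ~28 days).
// Facebook's documented default is 30 days per canonical URL.
$ogTtlSeconds = 2419200;
?>
<meta property="og:ttl" content="<?php echo (int) $ogTtlSeconds; ?>" />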
But if you're being hit by Facebook's crawler because of actual live traffic (users sharing a lot of your stories at the same time), this will of course not help.
Another possible cause of too many crawler requests is that your stories are not being shared with a correct canonical URL (og:url) tag.
Let's say your users can reach a certain article on your site from several different sources (they see and share the same article, but the URL they see is different). If you don't set the same og:url tag for all of them, Facebook will treat each one as a different article, and over time will generate crawler requests to all of them instead of just to the one and only canonical URL. More info here.
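As an illustrative sketch (the article URL below is a placeholder), every variant of the page would emit the same og:url so Facebook only ever scrapes the canonical copy:

<?php
// Hypothetical: this article may be reached as /article/123, /article/123?ref=home,
// /m/article/123, etc., but all of those variants should declare one canonical URL.
$canonicalUrl = 'https://www.example.com/article/123';
?>
<meta property="og:url" content="<?php echo htmlspecialchars( $canonicalUrl, ENT_QUOTES ); ?>" />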
Hope it helps.
We had the same problems on our website/server. The problem was the og:url meta tag. After removing it, the problem was solved for most facebookexternalhit calls.
Another problem was that some of the pictures we specified in the og:image tag did not exist, so the facebookexternalhit scraper requested every image on the URL for each call of the URL.
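One way to catch that before the scraper does, sketched here with placeholder URLs, is to check that every og:image URL actually answers with a 200 before putting it in the page:

<?php
// Hypothetical list of image URLs a template would put into og:image tags.
$ogImages = array(
    'https://www.example.com/images/teaser-large.jpg',
    'https://www.example.com/images/teaser-square.jpg',
);

foreach( $ogImages as $url ) {
    // get_headers() fetches the response headers; element 0 is the status line,
    // e.g. "HTTP/1.1 200 OK" or "HTTP/1.1 404 Not Found".
    $headers = @get_headers( $url );
    if( $headers === false || strpos( $headers[0], '200' ) === false ) {
        echo "Missing or broken og:image: $url\n";
    }
}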