excessive traffic from facebookexternalhit bot

Posted 2019-01-11 20:38

Question:

Does anyone know how to tell the 'facebookexternalhit' bot to spread its traffic?

Our website gets hammered every 45-60 minutes with spikes of approximately 400 requests per second, from 20 to 30 different IP addresses in the Facebook netblocks. Between the spikes the traffic does not disappear, but the load is acceptable. Of course we do not want to block the bot, but these spikes are risky. We'd prefer to see the bot spread its load evenly over time and behave like Googlebot & friends.

I've seen related bug reports (First Bug, Second Bug and Third Bug (#385275384858817)), but could not find any suggestions on how to manage the load.

Answer 1:

Per other answers, the semi-official word from Facebook is "suck it". It boggles me that they cannot honor Crawl-delay (yes, I know it's not a "crawler", but GET'ing 100 pages in a few seconds is a crawl, whatever you want to call it).

Since one cannot appeal to their hubris, and DROP'ing their IP block is pretty draconian, here is my technical solution.

In PHP, execute the following code as quickly as possible for every request.

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    if( $fh = fopen( $fbTmpFile, 'c+' ) ) {
        $lastTime = fread( $fh, 100 );
        $microTime = microtime( TRUE );
        // check current microtime with microtime of last access
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // bail if requests are coming too quickly with http 503 Service Unavailable
            header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
            die;
        } else {
            // write out the microsecond time of last access
            rewind( $fh );
            fwrite( $fh, $microTime );
        }
        fclose( $fh );
    } else {
        header( $_SERVER["SERVER_PROTOCOL"].' 503 Service Unavailable' );
        die;
    }
}

You can test this from a command line with something like:

$ rm index.html*; wget -U "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" http://www.foobar.com/; less index.html

Improvement suggestions are welcome... I would guess there might be some concurrency issues with a huge blast.
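
One possible improvement, offered only as a sketch: guard the timestamp file with flock() so that two simultaneous Facebook hits cannot both read a stale timestamp. LOCK_NB means a request that cannot grab the lock immediately just answers 503 instead of queueing during a blast:

define( 'FACEBOOK_REQUEST_THROTTLE', 2.0 ); // Number of seconds permitted between each hit from facebookexternalhit

if( !empty( $_SERVER['HTTP_USER_AGENT'] ) && preg_match( '/^facebookexternalhit/', $_SERVER['HTTP_USER_AGENT'] ) ) {
    $fbTmpFile = sys_get_temp_dir().'/facebookexternalhit.txt';
    $fh = fopen( $fbTmpFile, 'c+' );
    // LOCK_NB: if another Facebook request already holds the lock, answer 503 instead of waiting
    if( $fh && flock( $fh, LOCK_EX | LOCK_NB ) ) {
        $lastTime  = (float) stream_get_contents( $fh );
        $microTime = microtime( TRUE );
        if( $microTime - $lastTime < FACEBOOK_REQUEST_THROTTLE ) {
            // requests are coming too quickly: bail with 503 Service Unavailable
            flock( $fh, LOCK_UN );
            fclose( $fh );
            header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
            die;
        }
        // record the time of this accepted hit for the next request to compare against
        ftruncate( $fh, 0 );
        rewind( $fh );
        fwrite( $fh, (string) $microTime );
        flock( $fh, LOCK_UN );
        fclose( $fh );
    } else {
        // could not open or lock the file: fail closed, as in the original
        header( $_SERVER['SERVER_PROTOCOL'].' 503 Service Unavailable' );
        die;
    }
}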



Answer 2:

I know it's an old, but unanswered, question. I hope this answer helps someone.

There's an Open Graph tag named og:ttl that allows you to slow down the requests made by the Facebook crawler: (reference)

Crawler rate limiting: You can label pages and objects to change how long Facebook's crawler will wait to check them for new content. Use the og:ttl object property to limit crawler access if our crawler is being too aggressive.

The object properties documentation for og:ttl states that the default TTL is 30 days for each canonical URL shared, so setting this meta tag will only slow requests down if a very large number of objects are shared over time.

But if Facebook's crawler is hitting you because of actual live traffic (users sharing a lot of your stories at the same time), this will of course not help.

Another possible cause of too many crawler requests is that your stories are not being shared with a correct canonical URL (og:url) tag. Say your users can reach a certain article on your site from several different sources (they see and share the same article, but the URL they use differs). If you don't set the same og:url tag for all of them, Facebook will treat each one as a different article and, over time, send crawler requests to all of them instead of just to the one and only canonical URL. More info here.
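
As a rough illustration, assuming your page head is rendered in PHP (the URL and the 30-day value below are just placeholders), the relevant tags would look something like this:

<?php
// Placeholder values; adapt to your own CMS/routing.
$canonicalUrl = 'https://www.example.com/articles/my-article'; // identical for every route that renders this article
$ogTtl        = 2592000;                                       // 30 days in seconds, the documented default
?>
<meta property="og:url" content="<?php echo htmlspecialchars( $canonicalUrl ); ?>" />
<meta property="og:ttl" content="<?php echo (int) $ogTtl; ?>" />

The important part is that every route that can render the article emits exactly the same og:url value; otherwise each route gets its own crawl schedule.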

Hope it helps.



Answer 3:

We had the same problems on our website/server. The culprit was the og:url meta tag; after removing it, the problem was solved for most facebookexternalhit calls.

Another problem was that some images we specified in the og:image tag did not exist, so the facebookexternalhit scraper requested every image on the page for each call to the URL.
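
As a rough safeguard for the second problem, something like the following sketch could be used. It relies on a hypothetical helper and on get_headers(), which performs a live HTTP request, so the result should be cached rather than checked on every page view:

<?php
// Hypothetical helper: only emit og:image when the image URL actually resolves.
function print_og_image_if_exists( $imageUrl ) {
    $headers = @get_headers( $imageUrl ); // live HTTP request; cache the result in production
    if( $headers && strpos( $headers[0], '200' ) !== FALSE ) {
        echo '<meta property="og:image" content="' . htmlspecialchars( $imageUrl ) . '" />' . "\n";
    }
}

print_og_image_if_exists( 'https://www.example.com/images/article-cover.jpg' ); // placeholder URL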