How can I stop unidentified bad bots from crawling my website? Some bad bots whose names do not appear in cPanel's Apache statistics are consuming a lot of my website's bandwidth.
I tried a robots.txt file (batgap.com/robots.txt) and also blocked bots with .htaccess, but bandwidth usage has not improved. I don't know the IP addresses of these bots, so I can't block them by IP. They consume so much of the site's bandwidth that, as a result, I need to increase the bandwidth allowance on the server.
I'm from Incapsula and we deal with bad bots on a regular basis.
We recently released bot-related research that provides insight into the scope of the problem ( http://www.incapsula.com/the-incapsula-blog/item/225-what-google-doesnt-show-you-31-of-website-traffic-can-harm-your-business ) and in light of this data I have to agree with @Leonard Challis - you simply cannot handle bot protection manually.
Having said that, there are bot protection solutions, even free ones (us included), that can help you with bad bots.
BTW - just as you mentioned, one byproduct of bad bot visits is a loss of bandwidth.
We've recently become aware of just how surprisingly HUGE bot-related bandwidth usage really is.
This is an interesting topic by itself.
We believe that by avoiding bad bot traffic, hosting providers can greatly improve their efficiency (hopefully using this to lower costs or improve services). Once you consider the social and business implications, you can see that the real scope of the bad bot problem goes well beyond the immediate damage done.
I block 'bad bots' by using PHP.
I filter by IP address first, then by User-Agent.
I make the 'bad bot' wait for up to 999 seconds, then return a very small web page.
Usually (in practice, always) the connection times out and zero (0) bytes are returned.
Best of all, I have delayed them for a few minutes before they get to the next victim.
http://gelm.net/How-to-block-Baidu-with-PHP.htm
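A minimal sketch of this approach, assuming PHP 7 or later. It is not the code from the linked page, only one way the IP-first, User-Agent-second check followed by a long delay could look. The helper names `is_unwanted_client()` and `tarpit_and_exit()` are hypothetical, and the block lists you pass in are your own (for example a partial IP ending in a period and a keyword such as `Baiduspider`):

```php
<?php
// Hypothetical helper: decides whether a request should be stalled.
// $ipPrefixes holds full or partial IPs (partials should end with a period);
// $agentKeywords holds substrings to look for in the User-Agent header.
function is_unwanted_client(string $ip, string $userAgent, array $ipPrefixes, array $agentKeywords): bool
{
    // Primary filter: the client IP starts with a listed prefix.
    foreach ($ipPrefixes as $prefix) {
        if ($prefix !== '' && strncmp($ip, $prefix, strlen($prefix)) === 0) {
            return true;
        }
    }
    // Secondary filter: case-insensitive keyword match on the User-Agent.
    foreach ($agentKeywords as $keyword) {
        if ($keyword !== '' && stripos($userAgent, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: hold the connection open, then send a tiny response.
function tarpit_and_exit(int $delaySeconds): void
{
    sleep($delaySeconds);      // the bot usually gives up before this returns
    http_response_code(403);
    echo 'Forbidden';
    exit;
}
```

At the top of the entry script this could be used as `if (is_unwanted_client($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'] ?? '', $prefixes, $keywords)) { tarpit_and_exit(999); }`. Keep in mind that every stalled request occupies a PHP/Apache worker for the whole delay, so a long sleep can hurt a busy site.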
Unfortunately robots.txt is sometimes ignored by these "bad bots", though if the problem is more about genuine search engine spiders that you don't want to see, they ought to take it into account. I presume that with cPanel you can get to the web server (Apache) logs? In there you can look for two things: the IP and the User-Agent. You can find the culprits in there and add them to your robots.txt and .htaccess. Note that .htaccess rules denying IP addresses are far better than just relying on robots.txt, because you are taking the choice out of the bot creator's hands.
If you know specific bots which are doing this you should be able to get IP addresses and user-agents from forums, but if it's a more general thing then really I'm afraid it's more of a manual job.
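To make the log inspection less of a manual job, here is a rough sketch that totals the bytes served per IP and User-Agent pair. It assumes the access log uses Apache's "combined" format (check your `LogFormat`; cPanel setups may differ), and the function names `parse_combined_log_line()` and `bandwidth_by_client()` are made up for this example. The regular expression is deliberately simple and will skip lines it cannot parse, such as requests containing escaped quotes:

```php
<?php
// Parses one line of an Apache "combined" log. Returns null if it does not match.
function parse_combined_log_line(string $line): ?array
{
    // host ident user [time] "request" status bytes "referer" "user-agent"
    $pattern = '/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-) "([^"]*)" "([^"]*)"/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return [
        'ip'         => $m[1],
        'bytes'      => $m[2] === '-' ? 0 : (int) $m[2], // "-" means no body was sent
        'referer'    => $m[3],
        'user_agent' => $m[4],
    ];
}

// Sums response bytes per "IP | User-Agent" pair, largest consumers first.
function bandwidth_by_client(iterable $lines): array
{
    $totals = [];
    foreach ($lines as $line) {
        $entry = parse_combined_log_line($line);
        if ($entry === null) {
            continue; // unparseable line, ignore it
        }
        $key = $entry['ip'] . ' | ' . $entry['user_agent'];
        $totals[$key] = ($totals[$key] ?? 0) + $entry['bytes'];
    }
    arsort($totals); // highest byte count first, keys preserved
    return $totals;
}
```

For example, `bandwidth_by_client(file('/path/to/access.log', FILE_IGNORE_NEW_LINES))` (the path is a placeholder) returns an array whose first few keys are the clients worth looking at. The parsed `referer` field is also there if you want to check where the traffic claims to come from.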
There are other methods that can be used with varying effect, such as mod_security (http://www.askapache.com/htaccess/modsecurity-htaccess-tricks.html) but this will mean you'll have to access your web server configuration.
Finally, you can check the links that are pointing to your web site (using the link: option on Google). Sometimes, if you have links on spammy forums or the like, this can increase the chances of bots coming to get you. Maybe you can look at the referer URL in the Apache logs - but this is all based on a lot of presumptions, and you'd be lucky if it had a great effect.
Block Unwanted Robots/Spiders visitors via PHP
Instructions:
Place the following PHP Code in the beginning of your index.php file.
The idea here is to place the code in the main site's PHP home page, the main entry point of the site.
If you have other PHP files that are accessed directly via a URL (not including files that are only pulled in through PHP include or require), then place the code at the beginning of those files as well.
For most PHP sites and PHP CMS sites, the root's index.php file is the file that is the main entry point of the site.
Keep in mind that your site statistics, i.e. AWStats, will still log the hits under Unknown robot (identified by 'bot' followed by a space or one of the following characters _+:,.;/-), but these bots will be blocked from accessing your site's content.
<?php
// ---------------------------------------------------------------------------------------------------------------
// Banned IP Addresses and Bots - Redirects banned visitors who make it past the .htaccess and/or robots.txt files to a URL.
// The $banned_ip_addresses array can contain both full and partial IP addresses, i.e. Full = 203.0.113.101, Partial = 203.0.113. or 203.0. or 203.
// Use a partial IP address to match every IP address that begins with it. A partial IP address must end with a period.
// The $banned_bots, $banned_unknown_bots, and $good_bots arrays should contain keyword strings found within the User Agent string.
// The $banned_unknown_bots array is used to identify unknown robots (identified by 'bot' followed by a space or one of the following characters _+:,.;/\-).
// The $good_bots array contains keyword strings used as exemptions when checking for $banned_unknown_bots. If you do not want to use the $good_bots array, i.e.
// $good_bots = array(), then you must remove the keyword strings 'bot.','bot/','bot-' from the $banned_unknown_bots array or else the good bots will also be banned.
$banned_ip_addresses = array('41.','64.79.100.23','5.254.97.75','148.251.236.167','88.180.102.124','62.210.172.77','45.','195.206.253.146');
$banned_bots = array('.ru','AhrefsBot','crawl','crawler','DotBot','linkdex','majestic','meanpath','PageAnalyzer','robot','rogerbot','semalt','SeznamBot','spider');
$banned_unknown_bots = array('bot ','bot_','bot+','bot:','bot,','bot;','bot\\','bot.','bot/','bot-');
$good_bots = array('Google','MSN','bing','Slurp','Yahoo','DuckDuck');
$banned_redirect_url = 'http://english-1329329990.spampoison.com';
// Visitor's IP address and Browser (User Agent)
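// Fall back to an empty string when a value is missing (some bots send no User-Agent header at all)
$ip_address = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$browser = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';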
// Declared Temporary Variables
$ipfound = $piece = $botfound = $gbotfound = $ubotfound = '';
// Checks for Banned IP Addresses and Bots
if($banned_redirect_url != ''){
    // Checks for Banned IP Address
    if(!empty($banned_ip_addresses)){
        if(in_array($ip_address, $banned_ip_addresses)){$ipfound = 'found';}
        if($ipfound != 'found'){
            $ip_pieces = explode('.', $ip_address);
            foreach ($ip_pieces as $value){
                $piece = $piece.$value.'.';
                if(in_array($piece, $banned_ip_addresses)){$ipfound = 'found'; break;}
            }
        }
        if($ipfound == 'found'){header("location: $banned_redirect_url"); exit();}
    }
    // Checks for Banned Bots
    if(!empty($banned_bots)){
        foreach ($banned_bots as $bbvalue){
            $pos1 = stripos($browser, $bbvalue);
            if($pos1 !== false){$botfound = 'found'; break;}
        }
        if($botfound == 'found'){header("location: $banned_redirect_url"); exit();}
    }
    // Checks for Banned Unknown Bots
    if(!empty($good_bots)){
        foreach ($good_bots as $gbvalue){
            $pos2 = stripos($browser, $gbvalue);
            if($pos2 !== false){$gbotfound = 'found'; break;}
        }
    }
    if($gbotfound != 'found'){
        if(!empty($banned_unknown_bots)){
            foreach ($banned_unknown_bots as $bubvalue){
                $pos3 = stripos($browser, $bubvalue);
                if($pos3 !== false){$ubotfound = 'found'; break;}
            }
            if($ubotfound == 'found'){header("location: $banned_redirect_url"); exit();}
        }
    }
}
// ---------------------------------------------------------------------------------------------------------------
?>
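If you prefer to test the "unknown bot" rule on its own before putting it on a live site, the check can also be written as a small function. This is only a sketch of the same idea (good-bot keywords are exempt, then 'bot' followed by a space or one of `_+:,.;/\-` counts as an unknown robot), assuming PHP 7 or later; the function name `is_unknown_bot()` is not part of the script above:

```php
<?php
// Returns true when the User-Agent looks like an unknown robot:
// no good-bot keyword present, and 'bot' followed by a space or one of _ + : , . ; / \ -
function is_unknown_bot(string $userAgent, array $goodBots): bool
{
    // Exempt anything that contains a good-bot keyword (case-insensitive).
    foreach ($goodBots as $keyword) {
        if ($keyword !== '' && stripos($userAgent, $keyword) !== false) {
            return false;
        }
    }
    // '\\\\' in this single-quoted string becomes \\ in the regex, i.e. a literal backslash.
    return preg_match('~bot[ _+:,.;/\\\\-]~i', $userAgent) === 1;
}
```

As with the array version, a bot that puts a good-bot keyword such as 'Google' into its User-Agent string will pass this check, so it complements IP-based blocking rather than replacing it.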