How can one detect search engine bots using PHP?
Question:
Answer 1:
Here's a Search Engine Directory of Spider names. You can then check $_SERVER['HTTP_USER_AGENT'] to see whether the agent is one of those spiders:
if (strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot")) {
    // what to do
}
Answer 2:
I use the following code, which seems to be working fine:
function _bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}
Update 16-06-2017: added mediapartners (https://support.google.com/webmasters/answer/1061943?hl=en).
Answer 3:
Check $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/All/
Or, more specifically for crawlers:
http://www.useragentstring.com/pages/Crawlerlist/
If you want to, say, log the number of visits of the most common search engine crawlers, you could use:
$interestingCrawlers = array('google', 'yahoo');
// The case-insensitivity modifier belongs inside the pattern; preg_match's
// fourth parameter is a flags integer, not a modifier string.
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i';
$matches = array();
$numMatches = preg_match($pattern, $_SERVER['HTTP_USER_AGENT'], $matches);
if ($numMatches > 0) { // Found a match
    // $matches[1] contains the name of the crawler that matched: 'google' or 'yahoo'
}
Answer 4:
You can check whether the visitor is a search engine with this function:
<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        'Facebook' => 'facebookexternalhit',
    );
    // Join the signatures into one alternation; it is better to cache this
    // string (e.g. in a static variable) than to implode on every call.
    $crawlers_agents = implode('|', $crawlers);
    // Match the signatures against the user agent. Note that testing
    // strpos($crawlers_agents, $USER_AGENT) would almost never succeed,
    // because the full user-agent string is not a substring of the list.
    return preg_match('/' . $crawlers_agents . '/i', $USER_AGENT) === 1;
}
?>
Then you can use it like this:
<?php
$USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if (crawlerDetect($USER_AGENT)) return "no need for language redirection";
?>
Answer 5:
Because any client can set the user agent to whatever they want, looking for 'Googlebot', 'bingbot', etc. is only half the job.
The second part is verifying the client's IP. In the old days this required maintaining IP lists, and all the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google (https://support.google.com/webmasters/answer/80553) and Bing (http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26).
First, perform a reverse DNS lookup of the client IP. For Google this yields a hostname under googlebot.com; for Bing it's under search.msn.com. Then, because anyone can set such a reverse DNS record on their own IP, you need to verify it with a forward DNS lookup on that hostname. If the resulting IP is the same as the site visitor's, you can be sure it's a crawler from that search engine.
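A minimal sketch of that two-step check, assuming the hostname suffixes documented by Google and Bing (the function name is illustrative, not from a library):
<?php
function isVerifiedCrawler($ip)
{
    // Step 1: reverse DNS. gethostbyaddr() returns the input IP unchanged
    // when no PTR record exists.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    // Only trust hostnames under the engines' official domains.
    if (!preg_match('/\.(googlebot\.com|google\.com|search\.msn\.com)$/i', $host)) {
        return false;
    }
    // Step 2: forward-confirm. The hostname must resolve back to the
    // original client IP, otherwise the PTR record was spoofed.
    $ips = gethostbynamel($host);
    return $ips !== false && in_array($ip, $ips, true);
}
?>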
I\'ve written a library in Java that performs these checks for you. Feel free to port it to PHP. It\'s on GitHub: https://github.com/optimaize/webcrawler-verifier
Answer 6:
I\'m using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}
In addition, I use a whitelist of allowed bots; anything detected as a bot above but not matching this list is treated as unwanted and blocked:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}
Anything flagged as an unwanted bot (i.e. a potential false-positive user) can then solve a captcha to unblock itself for 24 hours. Since no one ever solves this captcha, I know the check produces no false positives, so the bot detection seems to work perfectly.
Note: my whitelist is based on Facebook's robots.txt.
Answer 7:
I use this function; part of the regex comes from Prestashop, but I added some more bots to it.
public function isBot()
{
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|Hämähäkki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? false : $_SERVER['HTTP_USER_AGENT'];
    // An empty user agent is treated as a bot as well
    $isBot = !$userAgent || preg_match($bot_regex, $userAgent);
    return $isBot;
}
Anyway, take care: some bots use a browser-like user agent to fake their identity (I get many Russian IPs showing this behaviour on my site).
One distinctive feature of most bots is that they don't carry any cookies, so no session gets attached to them. (I am not sure how, but this is surely the best way to track them.)
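A minimal sketch of that cookie/session heuristic; note it can only flag clients from their second request onward, since a first-time human visitor also arrives without a cookie:
session_start();
// session_name() is usually 'PHPSESSID'; a client that sends this cookie
// back has accepted and returned a cookie, which most bots never do.
if (isset($_COOKIE[session_name()])) {
    $likelyBot = false; // returned our cookie: almost certainly a real browser
} else {
    $likelyBot = true;  // first request, or a cookie-less client (possibly a bot)
}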
Answer 8:
You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client's IP address ($_SERVER['REMOTE_ADDR']) with a list of the IP addresses of search engine bots.
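A minimal sketch of the IP-list variant, assuming you maintain the list yourself; the addresses below are placeholders only, since real crawler IPs change and must come from an up-to-date source:
// Hypothetical example addresses; do not hard-code these in production.
$knownBotIps = array('66.249.66.1', '157.55.39.1');
$isBot = in_array($_SERVER['REMOTE_ADDR'], $knownBotIps, true);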
Answer 9:
<?php // IPCLOAK HOOK
if (CLOAKING_LEVEL != 4) {
    $lastupdated = date("Ymd", filemtime(FILE_BOTS));
    if ($lastupdated != date("Ymd")) {
        $lists = array(
            'http://labs.getyacg.com/spiders/google.txt',
            'http://labs.getyacg.com/spiders/inktomi.txt',
            'http://labs.getyacg.com/spiders/lycos.txt',
            'http://labs.getyacg.com/spiders/msn.txt',
            'http://labs.getyacg.com/spiders/altavista.txt',
            'http://labs.getyacg.com/spiders/askjeeves.txt',
            'http://labs.getyacg.com/spiders/wisenut.txt',
        );
        $opt = '';
        foreach ($lists as $list) {
            $opt .= fetch($list); // fetch() is YACG's own HTTP helper
        }
        $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
        $fp = fopen(FILE_BOTS, "w");
        fwrite($fp, $opt);
        fclose($fp);
    }
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $host = strtolower(gethostbyaddr($ip));
    $file = implode(" ", file(FILE_BOTS));
    $exp = explode(".", $ip);
    $class = $exp[0] . '.' . $exp[1] . '.' . $exp[2] . '.';
    $threshold = CLOAKING_LEVEL;
    $cloak = 0;
    // A host can only ever match one of these, so the conditions must be ORed
    if (stristr($host, "googlebot") || stristr($host, "inktomi") || stristr($host, "msn")) {
        $cloak++;
    }
    if (stristr($file, $class)) {
        $cloak++;
    }
    if (stristr($file, $agent)) {
        $cloak++;
    }
    if (strlen($ref) > 0) {
        $cloak = 0;
    }
    $cloakdirective = ($cloak >= $threshold) ? 1 : 0;
}
?>
That would be the ideal way to cloak for spiders. It's from an open source script called YACG (http://getyacg.com). It needs a bit of work, but it's definitely the way to go.
Answer 10:
Use the Device Detector open source library; it offers an isBot() method: https://github.com/piwik/device-detector
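A short usage sketch, assuming the library is installed via Composer (class and method names follow the project's README):
<?php
require_once 'vendor/autoload.php';

use DeviceDetector\DeviceDetector;

$dd = new DeviceDetector($_SERVER['HTTP_USER_AGENT']);
$dd->parse();
if ($dd->isBot()) {
    // getBot() returns details such as the bot's name, category and producer
    $botInfo = $dd->getBot();
}
?>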
Answer 11:
I'm using this code and it works pretty well. It makes it very easy to see which user agents have visited your site: the code opens a file and appends each user agent to it. You can check this file each day at yourdomain.com/useragent.txt, look for new user agents, and add them to the condition of your if clause.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if (!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)) {
    // Not a known bot: do what you need here, then append the user agent
    // to useragent.txt for later review
    if ($user_agent != "") {
        $myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
        fwrite($myfile, $user_agent . "\n");
        fclose($myfile);
    }
}
This is the content of useragent.txt
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
Answer 12:
function bot_detected() {
    if (preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])) {
        return true;
    } else {
        return false;
    }
}