How can one detect search engine bots using PHP?
Question:
Answer 1:
Here's a Search Engine Directory of Spider names. You can then check $_SERVER['HTTP_USER_AGENT'] to see whether the agent is one of those spiders:
if (strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot")) {
    // what to do
}
Answer 2:
I use the following code, which seems to be working fine:
function _bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}
Update 16-06-2017: added mediapartners (https://support.google.com/webmasters/answer/1061943?hl=en).
Answer 3:
Check $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/All/
Or, more specifically for crawlers:
http://www.useragentstring.com/pages/Crawlerlist/
If you want to, say, log the number of visits of the most common search engine crawlers, you could use:
$interestingCrawlers = array('google', 'yahoo');
// The case-insensitivity modifier belongs inside the pattern; preg_match's
// fourth parameter is a flags integer, not a modifier string.
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i';
$matches = array();
$numMatches = preg_match($pattern, $_SERVER['HTTP_USER_AGENT'], $matches);
if ($numMatches > 0) { // Found a match
    // $matches[1] contains the name of the crawler that matched: 'google' or 'yahoo'
}
Answer 4:
You can check whether the visitor is a search engine with this function:
<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        'Facebook' => 'facebookexternalhit',
    );
    // Join the signatures into one alternation; it is better to cache this
    // string (e.g. in a static variable) than to implode on every call.
    $crawlers_agents = implode('|', $crawlers);
    // Match the signatures against the user agent. Note that testing
    // strpos($crawlers_agents, $USER_AGENT) would almost never succeed,
    // because the full user-agent string is not a substring of the list.
    return preg_match('/' . $crawlers_agents . '/i', $USER_AGENT) === 1;
}
?>
Then you can use it like this:
<?php
$USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if (crawlerDetect($USER_AGENT)) return "no need for language redirection";
?>
Answer 5:
Because any client can set the user agent to whatever they want, looking for 'Googlebot', 'bingbot', etc. is only half the job.
The second part is verifying the client's IP. In the old days this required maintaining IP lists, and all the lists you find online are outdated. The top search engines officially support verification through DNS, as explained by Google (https://support.google.com/webmasters/answer/80553) and Bing (http://www.bing.com/webmaster/help/how-to-verify-bingbot-3905dc26).
First, perform a reverse DNS lookup of the client IP. For Google this yields a hostname under googlebot.com; for Bing it's under search.msn.com. Then, because anyone can set such a reverse DNS record on their own IP, you need to verify it with a forward DNS lookup on that hostname. If the resulting IP is the same as the site visitor's, you can be sure it's a crawler from that search engine.
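A minimal sketch of that two-step check, assuming the hostname suffixes documented by Google and Bing (the function name is illustrative, not from a library):
<?php
function isVerifiedCrawler($ip)
{
    // Step 1: reverse DNS. gethostbyaddr() returns the input IP unchanged
    // when no PTR record exists.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    // Only trust hostnames under the engines' official domains.
    if (!preg_match('/\.(googlebot\.com|google\.com|search\.msn\.com)$/i', $host)) {
        return false;
    }
    // Step 2: forward-confirm. The hostname must resolve back to the
    // original client IP, otherwise the PTR record was spoofed.
    $ips = gethostbynamel($host);
    return $ips !== false && in_array($ip, $ips, true);
}
?>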
I\'ve written a library in Java that performs these checks for you. Feel free to port it to PHP. It\'s on GitHub: https://github.com/optimaize/webcrawler-verifier
Answer 6:
I\'m using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}
In addition, I use a whitelist of allowed bots; anything detected as a bot above but not matching this list is treated as unwanted and blocked:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}
Anything flagged as an unwanted bot (i.e. a potential false-positive user) can then solve a captcha to unblock itself for 24 hours. Since no one ever solves this captcha, I know the check produces no false positives, so the bot detection seems to work perfectly.
Note: my whitelist is based on Facebook's robots.txt.
Answer 7:
I use this function; part of the regex comes from Prestashop, but I added some more bots to it.
public function isBot()
{
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|Hämähäkki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|SeznamBot|mod_pagespeed|CCBot|openstat.ru\/Bot|m2e/i';
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? false : $_SERVER['HTTP_USER_AGENT'];
    // An empty user agent is treated as a bot as well
    $isBot = !$userAgent || preg_match($bot_regex, $userAgent);
    return $isBot;
}
Anyway, take care: some bots use a browser-like user agent to fake their identity (I get many Russian IPs showing this behaviour on my site).
One distinctive feature of most bots is that they don't carry any cookies, so no session gets attached to them. (I am not sure how, but this is surely the best way to track them.)
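A minimal sketch of that cookie/session heuristic; note it can only flag clients from their second request onward, since a first-time human visitor also arrives without a cookie:
session_start();
// session_name() is usually 'PHPSESSID'; a client that sends this cookie
// back has accepted and returned a cookie, which most bots never do.
if (isset($_COOKIE[session_name()])) {
    $likelyBot = false; // returned our cookie: almost certainly a real browser
} else {
    $likelyBot = true;  // first request, or a cookie-less client (possibly a bot)
}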
Answer 8:
You could analyse the user agent ($_SERVER['HTTP_USER_AGENT']) or compare the client's IP address ($_SERVER['REMOTE_ADDR']) with a list of the IP addresses of search engine bots.
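A minimal sketch of the IP-list variant, assuming you maintain the list yourself; the addresses below are placeholders only, since real crawler IPs change and must come from an up-to-date source:
// Hypothetical example addresses; do not hard-code these in production.
$knownBotIps = array('66.249.66.1', '157.55.39.1');
$isBot = in_array($_SERVER['REMOTE_ADDR'], $knownBotIps, true);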
Answer 9:
<?php // IPCLOAK HOOK
if (CLOAKING_LEVEL != 4) {
    $lastupdated = date("Ymd", filemtime(FILE_BOTS));
    if ($lastupdated != date("Ymd")) {
        $lists = array(
            'http://labs.getyacg.com/spiders/google.txt',
            'http://labs.getyacg.com/spiders/inktomi.txt',
            'http://labs.getyacg.com/spiders/lycos.txt',
            'http://labs.getyacg.com/spiders/msn.txt',
            'http://labs.getyacg.com/spiders/altavista.txt',
            'http://labs.getyacg.com/spiders/askjeeves.txt',
            'http://labs.getyacg.com/spiders/wisenut.txt',
        );
        $opt = '';
        foreach ($lists as $list) {
            $opt .= fetch($list); // fetch() is YACG's own HTTP helper
        }
        $opt = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $opt);
        $fp = fopen(FILE_BOTS, "w");
        fwrite($fp, $opt);
        fclose($fp);
    }
    $ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
    $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $host = strtolower(gethostbyaddr($ip));
    $file = implode(" ", file(FILE_BOTS));
    $exp = explode(".", $ip);
    $class = $exp[0] . '.' . $exp[1] . '.' . $exp[2] . '.';
    $threshold = CLOAKING_LEVEL;
    $cloak = 0;
    // A host can only ever match one of these, so the conditions must be ORed
    if (stristr($host, "googlebot") || stristr($host, "inktomi") || stristr($host, "msn")) {
        $cloak++;
    }
    if (stristr($file, $class)) {
        $cloak++;
    }
    if (stristr($file, $agent)) {
        $cloak++;
    }
    if (strlen($ref) > 0) {
        $cloak = 0;
    }
    $cloakdirective = ($cloak >= $threshold) ? 1 : 0;
}
?>
That would be the ideal way to cloak for spiders. It's from an open source script called YACG (http://getyacg.com). It needs a bit of work, but it's definitely the way to go.
Answer 10:
Use the Device Detector open source library; it offers an isBot() method: https://github.com/piwik/device-detector
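A short usage sketch, assuming the library is installed via Composer (class and method names follow the project's README):
<?php
require_once 'vendor/autoload.php';

use DeviceDetector\DeviceDetector;

$dd = new DeviceDetector($_SERVER['HTTP_USER_AGENT']);
$dd->parse();
if ($dd->isBot()) {
    // getBot() returns details such as the bot's name, category and producer
    $botInfo = $dd->getBot();
}
?>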
Answer 11:
I'm using this code and it works pretty well. It makes it very easy to see which user agents have visited your site: the code opens a file and appends each user agent to it. You can check this file each day at yourdomain.com/useragent.txt, look for new user agents, and add them to the condition of your if clause.
$user_agent = strtolower($_SERVER['HTTP_USER_AGENT']);
if (!preg_match("/Googlebot|MJ12bot|yandexbot/i", $user_agent)) {
    // Not a known bot: do what you need here, then append the user agent
    // to useragent.txt for later review
    if ($user_agent != "") {
        $myfile = fopen("useragent.txt", "a") or die("Unable to open file useragent.txt!");
        fwrite($myfile, $user_agent . "\n");
        fclose($myfile);
    }
}
This is the content of useragent.txt
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; MJ12bot/v1.4.6; http://mj12bot.com/)
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (compatible; yandexbot/3.0; +http://yandex.com/bots)
mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; linkdexbot/2.2; +http://www.linkdex.com/bots/)
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64; rv:49.0) gecko/20100101 firefox/49.0
mozilla/5.0 (windows nt 6.1; wow64; rv:33.0) gecko/20100101 firefox/33.0
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (windows nt 6.1; wow64) applewebkit/537.36 (khtml, like gecko) chrome/53.0.2785.143 safari/537.36
mozilla/5.0 (compatible; baiduspider/2.0; +http://www.baidu.com/search/spider.html)
zoombot (linkbot 1.0 http://suite.seozoom.it/bot.html)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
mozilla/5.0 (windows nt 10.0; wow64) applewebkit/537.36 (khtml, like gecko) chrome/44.0.2403.155 safari/537.36 opr/31.0.1889.174
Answer 12:
function bot_detected() {
    if (preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])) {
        return true;
    } else {
        return false;
    }
}