Detecting honest web crawlers

Published 2019-05-13 09:00

I'd like to detect (on the server side) which requests come from bots. I don't care about malicious bots at this point, just the ones that play nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like "bot", but that seems awkward, incomplete, and hard to maintain. So does anyone have a more solid approach? Failing that, do you have any resources you use to keep up to date with all the friendly user agents?

In case you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However, if a web crawler is detected, we always give it the same version so that the index stays consistent.

Also, I'm using Java, but I imagine the approach would be similar for any server-side technology.

Answer 1:

You can find a very thorough database of known "good" web crawlers in the robotstxt.org Robots Database. Using that data would be far more effective than just matching "bot" in the user agent.
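If you go this route, a minimal sketch might look like the following, assuming you have exported the Robots Database entries to a plain-text file with one lowercase user-agent substring per line (the file name and format here are assumptions, not something robotstxt.org provides directly):

using System.IO;
using System.Linq;

public static class KnownCrawlers
{
    // One lowercase user-agent substring per line, exported from the Robots Database
    // (file name and format are assumptions for this sketch).
    private static readonly string[] Signatures =
        File.ReadAllLines("robots-database-agents.txt")
            .Select(line => line.Trim().ToLowerInvariant())
            .Where(line => line.Length > 0)
            .ToArray();

    public static bool IsKnownCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLowerInvariant();
        return Signatures.Any(sig => ua.Contains(sig));
    }
}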



Answer 2:

You said matching the user agent against "bot" may be awkward, but we've found it to be a pretty good match. Our studies have shown that it covers about 98% of the hits you receive, and we haven't come across any false-positive matches with it either. If you want to raise that to 99.9%, you can include a few other well-known matches such as "crawler", "baiduspider", "ia_archiver", "curl" and so on. We have tested this on our production systems over millions of hits.

Here are a few C# solutions for you:

1) Simplest

Fastest when processing a miss, i.e. traffic from a non-bot (a normal user). Catches 99+% of crawlers.

bool iscrawler = Regex.IsMatch(Request.UserAgent, @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);

2) Medium

Fastest when processing a hit, i.e. traffic from a bot, and still quite fast on misses. Catches close to 100% of crawlers. Matches "bot", "crawler" and "spider" up front; you can add any other known crawlers to the list.

List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",            
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};
string ua = Request.UserAgent.ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));

3) Paranoid

Pretty fast, but a little slower than options 1 and 2. It is the most accurate, and it lets you maintain the lists if you want to. You can keep a separate list of names that contain "bot" in them if you are worried about future false positives. If we get a short match, we log it and check it for a false positive.

// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

string ua = Request.UserAgent.ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;

Notes:

  • It is tempting to just keep adding names to the regex in option 1, but if you do, it gets slower and slower. If you want a more complete list, LINQ with a lambda is faster.
  • Make sure .ToLower() is outside of your LINQ method; the method is a loop, and you would be re-lowercasing the string on every iteration.
  • Always put the heaviest-hitting bots at the start of the lists, so they match sooner.
  • Put the lists into a static class so that they are not rebuilt on every page view (see the sketch after this list).
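A minimal sketch of that last point, with the lists trimmed for brevity: keeping them in a static class means they are built once per application rather than on every request.

using System.Collections.Generic;
using System.Linq;

public static class CrawlerLists
{
    // Built once when the class is first loaded, then shared by every request.
    public static readonly List<string> Crawlers1 = new List<string>()
    {
        "googlebot","bingbot","yandexbot","msnbot","robot","bot"          // trimmed for brevity
    };

    public static readonly List<string> Crawlers2 = new List<string>()
    {
        "baiduspider","yahoo! slurp","ia_archiver","crawler","spider","curl","wget"   // trimmed for brevity
    };

    public static bool IsCrawler(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent)) return false;
        string ua = userAgent.ToLower();                  // lowercase once, outside the LINQ loop
        var list = ua.Contains("bot") ? Crawlers1 : Crawlers2;
        return list.Any(x => ua.Contains(x));
    }
}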

Honeypot

The only real alternative to this is to create "honeypot" links on your site that only a bot would reach. You then log the user agent strings that hit the honeypot page to a database, and use those logged strings to classify crawlers.

Positives: it will match some unknown crawlers that don't announce themselves.

Negatives: not all crawlers dig deep enough to hit every link on your site, so they may never reach your honeypot.



Answer 3:

One suggestion is to create an empty anchor on your page that only a bot would follow. Regular users won't see the link, so spiders and bots are left to follow it. For example, an empty anchor tag pointing to a subfolder would record a GET request in your logs...

<a href="dontfollowme.aspx"></a>

Many people use this method while running a honeypot to catch malicious bots that don't respect the robots.txt file. I use the empty-anchor method in an ASP.NET honeypot solution I wrote to trap and block those creepy crawlers...
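For illustration only (this is not the author's actual solution): in classic ASP.NET, the honeypot endpoint could be a generic handler registered for the URL your hidden anchor points to, which simply records whoever requests it. The Log call here is a placeholder.

using System.Web;

public class HoneypotHandler : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // Anything that requests this URL followed a link normal users never see.
        string ua = context.Request.UserAgent ?? "(none)";
        string ip = context.Request.UserHostAddress;
        Log("Honeypot hit", ip, ua);                // placeholder: write to your database
        context.Response.StatusCode = 404;          // give the crawler nothing useful back
    }

    public bool IsReusable { get { return true; } }

    private static void Log(string message, string ip, string ua)
    {
        // e.g. insert (timestamp, ip, ua) into a crawler-candidates table
    }
}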



Answer 4:

Any visitor whose entry page is /robots.txt is probably a bot.
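A minimal sketch of that heuristic, assuming you call Observe for every incoming request; the in-memory dictionary is an assumption (a real site would persist this somewhere):

using System;
using System.Collections.Concurrent;

public static class RobotsTxtHeuristic
{
    // IPs that have requested /robots.txt; in-memory only, so it resets on restart.
    private static readonly ConcurrentDictionary<string, DateTime> FetchedRobotsTxt =
        new ConcurrentDictionary<string, DateTime>();

    public static void Observe(string clientIp, string path)
    {
        if (string.Equals(path, "/robots.txt", StringComparison.OrdinalIgnoreCase))
            FetchedRobotsTxt[clientIp] = DateTime.UtcNow;
    }

    public static bool LooksLikeBot(string clientIp)
    {
        // A client that asked for /robots.txt at some point is very likely a bot.
        return FetchedRobotsTxt.ContainsKey(clientIp);
    }
}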



Answer 5:

Something quick and dirty like this might be a good start:

return if request.user_agent =~ /googlebot|msnbot|baidu|curl|wget|Mediapartners-Google|slurp|ia_archiver|Gigabot|libwww-perl|lwp-trivial/i

Note: this is Rails code, but the regex is generally applicable.



Answer 6:

I'm pretty sure a large proportion of bots don't use robots.txt, although that was my first thought.

It seems to me that the best way to detect a bot is by looking at the time between requests: if the time between requests is consistently fast, then it's a bot.
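A rough sketch of that idea: track recent request times per client and flag a client whose gaps between requests are consistently short. The window size and threshold below are arbitrary assumptions.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class RequestTimingHeuristic
{
    private static readonly ConcurrentDictionary<string, Queue<DateTime>> History =
        new ConcurrentDictionary<string, Queue<DateTime>>();

    private const int Window = 10;                                       // look at the last 10 requests
    private static readonly TimeSpan MaxGap = TimeSpan.FromSeconds(2);   // "consistently fast" threshold

    // Call once per request; returns true when the client looks like a bot.
    public static bool RecordAndCheck(string clientIp)
    {
        var timestamps = History.GetOrAdd(clientIp, _ => new Queue<DateTime>());
        lock (timestamps)
        {
            timestamps.Enqueue(DateTime.UtcNow);
            while (timestamps.Count > Window) timestamps.Dequeue();
            if (timestamps.Count < Window) return false;      // not enough data yet

            var times = timestamps.ToArray();                 // oldest first
            for (int i = 1; i < times.Length; i++)
                if (times[i] - times[i - 1] > MaxGap) return false;
            return true;                                      // every gap was short: probably a bot
        }
    }
}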



Source: Detecting honest web crawlers