Exclude bots and spiders from a View counter in PHP

Posted 2019-03-27 19:17

Question:

I have built a pretty basic advertisement manager for a website in PHP.

I say basic because it's nowhere near as complex as Google or Facebook ads, or even most high-end ad servers. It doesn't handle payments or even user targeting.

It serves its purpose for my low-traffic site, though: it simply shows a random banner ad and counts impressions and clicks.

Features:

  • Ad slot/position on page
  • Banner image
  • Name
  • View/impression counter
  • Click counter
  • Start and end date, or never ending
  • Disable/enable ad

I want to gradually add more functionality to the system, though.

One thing I have noticed is that the impression/view counter often seems inflated.

I believe the cause is social networks' spiders and bots, as well as search engine spiders.

For example, if someone enters the URL of a page on my website into Facebook, Google+, Twitter, LinkedIn, Pinterest, or another network, that site will often spider my page to gather its title, images, and description.

I would really like to stop these requests from counting as advertisement impressions when no actual human is viewing the page.

I realize it will be very hard to detect all of them, but if there is a way to catch the majority, it will at least make my stats a little more accurate.

So I am reaching out for any help or ideas on how to achieve this. Please do not suggest using another advertisement system; that is not in the cards. Thank you.

Answer 1:

You need to serve the ads with JavaScript. That's the only way to avoid most crawlers: only browsers load dependencies like images, JS, and CSS, and nearly all robots skip them.

You can also do this:

// Basic crawler detection and block script (no legitimate browser should match this pattern)
if (!empty($_SERVER['HTTP_USER_AGENT']) && preg_match('~(bot|crawl)~i', $_SERVER['HTTP_USER_AGENT'])) {
    // This is a crawler, so do not show ads or count an impression here
}

You'll have much better stats this way. Use JS for ads.
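
To make the "use JS for ads" idea a bit more concrete, here is a minimal sketch of one way it could look. The helper name render_ad_slot() and the /ad.php endpoint are my own assumptions, not part of the original system. The page only gets an empty placeholder, and the banner (and with it the impression count) is requested by a script, so clients that never run JavaScript never hit the counting endpoint.

```php
<?php
// Hypothetical helper: emits an empty placeholder plus a script that fetches
// the banner HTML from a separate endpoint. The impression would be counted
// inside that endpoint (assumed here to be /ad.php), not on the page itself.
function render_ad_slot(string $slot, string $endpoint = '/ad.php'): string
{
    // Keep only characters that are safe inside an element id.
    $id  = 'ad-' . preg_replace('/[^a-z0-9_-]/i', '', $slot);
    $url = $endpoint . '?slot=' . rawurlencode($slot);

    // json_encode() yields quoted JavaScript string literals; the HEX flags
    // keep characters such as < and & from breaking out of the script tag.
    $flags = JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT;

    return sprintf(
        "<div id=\"%s\"></div>\n<script>\nfetch(%s)\n  .then(function (r) { return r.text(); })\n  .then(function (html) { document.getElementById(%s).innerHTML = html; });\n</script>\n",
        htmlspecialchars($id, ENT_QUOTES),
        json_encode($url, $flags),
        json_encode($id, $flags)
    );
}

// Usage in a template: echo render_ad_slot('sidebar');
```

The endpoint itself would pick the banner, increment the view counter, and return the banner markup. Some crawlers do execute JavaScript, so combining this with the user agent check above is probably still worthwhile.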

PS: You could also try setting a cookie in JS and checking for it later. Crawlers may accept cookies that PHP sends in the HTTP response, but they will almost certainly miss cookies set in JS, because that requires loading a JS file and interpreting it, which is normally done only by browsers.
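
A rough sketch of the cookie check, assuming a cookie name of js_ok (hypothetical, pick anything unused on your site). The function names are made up for illustration too.

```php
<?php
// Hypothetical cookie name; any name not used elsewhere on the site works.
const HUMAN_COOKIE = 'js_ok';

// Script tag to include in every page. Only clients that execute JavaScript
// will send the cookie back on later requests.
function human_cookie_script(): string
{
    return '<script>document.cookie = "' . HUMAN_COOKIE . '=1; path=/; max-age=86400";</script>';
}

// True when the request carries the cookie that the script above sets.
function has_human_cookie(array $cookies): bool
{
    return isset($cookies[HUMAN_COOKIE]) && $cookies[HUMAN_COOKIE] === '1';
}

// Usage: if (has_human_cookie($_COOKIE)) { /* count the impression */ }
```

One trade-off to keep in mind: a real visitor's very first page view arrives before the cookie exists, so that view would not be counted with this check alone.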



Answer 2:

You could do something like this: There is a good list of crawlers in text format here: http://www.robotstxt.org/db/all.txt

Assume you've collected all of the user agents in that file, lowercased, in an array called $botList:

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : null;

// in_array() only matches when the whole user agent string equals a list entry
if ($ua && in_array($ua, $botList, true)) {
  // this is probably a bot
}
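
Since real user agent strings usually carry version numbers and extra tokens, an exact match can miss a lot. A possible variant is to match on fragments instead. The sketch below assumes you keep a local file with one user agent fragment per line (that file layout is my assumption, not the format of the list linked above), and both function names are hypothetical.

```php
<?php
// Normalizes raw lines (for example from file('bots.txt'), a hypothetical
// local file with one user agent fragment per line) into a lowercase list.
// Blank lines and lines starting with # are skipped.
function build_bot_list(array $lines): array
{
    $list = [];
    foreach ($lines as $line) {
        $token = strtolower(trim($line));
        if ($token !== '' && $token[0] !== '#') {
            $list[] = $token;
        }
    }
    return array_values(array_unique($list));
}

// True when the user agent is missing or contains any listed fragment.
function matches_bot_list(?string $userAgent, array $botList): bool
{
    $ua = strtolower((string) $userAgent);
    if ($ua === '') {
        return true; // assumption: treat a missing user agent as non-human
    }
    foreach ($botList as $token) {
        if (strpos($ua, $token) !== false) {
            return true;
        }
    }
    return false;
}

// Usage: if (!matches_bot_list($_SERVER['HTTP_USER_AGENT'] ?? null, $botList)) { /* count */ }
```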

Of course, the user agent can easily be changed or may sometimes be missing, but search engines like Google and Yahoo identify themselves honestly.



Answer 3:

A crawler will usually download robots.txt, even if it doesn't respect it and only fetches it out of curiosity. That is a good indication you might be dealing with one, although it's not definitive.

You can also detect a crawler if it visits a huge number of links in a very short time. This can be quite complicated to do in code, though.
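
For what it's worth, the core of the rate check can be fairly small. This is only a sketch: the function name, the thresholds (20 hits in 10 seconds), and keeping the hits in a plain array are all assumptions. In practice the timestamps would have to live in a database or cache shared between requests.

```php
<?php
// Records a hit for $ip at time $now and reports whether that IP has made
// more than $maxHits requests within the last $windowSeconds seconds.
// $hits maps IP => list of timestamps and is updated in place.
function looks_like_crawler(array &$hits, string $ip, int $now, int $maxHits = 20, int $windowSeconds = 10): bool
{
    $cutoff = $now - $windowSeconds;

    // Keep only the timestamps that still fall inside the window.
    $recent = [];
    foreach ($hits[$ip] ?? [] as $t) {
        if ($t > $cutoff) {
            $recent[] = $t;
        }
    }

    $recent[]  = $now;
    $hits[$ip] = $recent;

    return count($recent) > $maxHits;
}

// Usage: $isCrawler = looks_like_crawler($hits, $_SERVER['REMOTE_ADDR'], time());
```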

But that's only worthwhile if you don't want to or can't run JavaScript. Otherwise, go with CodeAngry's answer.


Edit: In response to @keune's answer, you could keep all the visitor IPs and run them through the list in a cron job, then publish the updated visitor count.
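
One way to read that: log every raw view with its IP and user agent, and let the cron job do the filtering before it publishes the count. A minimal sketch follows, where the row shape, the function name, and the optional list of flagged IPs (for example, ones seen requesting robots.txt) are all assumptions of mine.

```php
<?php
// $views: raw log rows shaped like ['ip' => '...', 'ua' => '...'] (hypothetical).
// $botFragments: lowercase user agent fragments that mark a bot.
// $flaggedIps: IPs to exclude outright, e.g. ones seen fetching robots.txt.
function count_human_views(array $views, array $botFragments, array $flaggedIps = []): int
{
    $count = 0;
    foreach ($views as $view) {
        if (in_array($view['ip'], $flaggedIps, true)) {
            continue;
        }

        $ua    = strtolower($view['ua'] ?? '');
        $isBot = ($ua === ''); // assumption: an empty user agent is not a human
        foreach ($botFragments as $fragment) {
            if ($fragment !== '' && strpos($ua, $fragment) !== false) {
                $isBot = true;
                break;
            }
        }

        if (!$isBot) {
            $count++;
        }
    }
    return $count;
}
```

The cron script would load the logged rows for each ad, call this, and write the result to the published counter.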



Answer 4:

Try this:

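// Fall back to an empty string when no user agent header was sent
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (preg_match("/^(Mozilla|Opera|PSP|Bunjalloo|wii)/i", $ua) && !preg_match("/bot|crawl|crawler|slurp|spider|link|checker|script|robot|discovery|preview/i", $ua)) {
    // It's not a bot
} else {
    // It's a bot (or the user agent is missing)
}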