其他一些网站使用卷曲和假HTTP引用复制我的网站内容。我们有没有办法来检测卷曲或不是真正的网络浏览器？

Answer 1:

There is no magic solution to avoid automatic crawling. Everyting a human can do, a robot can do it too. There are only solutions to make the job harder, so hard that only strong skilled geeks may try to pass them.

I was in trouble too some years ago and my first advice is, if you have time, be a crawler yourself (I assume a "crawler" is the guy who crawls your website), this is the best school for the subject. By crawling several websites, I learned different kind of protections, and by associating them I’ve been efficient.

I give you some examples of protections you may try.

Sessions per IP

If a user uses 50 new sessions each minute, you can think this user could be a crawler who does not handle cookies. Of course, curl manages cookies perfectly, but if you couple it with a visit counter per session (explained later), or if your crawler is a noobie with cookie matters, it may be efficient.

It is difficult to imagine that 50 people of the same shared connection will get simultaneousely on your website (it of course depends on your traffic, that is up to you). And if this happens you can lock pages of your website until a captcha is filled.

Idea :

1) you create 2 tables : 1 to save banned ips and 1 to save ip and sessions

create table if not exists sessions_per_ip (
  ip int unsigned,
  session_id varchar(32),
  creation timestamp default current_timestamp,
  primary key(ip, session_id)
);

create table if not exists banned_ips (
  ip int unsigned,
  creation timestamp default current_timestamp,
  primary key(ip)
);

2) at the beginning of your script, you delete entries too old from both tables

3) next you check if ip of your user is banned or not (you set a flag to true)

4) if not, you count how much he has sessions for his ip

5) if he has too much sessions, you insert it in your banned table and set a flag

6) you insert his ip on the sessions per ip table if it has not been already inserted

I wrote a code sample to show in a better way my idea.

<?php

try
{

    // Some configuration (small values for demo)
    $max_sessions = 5; // 5 sessions/ip simultaneousely allowed
    $check_duration = 30; // 30 secs max lifetime of an ip on the sessions_per_ip table
    $lock_duration = 60; // time to lock your website for this ip if max_sessions is reached

    // Mysql connection
    require_once("config.php");
    $dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Delete old entries in tables
    $query = "delete from sessions_per_ip where timestampdiff(second, creation, now()) > {$check_duration}";
    $dbh->exec($query);

    $query = "delete from banned_ips where timestampdiff(second, creation, now()) > {$lock_duration}";
    $dbh->exec($query);

    // Get useful info attached to our user...
    session_start();
    $ip = ip2long($_SERVER['REMOTE_ADDR']);
    $session_id = session_id();

    // Check if IP is already banned
    $banned = false;
    $count = $dbh->query("select count(*) from banned_ips where ip = '{$ip}'")->fetchColumn();
    if ($count > 0)
    {
        $banned = true;
    }
    else
    {
        // Count entries in our db for this ip
        $query = "select count(*)  from sessions_per_ip where ip = '{$ip}'";
        $count = $dbh->query($query)->fetchColumn();
        if ($count >= $max_sessions)
        {
            // Lock website for this ip
            $query = "insert ignore into banned_ips ( ip ) values ( '{$ip}' )";
            $dbh->exec($query);
            $banned = true;
        }

        // Insert a new entry on our db if user's session is not already recorded
        $query = "insert ignore into sessions_per_ip ( ip, session_id ) values ('{$ip}', '{$session_id}')";
        $dbh->exec($query);
    }

    // At this point you have a $banned if your user is banned or not.
    // The following code will allow us to test it...

    // We do not display anything now because we'll play with sessions :
    // to make the demo more readable I prefer going step by step like
    // this.
    ob_start();

    // Displays your current sessions
    echo "Your current sessions keys are : <br/>";
    $query = "select session_id from sessions_per_ip where ip = '{$ip}'";
    foreach ($dbh->query($query) as $row) {
        echo "{$row['session_id']}<br/>";
    }

    // Display and handle a way to create new sessions
    echo str_repeat('<br/>', 2);
    echo '<a href="' . basename(__FILE__) . '?new=1">Create a new session / reload</a>';
    if (isset($_GET['new']))
    {
        session_regenerate_id();
        session_destroy();
        header("Location: " . basename(__FILE__));
        die();
    }

    // Display if you're banned or not
    echo str_repeat('<br/>', 2);
    if ($banned)
    {
        echo '<span style="color:red;">You are banned: wait 60secs to be unbanned... a captcha must be more friendly of course!</span>';
        echo '<br/>';
        echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
    }
    else
    {
        echo '<span style="color:blue;">You are not banned!</span>';
        echo '<br/>';
        echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
    }
    ob_end_flush();
}
catch (PDOException $e)
{
    /*echo*/ $e->getMessage();
}

?>

Visit Counter

If your user uses the same cookie to crawl your pages, you’ll be able to use his session to block it. This idea is quite simple: is it possible that your user visits 60 pages in 60 seconds?

Idea :

Create an array in the user session, it will contains visit time()s.
Remove visits older than X seconds in this array
Add a new entry for the actual visit
Count entries in this array
Ban your user if he visited Y pages

Sample code :

<?php

$visit_counter_pages = 5; // maximum number of pages to load
$visit_counter_secs = 10; // maximum amount of time before cleaning visits

session_start();

// initialize an array for our visit counter
if (array_key_exists('visit_counter', $_SESSION) == false)
{
    $_SESSION['visit_counter'] = array();
}

// clean old visits
foreach ($_SESSION['visit_counter'] as $key => $time)
{
    if ((time() - $time) > $visit_counter_secs) {
        unset($_SESSION['visit_counter'][$key]);
    }
}

// we add the current visit into our array
$_SESSION['visit_counter'][] = time();

// check if user has reached limit of visited pages
$banned = false;
if (count($_SESSION['visit_counter']) > $visit_counter_pages)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
$count = count($_SESSION['visit_counter']);
echo "You visited {$count} pages.";
echo str_repeat('<br/>', 2);

echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned! Wait for a short while (10 secs in this demo)...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>

An image to download

When a crawler need to do his dirty work, that’s for a large amount of data, and in a shortest possible time. That’s why they don’t download images on pages ; it takes too much bandwith and makes the crawling slower.

This idea (I think the most elegent and the most easy to implement) uses the mod_rewrite to hide code in a .jpg/.png/… an image file. This image should be available on each page you want to protect : it could be your logo website, but you’ll choose a small-sized image (because this image must not be cached).

Idea :

1/ Add those lines to your .htaccess

RewriteEngine On
RewriteBase /tests/anticrawl/
RewriteRule ^logo\.jpg$ logo.php

2/ Create your logo.php with the security

<?php

// start session and reset counter
session_start();
$_SESSION['no_logo_count'] = 0;

// forces image to reload next time
header("Cache-Control: no-store, no-cache, must-revalidate");

// displays image
header("Content-type: image/jpg");
readfile("logo.jpg");
die();

3/ Increment your no_logo_count on each page you need to add security, and check if it reached your limit.

Sample code :

<?php

$no_logo_limit = 5; // number of allowd pages without logo

// start session and initialize
session_start();
if (array_key_exists('no_logo_count', $_SESSION) == false)
{
    $_SESSION['no_logo_count'] = 0;
}
else
{
    $_SESSION['no_logo_count']++;
}

// check if user has reached limit of "undownloaded image"
$banned = false;
if ($_SESSION['no_logo_count'] >= $no_logo_limit)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "You did not loaded image {$_SESSION['no_logo_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display "show image" link : note that we're using .jpg file
echo <<< EOT

<div id="image_container">
    <a id="image_load" href="#">Load image</a>
</div>
<br/>

<script type="text/javascript">

  // On your implementation, you'llO of course use <img src="logo.jpg" />
  $('#image_load').click(function(e) {
    e.preventDefault();
    $('#image_load').html('<img src="logo.jpg" />');
  });

</script>

EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "load image" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>

Cookie check

You can create cookies in the javascript side to check if your users does interpret javascript (a crawler using Curl does not, for example).

The idea is quite simple : this is about the same as an image check.

Set a $_SESSION value to 1 and increment it in each visits
if a cookie (set in JavaScript) does exist, set session value to 0
if this value reached a limit, ban your user

Code :

<?php

$no_cookie_limit = 5; // number of allowd pages without cookie set check

// Start session and reset counter
session_start();

if (array_key_exists('cookie_check_count', $_SESSION) == false)
{
    $_SESSION['cookie_check_count'] = 0;
}

// Initializes cookie (note: rename it to a more discrete name of course) or check cookie value
if ((array_key_exists('cookie_check', $_COOKIE) == false) || ($_COOKIE['cookie_check'] != 42))
{
    // Cookie does not exist or is incorrect...
    $_SESSION['cookie_check_count']++;
}
else
{
    // Cookie is properly set so we reset counter
    $_SESSION['cookie_check_count'] = 0;
}

// Check if user has reached limit of "cookie check"
$banned = false;
if ($_SESSION['cookie_check_count'] >= $no_cookie_limit)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "Cookie check failed {$_SESSION['cookie_check_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<br/>
<a id="reload" href="#">Reload</a>
<br/>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

// Display "set cookie" link
echo <<< EOT

<br/>
<a id="cookie_link" href="#">Set cookie</a>
<br/>

<script type="text/javascript">

  // On your implementation, you'll of course put the cookie set on a $(document).ready()
  $('#cookie_link').click(function(e) {
    e.preventDefault();
    var expires = new Date();
    expires.setTime(new Date().getTime() + 3600000);
    document.cookie="cookie_check=42;expires=" + expires.toGMTString();
  });

</script>
EOT;


// Display "unset cookie" link
echo <<< EOT

<br/>
<a id="unset_cookie" href="#">Unset cookie</a>
<br/>

<script type="text/javascript">

  // On your implementation, you'll of course put the cookie set on a $(document).ready()
  $('#unset_cookie').click(function(e) {
    e.preventDefault();
    document.cookie="cookie_check=;expires=Thu, 01 Jan 1970 00:00:01 GMT";
  });

</script>
EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "Set cookie" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}

Protection against proxies

Some words about the different kind of proxies we may find over the web :

A “normal” proxy displays information about user connection (notably, his IP)
An anonymous proxy does not display IP, but gives information about proxy usage on header.
A high-anonyous proxy do not display user IP, and do not display any information that a browser may not send.

It is easy to find a proxy to connect any website, but it is very hard to find high-anonymous proxies.

Some $_SERVER variables may contain keys specifically if your users is behind a proxy (exhaustive list took from this question):

CLIENT_IP
FORWARDED
FORWARDED_FOR
FORWARDED_FOR_IP
HTTP_CLIENT_IP
HTTP_FORWARDED
HTTP_FORWARDED_FOR
HTTP_FORWARDED_FOR_IP
HTTP_PC_REMOTE_ADDR
HTTP_PROXY_CONNECTION'
HTTP_VIA
HTTP_X_FORWARDED
HTTP_X_FORWARDED_FOR
HTTP_X_FORWARDED_FOR_IP
HTTP_X_IMFORWARDS
HTTP_XROXY_CONNECTION
VIA
X_FORWARDED
X_FORWARDED_FOR

如果您发现您的这些键之一，你可以给不同的行为（下限等），以您的防爬证券$_SERVER变量。

结论

有很多的方法来检测你的网站滥用，所以你会发现一定的解决方案。但是，你要知道你的网站是如何准确使用，让您的证券不会是积极的与你的“正常”的用户。

Answer 2:

记住：HTTP是不是魔术。有规定的一组与每个HTTP请求发送报头; 包括卷曲（libcurl中和） - 如果这些报头由Web浏览器发送的，他们可以也被任何程序发送。

有的认为这是一个诅咒，但在另一方面，这是一个祝福，因为它大大简化了Web应用程序的功能测试。

更新：由于unr3al011正确地注意到，袅袅不执行JavaScript，所以理论上它可以创建一个页面的抓取器查看时（例如，具有设定以及后来的检查具体的cookie被JS手段），将表现不同。

尽管如此，这将会是一个非常脆弱的防线。该页面的数据仍然必须从服务器抓起-这HTTP请求（和它总是 HTTP请求）可以通过卷曲进行仿真。检查这个答案例如如何战胜这样的防守。

......我甚至没有提到一些采集卡能够执行JavaScript。）

Answer 3:

避免假冒参照网址的方式跟踪用户

您可以通过一种或多种方法，这种跟踪用户：

保存在客户端浏览器cookie中有一些特殊的代码（例如：最后访问的URL，时间戳），并验证它在你的服务器的每个响应。
和以前一样，但使用的会话，而不是明确的饼干

饼干你应该增加像密码的安全性。

[Cookie]
url => http://someurl/
hash => dsafdshfdslajfd

哈希在PHP这样计算方法为

$url = $_COOKIE['url'];
$hash = $_COOKIE['hash'];
$secret = 'This is a fixed secret in the code of your application';

$isValidCookie = (hash('algo', $secret . $url) === $hash);

$isValidReferer = $isValidCookie & ($_SERVER['HTTP_REFERER'] === $url)

Answer 4:

您可以通过以下方法检测卷曲的UserAgent。但要注意的用户代理可以由用户重写，反正默认设置可以通过认可：

function is_curl() {
    if (stristr($_SERVER["HTTP_USER_AGENT"], 'curl'))
        return true;
}

Answer 5:

如一些人所提到的卷曲不能执行JavaScritp（据我所知），所以你可能尝试设置成才像raina77ow建议但不会wokrk其他采集卡/ donwloaders。

我建议你尝试建立一个机器人陷阱你处理采集卡/下载者可以执行JavaScript的方式。

我不知道有任何1个解决方案完全避免这种情况，所以我最好的建议是尝试多种解决方案：

1）只允许已知用户代理，如在你的.htaccess文件中的所有主流浏览器

2）设置你的robots.txt防止机器人

3）建立了机器人僵尸陷阱，不尊重robots.txt文件

Answer 6:

投入的根文件夹，因为这.htaccess文件。它可以帮助。我发现它在一个虚拟主机提供商的网站，但不知道这意味着什么:)

SetEnvIf User-Agent ^Teleport graber   
SetEnvIf User-Agent ^w3m graber    
SetEnvIf User-Agent ^Offline graber   
SetEnvIf User-Agent Downloader graber  
SetEnvIf User-Agent snake graber  
SetEnvIf User-Agent Xenu graber   
Deny from env=graber