How to prevent crawlers depending on XPath from ge

Posted 2019-02-10 01:56

There is a PHP library (something like cURL) that lets anybody scrape my site. To prevent this, I want to use dynamic class names for my elements. Look at this:

<div class="<?php echo $ClassName; ?>">anything</div> // $ClassName is taken from the database

Note: $ClassName will vary every time.

In this case, nobody knows which class name to use to select my element and copy my data. Now I have two problems:

  1. How can I link $ClassName in the markup to .$ClassName in the CSS file? In other words, how can I use a PHP variable for CSS class names (dynamic CSS classes)?
  2. Is it efficient to fetch all class names from the database?

3 answers
劳资没心,怎么记你
#2 · 2019-02-10 02:13
  1. Define your class in CSS in your page:
<style>
    .<?php echo $ClassName; ?> {
      /* Your CSS */
    }
</style>
  2. Just make $ClassName a randomly generated string; you don't need to connect to the database.
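
A minimal sketch of the "randomly generated string" idea, assuming PHP 7 or later. The helper name random_class_name() and the 'c' prefix are illustrative choices, not part of the original answer; the prefix is there because an unescaped CSS class selector cannot start with a digit.

```php
<?php
// Hypothetical helper (not from the answer): builds a random CSS class name.
function random_class_name(int $bytes = 6): string
{
    // bin2hex() yields only 0-9 and a-f, so the letter prefix guarantees
    // that the name never starts with a digit.
    return 'c' . bin2hex(random_bytes($bytes));
}

$ClassName = random_class_name();

// Emit the style rule and the element with the same generated name.
echo "<style>.{$ClassName} { /* Your CSS */ }</style>\n";
echo "<div class=\"{$ClassName}\">anything</div>\n";
```

Because the name is generated once per request and reused in both places, the CSS rule and the element always agree without any database lookup.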

Update

Building on bishop's answer, you can also vary the DOM structure of your document. Introduce two PHP variables, such as $start and $close. $start holds a random run of opening tags, such as <span><div><p>, and $close holds the matching closing tags, </p></div></span>. Then enclose your content between them:

<?php echo $start; ?><div class="<?php echo $ClassName; ?>">anything</div><?php echo $close; ?>
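
One possible way to build $start and $close so that the closing tags always mirror the opening ones is sketched below. The function random_wrappers() and its default tag list are assumptions made for illustration; the list is limited to elements that can be nested in each other freely, since a block element inside <p> or <span> is not valid HTML.

```php
<?php
// Hypothetical helper: picks $depth random tags and returns [opening, closing] markup.
function random_wrappers(int $depth, array $tags = ['div', 'section', 'article']): array
{
    $start = '';
    $close = '';
    for ($i = 0; $i < $depth; $i++) {
        $tag = $tags[array_rand($tags)];
        $start .= "<{$tag}>";
        // Prepend, so the closing tags come out in reverse order of the opening ones.
        $close = "</{$tag}>" . $close;
    }
    return [$start, $close];
}

[$start, $close] = random_wrappers(random_int(0, 3));
$ClassName = 'c' . bin2hex(random_bytes(6)); // assumed random class name

echo $start . "<div class=\"{$ClassName}\">anything</div>" . $close . "\n";
```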
何必那么认真
#3 · 2019-02-10 02:18

Fetching the class name from the database is not optimal when the same thing can be done locally. Define an array of all class names, then pick one of them with array_rand, something like this:

// php code
   <?php
     $classes = array('class1','class2','class3','class4');
     $class_name = $classes[array_rand($classes)];
   ?>


// html code
     <div class="<?php echo $class_name; ?>">anything</div>


// css code
   <style>
     .<?php echo $class_name; ?> {
       /* your CSS rules */
     }
   </style>

Note: PHP code in a plain .css file is not executed by default, so write any CSS that needs to be dynamic in your .php file, inside <style> ... </style>.

Meanwhile, as @sємsєм said, you can create dynamic HTML tags.

Something like this (full code):

// php code
   <?php
     // dynamic class
     $classes = array('class1','class2','class3','class4');
     $class_name = $classes[array_rand($classes)];

     // dynamic tags: each entry in $tags_end closes the same-index entry
     // in $tags_start, innermost tag first
     $tags_start = array('','<div>','<div><div>','<div><p>','<span><div>');
     $tags_end = array('','</div>','</div></div>','</p></div>','</div></span>');
     $numb = array_rand($tags_start);
   ?>


// html code
     <?php echo $tags_start[$numb]; ?>
     <div class="<?php echo $class_name; ?>">anything</div>
     <?php echo $tags_end[$numb]; ?>


// css code
   <style>
     .<?php echo $class_name; ?> {
       /* your CSS rules */
     }
   </style>

For more protection, you can also wrap your content (here 'anything') in a dynamic tag of its own, in addition to the outer dynamic tags. For example:

<span1>anything</span1> // <span1> changes to <span2>, <span3>, <span4>, ...

In this case, the tag directly around the data is also dynamic, which makes things harder for crawlers.

Finally, I must say that you can't stop crawlers entirely; you can only make scraping more difficult. If you really want to protect your data, you can do things like these:

  • Increase restrictions for users (e.g. only registered users can see important information).
  • Monitor the IP addresses that use your website (and block any that look suspicious).
  • Use relevant software (e.g. to limit the number of searches per IP per day).
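
As a rough illustration of the last bullet, a per-IP daily limit could look like the sketch below. The class DailyLimiter is hypothetical and keeps its counters in memory only; a real site would need a store shared across requests (a database, cache, or the web server's own rate limiting), which is outside what this answer describes.

```php
<?php
// Hypothetical in-memory counter, for illustration only.
final class DailyLimiter
{
    private int $maxPerDay;

    /** @var array<string, int> request counts keyed by "ip|date" */
    private array $counts = [];

    public function __construct(int $maxPerDay)
    {
        $this->maxPerDay = $maxPerDay;
    }

    // Records one request and returns true while the IP is within its daily quota.
    public function allow(string $ip, int $timestamp): bool
    {
        $key = $ip . '|' . gmdate('Y-m-d', $timestamp);
        $this->counts[$key] = ($this->counts[$key] ?? 0) + 1;
        return $this->counts[$key] <= $this->maxPerDay;
    }
}

$limiter = new DailyLimiter(2);
$now = time();
var_dump($limiter->allow('203.0.113.7', $now)); // bool(true)
var_dump($limiter->allow('203.0.113.7', $now)); // bool(true)
var_dump($limiter->allow('203.0.113.7', $now)); // bool(false): third request that day
```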
我命由我不由天
#4 · 2019-02-10 02:32

Sorry to say, but your effort will be wasted. Even if the class name randomly changes, your DOM can still be attacked positionally, like: div + div > span > a.
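
To make the positional point concrete, here is a small sketch using PHP's DOM extension. It uses an XPath expression rather than the CSS selector quoted above, and the helper text_at() and the sample markup are made up for illustration. The same position-only query returns the data from both pages even though the class names differ.

```php
<?php
// Hypothetical helper: returns the text of the first node matching an XPath query.
function text_at(string $html, string $query): ?string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html); // the parser wraps the fragment in <html><body>
    $node = (new DOMXPath($doc))->query($query)->item(0);
    return $node === null ? null : $node->textContent;
}

// Two renderings of the same page that differ only in the random class name.
$pages = [
    '<div><div class="c91af03">anything</div></div>',
    '<div><div class="c5e77b2">anything</div></div>',
];

foreach ($pages as $html) {
    // Select by position only: the first div inside the first div of <body>.
    echo text_at($html, '//body/div[1]/div[1]'), "\n";
}
```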

But even if you rotated your positions (e.g. by adding spurious div and span elements), any scraper worth its salt isn't actually going to care: it's going to find the text on your page, then infer the intent from the nearest markup. That's how Google works, BTW.
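
A sketch of the first half of that claim (finding the text regardless of markup), with made-up sample markup and a hypothetical helper visible_text(). It does not attempt the "infer the intent" step; it only shows that two differently wrapped pages reduce to the same text.

```php
<?php
// Hypothetical helper: drops all tags, then collapses runs of whitespace.
function visible_text(string $html): string
{
    return trim(preg_replace('/\s+/', ' ', strip_tags($html)));
}

// The same content behind different spurious wrappers and class names.
$variants = [
    '<div class="c1"><span>Freds widgets are the best in the world</span></div>',
    '<section><div><p class="zz9">Freds widgets are the best in the world</p></div></section>',
];

foreach ($variants as $html) {
    echo visible_text($html), "\n";
}
```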

You have one realistic approach to this problem. First, attach an IDS monitor to your web server. When the IDS detects a scan pattern, throttle or shut down the IP. Or, and this is my favorite, throw the scanner into a honey pot with faked content. I.e., if your actual text reads "Fred's widgets are the best in the world", serve an alternate page that reads "Bob's gonads fell short of maritime bliss."
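
The honey-pot idea might be sketched like this. The burst detector looks_like_scan() is a deliberately crude stand-in for an IDS, and both function names and the thresholds are assumptions; a real IDS uses far richer signals than request counts.

```php
<?php
// Hypothetical detector: flags a client that made more than $maxHits requests
// within the last $window seconds.
function looks_like_scan(array $requestTimes, int $now, int $window = 10, int $maxHits = 20): bool
{
    $recent = array_filter(
        $requestTimes,
        fn (int $t): bool => $t > $now - $window && $t <= $now
    );
    return count($recent) > $maxHits;
}

// Serves the decoy page to suspected scanners and the real page to everyone else.
function page_for(array $requestTimes, int $now, string $real, string $decoy): string
{
    return looks_like_scan($requestTimes, $now) ? $decoy : $real;
}

$now = time();
$burst = array_fill(0, 25, $now - 1); // 25 requests within the last second
echo page_for($burst, $now, 'real page', 'faked content'), "\n";
echo page_for([$now - 5], $now, 'real page', 'faked content'), "\n";
```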

I deploy that latter tactic on a couple of my customers' sites, to hilarious results on Chinese copycats.
