Website blocks Python crawler. Searching for ideas

Published 2019-08-17 00:58

Question:

I want to crawl data from listing pages on https://www.fewo-direkt.de (in the US: https://www.homeaway.com/), like this one: https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326 But when the crawler requests the page, all I get back is a page with the code below. I think fewo-direkt blocks crawlers, but I don't know how, or whether there is a possible way to avoid it. Does anyone have an idea?

Python, requests, BeautifulSoup. It works fine with other websites.

<html style="height:100%">
   <head>
      <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
      <meta content="telephone=no" name="format-detection"/>
      <meta content="initial-scale=1.0" name="viewport"/>
      <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
      <script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3" type="text/javascript"></script>
   </head>
   <body style="margin:0px;height:100%"><iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=20&amp;xinfo=5-259174360-0%200NNN%20RT%281546012021046%20144%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B15%283%2c901868%2c0%29%20U5&amp;incident_id=877000750783982903-1038286134589588661&amp;edet=15&amp;cinfo=03000000" width="100%">Request unsuccessful. Incapsula incident ID: 877000750783982903-1038286134589588661</iframe></body>
</html>

Answer 1:

There are many methods websites can use for bot detection. They can be grouped as follows:

  1. Headers validation. This is the most widespread basic-level check: the backend inspects HTTP request headers for presence or absence and for default, fake, or corrupted values.

    E.g. the default User-Agent in python requests starts with python-requests/, which is easy to check on the backend; as a result, your client gets flagged as a bot and receives an "error" response.

    Solution: Sniff the same request from a browser (you can use Fiddler) and clone the headers the browser sends. In python requests this can be done with the following code:

    import requests

    # Clone the headers sniffed from the browser; the value below is a placeholder.
    headers = {
        "User-Agent": "Some User-Agent"
    }
    response = requests.get(url, headers=headers)
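
    A fuller sketch of the same idea, with the kind of headers a desktop browser typically sends (all values below are illustrative placeholders; clone the actual headers your own browser sends):

    import requests

    # Illustrative headers cloned from a desktop browser session; copy the
    # real values your browser sends (e.g. captured with Fiddler or dev tools).
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/76.0.3809.100 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",
        "Referer": "https://www.fewo-direkt.de/",
    }

    url = "https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326"
    response = requests.get(url, headers=headers)
    print(response.status_code)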
    
  2. Cookies validation. Yes, a Cookie is also an HTTP header, but the validation method differs from the previous one: the backend checks the Cookie header and validates each cookie.

    Solution:

    1) Sniff all requests made by the browser;

    2) Check the request you're trying to repeat and take a look at its Cookie header;

    3) Search for the value of each cookie in the previous requests and responses;

    4) Before the main request, repeat each request which sets those cookie(s), so that all required cookies are collected.

    In python requests you don't need to collect the cookies manually; just use a session:

    import requests

    http_session = requests.Session()
    http_session.get(url_to_get_cookie)  # cookies will be stored inside the "http_session" object
    response = http_session.get(final_url)
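
    Putting both ideas together for the page from the question (a sketch; the User-Agent is a placeholder to be replaced with a real browser value):

    import requests
    from bs4 import BeautifulSoup

    http_session = requests.Session()
    # Placeholder User-Agent; clone the real one from your browser.
    http_session.headers.update({"User-Agent": "Some User-Agent"})

    # The first request collects the cookies the site sets...
    http_session.get("https://www.fewo-direkt.de/")
    # ...which are then sent automatically with the request for the listing.
    response = http_session.get(
        "https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326")
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.title)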
    
  3. IP address or provider validation. A website can check that the IP address and its provider are not listed in spam databases; this is likely to be triggered if you're using public proxies or a VPN.

    Solution: Try different proxies or change your VPN server.
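
    In python requests, routing through a proxy is a matter of passing a proxies mapping (the addresses below are placeholders):

    import requests

    # Placeholder proxy addresses; substitute working proxies of your own.
    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }
    url = "https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326"
    response = requests.get(url, proxies=proxies)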

Of course, this is an oversimplified guide that doesn't cover JavaScript-based generation of headers/tokens, "control" requests, WebSockets, etc. But in my opinion it can be helpful as an entry-level guide that shows someone where to start looking.