Website blocks Python crawler. Searching for ideas

Posted 2019-08-17 00:16

I want to crawl data from object pages on https://www.fewo-direkt.de (in the US: https://www.homeaway.com/), like this one: https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326. But when the crawler tries to load the page, all I get is a page with the code below. I think fewo-direkt blocks crawlers, but I don't know how, or whether there is a possible way to avoid it. Does anyone have an idea?

I'm using Python with requests and BeautifulSoup; with other websites it works fine.

<html style="height:100%">
   <head>
      <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
      <meta content="telephone=no" name="format-detection"/>
      <meta content="initial-scale=1.0" name="viewport"/>
      <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
      <script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3" type="text/javascript"></script>
   </head>
   <body style="margin:0px;height:100%"><iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=20&amp;xinfo=5-259174360-0%200NNN%20RT%281546012021046%20144%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B15%283%2c901868%2c0%29%20U5&amp;incident_id=877000750783982903-1038286134589588661&amp;edet=15&amp;cinfo=03000000" width="100%">Request unsuccessful. Incapsula incident ID: 877000750783982903-1038286134589588661</iframe></body>
</html>

1 Answer
smile是对你的礼貌
#2 · 2019-08-17 00:59

There are many methods websites can use for bot detection. We can group them into the following list:

  1. Headers validation. This is the most widespread basic-level validation: the backend checks the HTTP request headers for missing, default, fake, or corrupted values.

    E.g. the default User-Agent in Python requests starts with python-requests/, which can easily be checked on the backend; as a result your client will be flagged as a bot and get an "error" response.

    Solution: Sniff the same request from a browser (you can use Fiddler) and clone the browser's headers. In Python requests this can be done with the following code:

    import requests

    # Replace "Some User-Agent" with the User-Agent string copied from your browser
    headers = {
        "User-Agent": "Some User-Agent"
    }
    response = requests.get(url, headers=headers)
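
    One way to check what the server actually receives is to send the same headers to an echo service; https://httpbin.org/headers is one such service, used here just as an example:

    # httpbin.org echoes back the headers it received, as JSON
    print(requests.get("https://httpbin.org/headers", headers=headers).json())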
    
  2. Cookies validation. Yes, Cookie is also an HTTP header, but the validation method differs from the previous one. The idea of this method is to check the Cookie header and validate each cookie.

    Solution:

    1) Sniff all requests made by the browser;

    2) Check the request you're trying to repeat and take a look at its Cookie header;

    3) Search for the values of each cookie in the previous requests;

    4) Repeat each request that sets those cookie(s) before the main request, to collect all required cookies.

    In Python requests you don't need to collect the cookies manually, just use a session:

    import requests

    http_session = requests.Session()
    http_session.get(url_to_get_cookie)  # cookies will be stored inside the "http_session" object
    response = http_session.get(final_url)
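
    To verify which cookies the session has collected before the final request, you can inspect the session's cookie jar:

    # Show the cookies collected so far as a plain dict
    print(http_session.cookies.get_dict())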
    
  3. IP address or provider validation. A website can check that your IP address and its provider are not listed in spam databases. This can easily happen if you're using public proxies or a VPN.

    Solution: Try other proxies or change your VPN.
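
    For completeness, this is how a proxy is passed to requests; the address below is a placeholder, not a working proxy:

    # "10.10.1.10:3128" is a placeholder address - substitute your own proxy
    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:3128",
    }
    response = requests.get(url, proxies=proxies)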

Of course, this is an oversimplified guide which doesn't cover JavaScript generation of headers/tokens, "control" requests, WebSocket, etc. But, in my opinion, it can be helpful as an entry-level guide that points someone in the right direction.
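
For pages that build such tokens in JavaScript (the Incapsula challenge in the question looks like one example), plain requests is usually not enough; one common workaround is to render the page in a real browser. Below is a minimal sketch using Selenium, assuming Chrome and a matching chromedriver are installed on PATH; whether it actually gets past Incapsula is not guaranteed:

    from selenium import webdriver

    # Assumption: Chrome and a matching chromedriver are installed and on PATH
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get("https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326")
    html = driver.page_source  # HTML after any JavaScript challenge has run
    driver.quit()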
