I want to crawl data from the property pages of https://www.fewo-direkt.de (in the US https://www.homeaway.com/), for example: https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326
But when the crawler requests the page, I only get a page with the code below. I think fewo blocks crawlers, but I don't know how, or whether there is a possible way to avoid it. Does anyone have an idea?
Python, requests, BeautifulSoup - with other websites it works fine. (A simplified sketch of my request is shown after the response below.)
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3" type="text/javascript"></script>
</head>
<body style="margin:0px;height:100%"><iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=20&xinfo=5-259174360-0%200NNN%20RT%281546012021046%20144%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B15%283%2c901868%2c0%29%20U5&incident_id=877000750783982903-1038286134589588661&edet=15&cinfo=03000000" width="100%">Request unsuccessful. Incapsula incident ID: 877000750783982903-1038286134589588661</iframe></body>
</html>
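For reference, a simplified sketch of what I'm running (the URL is the example listing above; my real script does more parsing with BeautifulSoup):

import requests
from bs4 import BeautifulSoup

url = "https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326"
response = requests.get(url)

# Instead of the listing, the parsed document is only the Incapsula page shown above.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())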
There is a large number of methods websites can use for bot detection. We can group them into the following list:
Headers validation. This is the most widespread basic-level validation: the backend checks the HTTP request headers for missing, default, fake or corrupted values. E.g. the default User-Agent in python requests starts with python-requests/, which can easily be checked on the backend, and as a result your client will be flagged as a bot and get an "error" response.
Solution: sniff the same request from a browser (you can use Fiddler) and clone the headers the browser sends. In python requests it can be done with the following code:
import requests

headers = {
    "User-Agent": "Some User-Agent"  # replace with the value copied from your browser
}
response = requests.get(url, headers=headers)
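In practice, sites that validate headers often look at more than just User-Agent, so it can help to clone the full set your browser sends. A sketch with purely illustrative values (the exact strings should come from your own sniffed browser request):

import requests

url = "https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326"
headers = {
    # All values below are examples; copy the real ones from Fiddler or the
    # browser's network tab for the request you want to reproduce.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "de-DE,de;q=0.9,en;q=0.8",
    "Referer": "https://www.fewo-direkt.de/",
}
response = requests.get(url, headers=headers)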
Cookies validation. Yes, Cookie is also an HTTP header, but the validation method differs from the previous one. The idea of this method is to check the Cookie header and validate each cookie.
Solution:
1) Sniff all requests done by the browser;
2) Check the request you're trying to repeat and take a look at its Cookie header;
3) Search for the value of each cookie in the previous requests;
4) Repeat each request which sets the cookie(s) before the main request, to collect all required cookies.
In python requests you don't need to collect the cookies manually, just use a Session:
import requests

http_session = requests.Session()
http_session.get(url_to_get_cookie)  # cookies will be stored inside "http_session" object
response = http_session.get(final_url)
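If you want to verify which cookies were actually collected before firing the main request, the session exposes its cookie jar. A small sketch (the URLs are placeholders for whatever sequence your sniffing revealed):

import requests

http_session = requests.Session()
http_session.get("https://example.com/page-that-sets-cookies")  # placeholder URL

# Inspect what the previous request(s) stored in the session's cookie jar.
print(http_session.cookies.get_dict())

response = http_session.get("https://example.com/final-page")  # placeholder URL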
IP address or provider validation. The website can check that your IP address and provider are not listed in spam databases. This is likely the case if you're using public proxies or a VPN.
Solution: try other proxies or change your VPN.
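In python requests, routing traffic through a different proxy just means passing a proxies mapping. A minimal sketch, assuming you have a working proxy at the placeholder address below:

import requests

url = "https://www.fewo-direkt.de/ferienwohnung-ferienhaus/p8735326"
proxies = {
    # Placeholder address: replace with a real, non-blacklisted proxy.
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:3128",
}
response = requests.get(url, proxies=proxies)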
Of course, this is an oversimplified guide which doesn't include information about JavaScript generation of headers/tokens, "control" requests, WebSockets, etc. But, in my opinion, it can be helpful as an entry-level guide that points someone where to look.