Consider this case: I want to crawl websites frequently, but my IP address gets blocked after a certain number of days or requests.
So, how can I change my IP address dynamically, or are there any other ideas?
If you have several public IPs, add them to your interface and, if you are using Linux, use iptables to switch between those public IPs.
Sample iptables rules for two IPs
If you have 4 IPs, then the probability will become 0.25.
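As a rough sketch of how such rules could be generated for any number of IPs, the Python helper below prints SNAT rules that use the iptables `statistic` match in random mode. The interface name `eth0`, the example addresses, and the function name `snat_rules` are assumptions made for illustration. Because the rules are evaluated in order, an even split needs 1/N on the first rule, 1/(N-1) on the second, and so on, with the last rule left unconditional; 0.25 is therefore the value for the first of four rules.

```python
def snat_rules(ips, interface="eth0"):
    """Return iptables commands that spread new connections evenly over ips.

    interface and the addresses passed in are placeholders for this sketch.
    """
    rules = []
    total = len(ips)
    for index, ip in enumerate(ips):
        remaining = total - index
        rule = f"iptables -t nat -A POSTROUTING -o {interface}"
        if remaining > 1:
            # Rules are tried in order, so each one takes 1/remaining
            # of the traffic that earlier rules did not match.
            rule += f" -m statistic --mode random --probability {1 / remaining:.4f}"
        # The last rule has no match, so it catches everything left over.
        rule += f" -j SNAT --to-source {ip}"
        rules.append(rule)
    return rules


if __name__ == "__main__":
    # Documentation-range addresses, used here only as an example.
    for line in snat_rules(["198.51.100.1", "198.51.100.2",
                            "198.51.100.3", "198.51.100.4"]):
        print(line)
```

The nat table is consulted only for the first packet of a connection, so each connection keeps the source IP it was given.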
You can also create your own proxy in a few simple steps.
These rules will allow the proxy server to switch its outgoing IPs.
An approach using Scrapy makes use of two components, RandomProxy and RotateUserAgentMiddleware, together with a modified DOWNLOADER_MIDDLEWARES setting.
You will have to insert the new components in settings.py.
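A minimal sketch of what the settings.py entries might look like is given below. The project name `myproject`, the proxy list path, and the priority numbers are assumptions; the `scrapy_proxies.RandomProxy` entry and the `PROXY_LIST` / `PROXY_MODE` settings follow the scrapy-proxies README as far as I remember it, so check them against the version you install.

```python
# settings.py (fragment): lower numbers run earlier for outgoing requests.

RETRY_TIMES = 10  # retry more often, since free proxies fail frequently

DOWNLOADER_MIDDLEWARES = {
    # Turn off Scrapy's built-in fixed user agent middleware.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Hypothetical module path: wherever you saved the rotation middleware.
    "myproject.middlewares.RotateUserAgentMiddleware": 80,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    # RandomProxy must run before HttpProxyMiddleware so the proxy is applied.
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# Placeholder path to the proxy list file.
PROXY_LIST = "/path/to/proxy/list.txt"
# Assumed to mean "a different proxy for every request"; see the README.
PROXY_MODE = 0
```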
Random Proxy:
This component processes Scrapy requests using a random proxy from a list, to avoid IP bans and improve crawling speed.
More details here: (https://github.com/aivarsk/scrapy-proxies) You can build up your proxy list from a quick internet search. Copy the proxy addresses into the list.txt file according to the requested URL format.
Rotation of user agent
For each Scrapy request, a random user agent is picked from a list you define in advance.
More details here: (https://gist.github.com/seagatesoft/e7de4e3878035726731d)
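To illustrate the idea, here is a minimal sketch of a user agent rotation middleware. It is not a copy of the linked gist; the setting name `USER_AGENT_CHOICES` is an assumption for this example, and the class relies only on the standard Scrapy downloader middleware hooks (`from_crawler` and `process_request`).

```python
import random


class RotateUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_CHOICES is a setting name assumed for this sketch;
        # define it in settings.py as a list of user agent strings.
        return cls(crawler.settings.getlist("USER_AGENT_CHOICES"))

    def process_request(self, request, spider):
        # Overwrite the header only when a list was configured.
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request.
        return None
```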
You can try using proxy servers to avoid being blocked. There are services that provide working proxies. The best I have tried is https://gimmeproxy.com; they frequently check their proxies against various parameters.
To get a proxy from them, you just need to make the following request:
They will return a JSON response with all the proxy data, which you can use later as needed:
You can use it with curl like this:
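The same flow could be sketched in Python with only the standard library. Note the assumptions: the endpoint path `https://gimmeproxy.com/api/getProxy` and the JSON field names `protocol`, `ip`, and `port` are taken from memory of the service, not from this answer, so verify them against the service's documentation. The helper names are made up for this example.

```python
import json
import urllib.request

# Assumed endpoint; check the service documentation before relying on it.
API_URL = "https://gimmeproxy.com/api/getProxy"


def fetch_proxy(api_url=API_URL, timeout=10):
    """Ask the service for one proxy and return the decoded JSON as a dict."""
    with urllib.request.urlopen(api_url, timeout=timeout) as response:
        return json.load(response)


def proxy_url(data):
    """Build 'protocol://ip:port' from the response (field names assumed)."""
    return f"{data['protocol']}://{data['ip']}:{data['port']}"


def opener_for(data):
    """Return a urllib opener that sends HTTP and HTTPS traffic via the proxy.

    urllib only handles HTTP(S) proxies, so SOCKS entries need another client.
    """
    address = proxy_url(data)
    handler = urllib.request.ProxyHandler({"http": address, "https": address})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    proxy = fetch_proxy()
    page = opener_for(proxy).open("http://example.com", timeout=10).read()
    print(len(page))
```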
Some VPN applications allow you to automatically change your IP address to a new random IP address at a set interval, such as every 2 minutes. Both HMA! Pro VPN and VPN4ALL software support this feature.
If you are using R, you could do the web crawling through Tor. I think Tor changes its IP address automatically about every 10 minutes. I think there is a way to force Tor to change the IP at shorter intervals, but that didn't work for me. Instead, you could set up multiple instances of Tor and then switch between the independent instances (here you can find a good explanation of how to set up multiple instances of Tor: https://tor.stackexchange.com/questions/2006/how-to-run-multiple-tor-browsers-with-different-ips)
After that, you could do something like the following in R (use the ports of your independent Tor instances and a list of user agents; every time you call the 'getURL' function, cycle through your list of ports and user agents)
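The cycling idea itself is not specific to R. A minimal Python sketch is given below, assuming three local Tor instances listening on SOCKS ports 9050, 9052, and 9054 (hypothetical values; use whatever ports you configured) and a short, made-up list of user agents. It only builds the proxy and header settings for each request and leaves the actual HTTP call to the client of your choice.

```python
from itertools import cycle

# Hypothetical SOCKS ports of independently running Tor instances.
TOR_PORTS = [9050, 9052, 9054]

# Example user agent strings; replace them with your own list.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


def rotation(ports, user_agents):
    """Yield (proxies, headers) pairs forever, advancing both lists in step."""
    for port, agent in zip(cycle(ports), cycle(user_agents)):
        # socks5h makes the proxy resolve host names, so DNS also goes via Tor.
        address = f"socks5h://127.0.0.1:{port}"
        yield {"http": address, "https": address}, {"User-Agent": agent}


if __name__ == "__main__":
    settings = rotation(TOR_PORTS, USER_AGENTS)
    for _ in range(4):
        proxies, headers = next(settings)
        print(proxies["http"], headers["User-Agent"])
```

Each pair can be passed to an HTTP client that supports SOCKS proxies, for example `requests.get(url, proxies=proxies, headers=headers)` with the `requests[socks]` extra installed.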