I'm trying to do some scraping, but I get blocked every 4 requests. I have tried changing proxies, but the error stays the same. What should I do to rotate them properly?
Here is the code where I try it. First I get proxies from a free proxy site. Then I make the request with the new proxy, but it doesn't work because I get blocked.
from fake_useragent import UserAgent
import requests

def get_player(id, proxy):
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    url = 'https://www.transfermarkt.es/jadon-sancho/profil/spieler/' + str(id)
    try:
        print(proxy)
        r = requests.get(url, headers=headers, proxies=proxy)
    except requests.RequestException:
        ...
    ...
    # code to manage the data
    ...
Getting proxies
from bs4 import BeautifulSoup

def get_proxies():
    ua = UserAgent()
    headers = {'User-Agent': ua.random}
    url = 'https://free-proxy-list.net/'
    r = requests.get(url, headers=headers)
    page = BeautifulSoup(r.text, 'html.parser')
    proxies = []
    for row in page.find_all('tr'):
        ip = port = None
        # the first two cells of each row are the IP and the port
        for i, data in enumerate(row.find_all('td')):
            if i == 0:
                ip = data.get_text()
            elif i == 1:
                port = data.get_text()
        if ip and port:
            proxies.append({'http': 'http://' + ip + ':' + port})
    return proxies
Calling functions
proxies = get_proxies()
for i in range(1, 100):
    # i // 4 means each proxy is reused for four consecutive players
    player = get_player(i, proxies[i // 4])
    ...
    # code to manage the data
    ...
I know the proxy scraping itself works, because when I print them I see entries like {'http': 'http://88.12.48.61:42365'}. What I want is to stop getting blocked.
The problem with using free proxies from sites like this is that:

- websites know about these lists and may block you simply for using one of them;
- you don't know that other people haven't already gotten them blacklisted by doing bad things with them;
- the site is likely using some other identifier to track you across proxies (device fingerprinting, proxy-piercing, etc.); a small mitigation sketch for that last point follows this list.
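On the fingerprinting point: randomizing the User-Agent on every request, as the code above does, is itself a recognizable signal, since a real visitor's browser string doesn't change between page loads. One small mitigation is to pin a single User-Agent (and a single requests.Session) to each proxy. This is just a sketch; build_clients is a hypothetical helper, not part of the original code:

import requests
from fake_useragent import UserAgent

def build_clients(proxies):
    # Hypothetical helper: give each proxy one fixed User-Agent and one
    # Session, so the (IP, User-Agent, cookies) combination stays consistent
    # across requests, the way a single real browser would.
    ua = UserAgent()
    clients = []
    for proxy in proxies:
        session = requests.Session()
        session.proxies = proxy                    # e.g. {'http': 'http://88.12.48.61:42365'}
        session.headers['User-Agent'] = ua.random  # chosen once, then reused
        clients.append(session)
    return clients

A Session also reuses cookies and connections, which again looks more like one consistent visitor per proxy; it won't defeat deeper device fingerprinting, though.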
Unfortunately, there's not a lot you can do other than be more sophisticated (distribute requests across multiple devices, use a VPN or Tor, etc.) and risk your IP being blocked for attempting DDoS-like traffic, or, preferably, see if the site has an API for access.
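If you do keep rotating free proxies anyway, a minimal defensive pattern is to rotate on failure rather than on a fixed count of four, and to throttle between attempts. Below is a rough sketch assuming the get_proxies() output from the question; fetch_with_rotation, the timeout, and the delay range are illustrative choices, not a guaranteed way past the block:

import random
import time
from itertools import cycle

import requests

def fetch_with_rotation(url, proxies, headers, max_tries=10):
    # Cycle through the proxy pool, moving to the next proxy as soon as a
    # request errors out or comes back blocked, and sleep a jittered interval
    # between attempts so the traffic doesn't look like a flood.
    pool = cycle(proxies)
    for _ in range(max_tries):
        proxy = next(pool)
        try:
            r = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            if r.status_code == 200:
                return r
            # 403/429 and friends: treat as blocked and rotate
        except requests.RequestException:
            pass  # dead or banned proxy, rotate
        time.sleep(random.uniform(2, 6))
    return None  # every attempt failed; give up rather than hammer the site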