I'm scraping Bet365, probably one of the trickiest websites I've encountered, with Selenium and Chrome. The issue with this site is that, even though my scraper sleeps between actions so that it never runs faster than a human could, it sometimes blocks my IP for a random amount of time (between half an hour and 2 hours).
So I'm looking into proxies to change my IP and resume scraping. This is where I'm stuck, trying to decide how to approach it.
I've tried 2 different free IP providers, as follows.
I wasn't able to make the first one work. I'm emailing their support, but what I have, which should work, is this:
import requests

api = "MY_API_KEY"  # with the free plan I can ask for an IP 240 times a day
adder = "&post=true&supportsHttps=true&maxCheckPeriod=3600"
url = "https://gimmeproxy.com/api/getProxy?"
r = requests.get(url=url, params=adder)

EDIT: the request now includes the API key:

apik = "api_key={}".format(api)
r = requests.get(url=url, params=apik + adder)

Originally I got no answer, just a 404 Not Found error. EDIT: it works now, my bad.
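For reference, the same query can be assembled from a dict so that no parameter (such as the API key) gets lost in string concatenation. This is only a sketch: the function name `build_gimmeproxy_url` and the `max_check_period` argument are made up for illustration, and the parameter names are simply the ones used in the request above.

```python
from urllib.parse import urlencode

GIMMEPROXY_ENDPOINT = "https://gimmeproxy.com/api/getProxy"


def build_gimmeproxy_url(api_key, max_check_period=3600):
    """Build the getProxy URL (hypothetical helper, parameters taken from the question)."""
    params = {
        "api_key": api_key,
        "post": "true",
        "supportsHttps": "true",
        "maxCheckPeriod": max_check_period,
    }
    # urlencode joins the pairs with '&' and escapes any special characters
    return GIMMEPROXY_ENDPOINT + "?" + urlencode(params)


print(build_gimmeproxy_url("MY_API_KEY"))
```

Passing the same dict as `params=` to `requests.get` should produce an equivalent query string.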
My second approach is through this other site, sslproxy.
With this one, you scrape the page and get a list of 100 IPs, theoretically checked and working. So I've set up a loop that tries a random IP from that list and, if it doesn't work, deletes it from the list and tries again. This approach works when trying to open Bet365.
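import random
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# `proxies` is the scraped list of {'ip': ..., 'port': ...} dicts; `path` points to chromedriver
url = "https://www.bet365.es"
for n in range(1, 100):
    # pick a random proxy from the remaining candidates
    proxy_index = random.randint(0, len(proxies) - 1)
    proxi = proxies[proxy_index]
    PROXY = proxi['ip'] + ':' + proxi['port']
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--proxy-server={}'.format(PROXY))
    browser = None
    try:
        browser = webdriver.Chrome(path, options=chrome_options)
        browser.get(url)
        WebDriverWait(browser,10)..... #no need to post the whole condition
        break
    except Exception:
        # dead proxy: drop it, close the browser if it started, and try another
        del proxies[proxy_index]
        if browser is not None:
            browser.quit()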
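One edge case in the loop above: if every proxy fails, the list ends up empty and `random.randint(0, -1)` raises a `ValueError`. The selection logic can be separated from Selenium so that it stops cleanly when the pool is exhausted. This is a sketch under assumptions: `pick_working_proxy` and the `works` callback are hypothetical names, and `works` stands in for whatever check you run (for example, opening Bet365 through the proxy).

```python
import random


def pick_working_proxy(proxies, works):
    """Try random proxies until works(proxy) is truthy.

    Failed proxies are deleted from `proxies` in place.
    Returns the first working proxy, or None if the pool runs out.
    """
    while proxies:
        index = random.randrange(len(proxies))
        candidate = proxies[index]
        try:
            if works(candidate):
                return candidate
        except Exception:
            pass  # any error counts as a dead proxy
        del proxies[index]
    return None


# Usage with a stand-in check: only the proxy on port 8080 "works"
pool = [{"ip": "10.0.0.1", "port": "3128"}, {"ip": "10.0.0.2", "port": "8080"}]
print(pick_working_proxy(pool, lambda p: p["port"] == "8080"))
```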
Well, with this one I succeeded in opening Bet365, and I'm still checking, but I think this webdriver is going to be much slower than my original one with no proxy.
So my question is: is it expected that scraping through a proxy is going to be much slower, or does it depend on the proxy used? If so, does anyone recommend a different (or, surely, better) approach?
I don't see any significant issue either in your approach or your code block. However, another approach would be to make use of all the proxies marked with in the Last Checked column which gets updated within the Free Proxy List.
As a solution you can write a script to grab all the proxies available and create a List dynamically every time you initialize your program. The following program will invoke a proxy from the Proxy List one by one until a successful proxied connection is established and verified through the Page Title of https://www.bet365.es to contain the text bet365. An exception may arise because the free proxy which your program grabbed was overloaded with users trying to get their proxy traffic through.
Code Block:
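A minimal sketch of such a program, under stated assumptions: the helper names `parse_proxies` and `open_with_first_working_proxy` are illustrative, and the regular expression assumes the proxy list page renders each entry as adjacent `<td>ip</td><td>port</td>` cells, which you should verify against the actual page source. Selenium is imported inside the second function so the parsing helper can be used on its own.

```python
import re

# Assumed row layout: <td>1.2.3.4</td><td>8080</td> (check the real page source)
PROXY_ROW = re.compile(r"<td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*<td>(\d{2,5})</td>")


def parse_proxies(html):
    """Return 'ip:port' strings for every ip/port cell pair found in the HTML."""
    return ["{}:{}".format(ip, port) for ip, port in PROXY_ROW.findall(html)]


def open_with_first_working_proxy(proxies, url="https://www.bet365.es",
                                  expected="bet365", timeout=10):
    """Try proxies in order; return (driver, proxy) once the title contains `expected`."""
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    for proxy in proxies:
        options = webdriver.ChromeOptions()
        options.add_argument("--proxy-server={}".format(proxy))
        driver = webdriver.Chrome(options=options)
        try:
            driver.get(url)
            # TimeoutException is a WebDriverException, so a slow proxy is skipped too
            WebDriverWait(driver, timeout).until(EC.title_contains(expected))
            return driver, proxy
        except WebDriverException:
            driver.quit()  # overloaded or dead proxy: move on to the next one
    return None, None
```

The HTML passed to `parse_proxies` can come from `driver.page_source` or any HTTP client. `open_with_first_working_proxy` returns `(None, None)` when no proxy in the list gets through.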
Console Output: