I'm not sure why, but my script always stops crawling once it hits page 9. There are no errors, exceptions, or warnings, so I'm kind of at a loss.
Can somebody help me out?
P.S. Here is the full script in case anybody wants to test it for themselves!
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# create_webdriver_instance() and already_scraped_product_titles are defined
# elsewhere in the full script (omitted from this excerpt): they build a fresh
# Firefox driver and hold the titles collected so far.

def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        # Sort the deals by "Discount - High to Low"
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count+1 is len(items):
                try:
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()
```
Printing the length of `items` invokes some strange behaviour too. Instead of always returning 32, which would correspond to the number of items on each page, it prints 32 for the first page, 64 for the second, 96 for the third, and so forth. I fixed this by using `//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]` instead of `//div[contains(@id, "100_dealView_")]` as the XPath for the `items` variable. I'm hoping this is the reason why it runs into issues on page 9. I'm running tests right now. Update: it is now scraping page 10 and beyond, so the issue is resolved.
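In case it helps anyone, the tightened locator drops straight into the existing wait; a sketch, reusing the script's `ff` driver and 15-second wait:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# One match per deal container on the current page, instead of one per
# "100_dealView_" wrapper (whose matches accumulated across pages: 32, 64, 96, ...).
items = WebDriverWait(ff, 15).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]')
    )
)
print(len(items))  # now 32 on every page
```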
As per the 10th revision of this question, the error message implies that the `get()` method failed, raising an HTTPConnectionPool error with the message Max retries exceeded. A couple of things:

Requests never retries (it sets `retries=0` for urllib3's `HTTPConnectionPool`), so the error would have been much more canonical without the MaxRetryError and HTTPConnectionPool keywords. You will find a detailed explanation (and what the ideal traceback looks like) in MaxRetryError: HTTPConnectionPool: Max retries exceeded (Caused by ProtocolError('Connection aborted.', error(111, 'Connection refused'))).
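To see the moving parts in isolation, here is a minimal, self-contained sketch (not from the script above; the dead local port is an assumption chosen to force a refused connection) showing urllib3 raising MaxRetryError when the retry budget is zero, the way Requests configures it:

```python
import urllib3
from urllib3.util.retry import Retry

# Retry(total=0) mirrors the retries=0 that Requests hands to its pools:
# the very first connection failure is wrapped in MaxRetryError and re-raised.
pool = urllib3.PoolManager(retries=Retry(total=0))

try:
    pool.request('GET', 'http://127.0.0.1:9/')  # port 9: assumed to have no listener
except urllib3.exceptions.MaxRetryError as err:
    print(err)  # HTTPConnectionPool(host='127.0.0.1', port=9): Max retries exceeded ...
```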
Solution
As per the Release Notes of Selenium 3.14.1:

> The Merge is: repair urllib3 can't set timeout!
Conclusion
Once you upgrade to Selenium 3.14.1 you will be able to set the timeout, see canonical tracebacks, and take the required action.
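For instance, a sketch of what that unblocks (RemoteConnection is the Selenium Python client that carries every driver command; the 120-second value here is an arbitrary choice):

```python
from selenium.webdriver.remote.remote_connection import RemoteConnection

# With Selenium >= 3.14.1 the urllib3-backed client honours this timeout,
# so a hung driver command fails with a readable traceback instead of stalling.
RemoteConnection.set_timeout(120)       # seconds, applied to subsequent commands
print(RemoteConnection.get_timeout())   # confirm the value took effect
```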
References
A couple of relevant references:

This use case
I have taken your full script from codepen.io - A PEN BY Anthony. I had to make a few tweaks to your existing code as follows:
As you have used `random` within the script, you have to mandatorily import it as:
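```python
import random
```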
You have created the variable `next_button` but haven't used it. I have clubbed up the following four lines:
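```python
next_button = WebDriverWait(ff, 15).until(
    EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
)
ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
```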
Into a single wait-and-click, something like:
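```python
# One expression: wait until the "Next→" link is clickable, then click it
# (element_to_be_clickable returns the element, so .click() can chain onto it).
WebDriverWait(ff, 15).until(
    EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Next→'))
).click()
```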
Your modified code block will then look something like this (with `create_webdriver_instance()` and `already_scraped_product_titles` still coming from the rest of your script):
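```python
import random
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count + 1 == len(items):
                try:
                    # Clubbed: wait for the "Next→" link to be clickable, then click
                    WebDriverWait(ff, 15).until(
                        EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Next→'))
                    ).click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()
```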
Console Output: with Selenium v3.14.0 and Firefox Quantum v62.0.3, I was able to extract the output on the console.
I slightly adjusted the code and it seems to work. Changes:

- Added the `import random` statement, because `random` is used and the script would not run without it.
- Inside the `product_title` loop, these lines are removed: `ff.quit()`, `refresh_page(url)` and `break`. The `ff.quit()` statement would cause a fatal (connection) error, causing the script to break.
- Also, `is` is changed to `==` in `if count + 1 == len(items):`.
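Putting those changes together, the inner loop reads something like this sketch (the surrounding `refresh_page()` scaffolding is unchanged):

```python
for count, item in enumerate(items):
    slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
    active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
    if len(slashed_price) > 0 and len(active_deals) > 0:
        product_title = item.find_element(By.ID, 'dealTitle').text
        if product_title not in already_scraped_product_titles:
            # Only record the title; no ff.quit()/refresh_page(url)/break here,
            # so the live driver connection is never torn down mid-page.
            already_scraped_product_titles.append(product_title)
    if count + 1 == len(items):  # '==' compares values; 'is' tested object identity
        try:
            WebDriverWait(ff, 15).until(
                EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
            )
            ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
            url = ff.current_url
            ff.quit()            # safe here: refresh_page() builds a fresh driver
            refresh_page(url)
        except Exception as error:
            print(error)
            ff.quit()
```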