My script parses all the links again and again from a webpage

Posted 2019-09-16 17:08

I've written a script in Python with Selenium to get all the company links from a webpage that doesn't display all of them until it is scrolled to the bottom. When I run my script, I do get the desired links, but lots of duplicates are scraped along with them. At this point I can't figure out how to modify my script to get only the unique links. Here is what I've tried so far:

from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')
while True:
    # scroll to the bottom so the page loads more list items
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

    # each pass re-reads every item currently on the page, so links
    # printed on earlier passes get printed again
    for items in driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]"):
        item = items.find_elements_by_xpath('.//a')[0]
        print(item.get_attribute("href"))

driver.close()

2 Answers

叛逆 · 2019-09-16 17:51

You can try the code below:

# continues from the setup in the question: driver is already on the page
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException

my_links = []
while True:
    try:
        # count the links currently in the DOM (not the accumulated list,
        # which contains duplicates) so the wait condition below compares
        # against the right baseline
        current_length = len(driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a"))
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # until() calls the lambda with the driver, so it must take an
        # argument; wait up to 10 seconds for new links to load after the scroll
        wait(driver, 10).until(lambda driver: len(driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")) > current_length)
        my_links.extend([a.get_attribute("href") for a in driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")])
    except TimeoutException:
        # no new links appeared within 10 seconds: everything has loaded
        break

my_links = set(my_links)

This scrolls down and collects new links for as long as the page keeps loading more. Converting the list to a set() at the end leaves only the unique values.
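
One caveat: set() discards the order in which the links appeared on the page. If you want to keep the page order, here's a minimal order-preserving alternative, assuming my_links is the list built above:

# dict keys preserve insertion order (Python 3.7+), so this drops
# duplicates while keeping the links in the order they were scraped
unique_links = list(dict.fromkeys(my_links))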

地球回转人心会变 · 2019-09-16 17:52

I don't know Python, but I do know what you are doing wrong. Hopefully you'll be able to figure out the code for yourself ;)

Every time you scroll down, 50 more links are added to the page until there are 1000 links. Well, almost: it starts with 20 links, then adds 30, and then 50 each time until there are 1000.

The way your code works now, each pass prints:

the first 20 links;

the first 20 again, plus the next 30;

the first 50, plus the next 50;

and so on...

What you actually want to do is scroll down the page until all the links are loaded and only then print them. Hope that helps.

Here's the updated Python code (I've checked it and it works):

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')

while True:
    # keep scrolling until the page has loaded all 1000 links
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    listElements = driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")
    print(len(listElements))
    if len(listElements) == 1000:
        break

# print each link exactly once, after the full list has loaded
for item in listElements:
    print(item.get_attribute("href"))

driver.close()

If you want it to work a bit faster, you could swap out the time.sleep(5) for the explicit wait used in the first answer.
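
For reference, here's a sketch that combines both ideas without hardcoding the final count of 1000: scroll, wait for the link count to grow, and stop once it stops growing. The 10-second timeout and the XPath come from the answers above; the rest (variable names, the assumption that the page stops adding links once the list is complete) is mine, not from the original answers.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')
link_xpath = "//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a"

while True:
    count = len(driver.find_elements_by_xpath(link_xpath))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # wait until the scroll has loaded more links than we had before
        WebDriverWait(driver, 10).until(
            lambda d: len(d.find_elements_by_xpath(link_xpath)) > count
        )
    except TimeoutException:
        # nothing new appeared, so the whole list has loaded
        break

# read the hrefs once, after everything has loaded, so there are no duplicates
for a in driver.find_elements_by_xpath(link_xpath):
    print(a.get_attribute("href"))

driver.close()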
