My script parses all the links again and again from a webpage

Posted 2019-09-16 17:08

I've written a script in Python with Selenium to get all the company links from a webpage that doesn't display all of them until it is scrolled to the bottom. When I run my script, I do get the desired links, but lots of duplicates are scraped along with them. At this point I can't figure out how to modify my script to get only the unique links. Here is what I've tried so far:

from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')
while True:
    # scroll to the bottom so the page loads more list items
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)

    # each pass re-reads every item currently on the page, so links
    # printed on earlier passes get printed again
    for items in driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]"):
        item = items.find_elements_by_xpath('.//a')[0]
        print(item.get_attribute("href"))

driver.close()

2 Answers

叛逆 · 2019-09-16 17:51

You can try the code below:

# continues from the setup in the question: driver is already on the page
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException

my_links = []
while True:
    try:
        # count the links currently in the DOM (not the accumulated list,
        # which contains duplicates) so the wait condition below compares
        # against the right baseline
        current_length = len(driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a"))
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # until() calls the lambda with the driver, so it must take an
        # argument; wait up to 10 seconds for new links to load after the scroll
        wait(driver, 10).until(lambda driver: len(driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")) > current_length)
        my_links.extend([a.get_attribute("href") for a in driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")])
    except TimeoutException:
        # no new links appeared within 10 seconds: everything has loaded
        break

my_links = set(my_links)

This scrolls down and collects new links for as long as the page keeps loading more. Converting the list to a set() at the end leaves only the unique values.
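
One caveat: set() discards the order in which the links appeared on the page. If you want to keep the page order, here's a minimal order-preserving alternative, assuming my_links is the list built above:

# dict keys preserve insertion order (Python 3.7+), so this drops
# duplicates while keeping the links in the order they were scraped
unique_links = list(dict.fromkeys(my_links))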

地球回转人心会变 · 2019-09-16 17:52

I don't know Python, but I do know what you are doing wrong. Hopefully you'll be able to figure out the code for yourself ;)

Every time you scroll down, 50 more links are added to the page until there are 1000 links. Well, almost: it starts with 20 links, then adds 30, and then 50 each time until there are 1000.

The way your code works now, each pass prints:

the first 20 links;

the first 20 again, plus the next 30;

the first 50, plus the next 50;

and so on...

What you actually want to do is scroll down the page until all the links are loaded and only then print them. Hope that helps.

Here's the updated Python code (I've checked it and it works):

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')

while True:
    # keep scrolling until the page has loaded all 1000 links
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    listElements = driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")
    print(len(listElements))
    if len(listElements) == 1000:
        break

# print each link exactly once, after the full list has loaded
for item in listElements:
    print(item.get_attribute("href"))

driver.close()

If you want it to work a bit faster, you could swap out the time.sleep(5) for the explicit wait used in the first answer.
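
For reference, here's a sketch that combines both ideas without hardcoding the final count of 1000: scroll, wait for the link count to grow, and stop once it stops growing. The 10-second timeout and the XPath come from the answers above; the rest (variable names, the assumption that the page stops adding links once the list is complete) is mine, not from the original answers.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')
link_xpath = "//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a"

while True:
    count = len(driver.find_elements_by_xpath(link_xpath))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # wait until the scroll has loaded more links than we had before
        WebDriverWait(driver, 10).until(
            lambda d: len(d.find_elements_by_xpath(link_xpath)) > count
        )
    except TimeoutException:
        # nothing new appeared, so the whole list has loaded
        break

# read the hrefs once, after everything has loaded, so there are no duplicates
for a in driver.find_elements_by_xpath(link_xpath):
    print(a.get_attribute("href"))

driver.close()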
