Python Webscraping Selenium and BeautifulSoup (Mod

2019-05-29 05:03发布

问题:

I am trying to learn webscraping (I am a total novice). I noticed that on some websites (for eg. Quora), when I click a button and a new element comes up on screen. I cannot seem to get the page source of the new element. I want to be able to get the page source of the new popup and get all the elements. Note that you need to have a Quora account in order to understand my problem.

I have a part of a code that you can use using beautifulsoup, selenium and chromedriver:

from selenium import webdriver
from bs4 import BeautifulSoup
from unidecode import unidecode 
import time

sleep = 10
USER_NAME = 'Insert Account name' #Insert Account name here
PASS_WORD = 'Insert Account Password' #Insert Account Password here
url = 'Insert url' 
url2 = ['insert url']
#Logging in to your account
driver = webdriver.Chrome('INSERT PATH TO CHROME DRIVER')
driver.get(url)
page_source=driver.page_source
if 'Continue With Email' in page_source:
    try:
        username = driver.find_element(By.XPATH, '//input[@placeholder="Email"]')
        password = driver.find_element(By.XPATH, '//input[@placeholder="Password"]')
        login= driver.find_element(By.XPATH, '//input[@value="Login"]')
        username.send_keys(USER_NAME)
        password.send_keys(PASS_WORD)
        time.sleep(sleep)
        login.click()
        time.sleep(sleep)
    except:
        print ('Did not work :( .. Try again')
else:
    print ('Did not work :( .. Try different page')


Next part will go to the concerned webpage and ("try to") collect information about the followers of a particular question.

for url1 in url2:        
    driver.get(url1)
    source = driver.page_source
    soup1 = BeautifulSoup(source,"lxml")  
    Follower_button = soup1.find('a',{'class':'FollowerListModalLink QuestionFollowerListModalLink'})
    Follower_button2 = unidecode(Follower_button.text)
    driver.find_element_by_link_text(Follower_button2).click()

####Does not gives me correct page source in the next line####
    source2=driver.page_source
    soup2=BeautifulSoup(source2,"lxml")

    follower_list = soup2.findAll('div',{'class':'FollowerListModal QuestionFollowerListModal Modal'})
    if len(follower_list)>0:
        print 'It worked :)'
    else:
        print 'Did not work :('

However when I try to get the page source of the followers element, I end up getting the page source of the main page rather than the follower element. Can anyone help me to get the page source of the follower element that pops up?? What am I not getting here.

NOTE: Another way of recreating or looking at my problem is to log in to your Quora account (if you have one) and then go to any question with followers. If you click the followers button on the lower right side of the screen, that will result in a popup. My problem is essentially to get the elements of this popup.


Update - Okay so I have been reading a bit and it seems like the window is a modal window. Does anyone help me with getting contents of a modal window?

回答1:

Problem resolved. All I had to do was to add one line:

time.sleep(sleep_time)

after generating the click. The problem was because there was no wait time initially, the page source was not getting updated. However with time.sleep sufficiently long (may vary from website to website), the page source finally got updated and I was able to get the required elements. :) Lesson learnt. Patience is the key to web scraping. Spent the entire day trying to figure this out.