I want to crawl a site whose content is generated by JavaScript. The site runs a JS update of the content every 5 seconds (it requests a new encrypted JS file, which I can't parse).
My code:
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get(url)  # url is defined elsewhere

trs = driver.find_elements_by_css_selector('.table tbody tr')
print len(trs)

items = []
for tr in trs:
    try:
        items.append(tr.text)
    except StaleElementReferenceException:
        # the js has updated the content, so this tr is gone from the DOM
        pass
print len(items)
len(items) would not match len(trs).
How can I tell Selenium to stop executing JS (or stop working entirely) after I run trs = driver.find_elements_by_css_selector('.table tbody tr')?
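To illustrate what I mean by stopping the JS, something like the following sketch might work — it assumes the page drives its refresh with setInterval/setTimeout, which I can't verify because the JS file is encrypted:

# Hedged sketch: register a dummy interval to learn the highest timer id,
# then clear every id up to it. Only helps if the 5-second refresh is
# scheduled via setInterval/setTimeout (an assumption).
driver.execute_script("""
    var maxId = window.setInterval(function () {}, 100000);
    for (var i = 0; i <= maxId; i++) {
        window.clearInterval(i);
        window.clearTimeout(i);
    }
""")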
I need to use trs later, so I can't call driver.quit().
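For comparison, a retry-based workaround sketch — it assumes I only need the rows' text, and it only narrows the race window rather than closing it:

from selenium.common.exceptions import StaleElementReferenceException

def snapshot_rows(driver):
    # Re-query and copy the text in one tight pass; if the 5-second update
    # detaches the rows mid-iteration, start over with fresh elements.
    while True:
        try:
            rows = driver.find_elements_by_css_selector('.table tbody tr')
            return [tr.text for tr in rows]
        except StaleElementReferenceException:
            pass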
Exception detail:
---------------------------------------------------------------------------
StaleElementReferenceException Traceback (most recent call last)
<ipython-input-84-b80e3579efca> in <module>()
11 items = []
12 for tr in trs:
---> 13 items.append(tr.text)
14 #items.append(map_label(hidemyass_label, tr.find_elements_by_tag_name('td')))
15
C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in text(self)
69 def text(self):
70 """The text of the element."""
---> 71 return self._execute(Command.GET_ELEMENT_TEXT)['value']
72
73 def click(self):
C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in _execute(self, command, params)
452 params = {}
453 params['id'] = self._id
--> 454 return self._parent.execute(command, params)
455
456 def find_element(self, by=By.ID, value=None):
C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.pyc in execute(self, driver_command, params)
199 response = self.command_executor.execute(driver_command, params)
200 if response:
--> 201 self.error_handler.check_response(response)
202 response['value'] = self._unwrap_value(
203 response.get('value', None))
C:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.pyc in check_response(self, response)
179 elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
180 raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
--> 181 raise exception_class(message, screen, stacktrace)
182
183 def _value_or_default(self, obj, key, default):
StaleElementReferenceException: Message: {"errorMessage":"Element is no longer attached to the DOM","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:63305","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"GET","url":"/text","urlParsed":{"anchor":"","query":"","file":"text","directory":"/","path":"/text","relative":"/text","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/text","queryKey":{},"chunks":["text"]},"urlOriginal":"/session/4bb16340-a3b6-11e5-8ce5-9d0be40203a6/element/%3Awdc%3A1450243990539/text"}}
Screenshot: available via screen
Apparently the tr is no longer attached to the DOM.
PS: I need to use Selenium to select elements. Other libs like lxml and pyquery don't know whether an element is display:none or not, their .text() often picks up comments or the contents of <script> tags, and other such bugs. It's sad that Python doesn't have a perfect clone of jQuery.
Use scrapy. Once you are sure the page has loaded, grab the rendered body from the driver and wrap it in a Scrapy response — a sketch, assuming driver.page_source (Selenium's accessor for the rendered HTML):
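from scrapy.http import HtmlResponse

# Snapshot the rendered DOM out of the browser; this string no longer
# changes when the page's JS fires again.
body = driver.page_source
response = HtmlResponse(url=driver.current_url,
                        body=body.encode('utf-8'),
                        encoding='utf-8')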
You now have a static copy of the page, so you can use scrapy's response.xpath to pull whatever data you need. This answer has more detail.
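For example, pulling the rows from the question's table (the selector is taken from the question; adjust as needed):

rows = response.xpath('//table[contains(@class, "table")]/tbody/tr')
items = [u' '.join(r.xpath('.//td/text()').extract()) for r in rows]
print len(items)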