how to fetch / grab polymer spa webpage by using p

2019-09-15 00:43发布

问题:

I'm trying to grab the content of the following url: https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html

My goal is to grab the content (source code) of the webpage as seen by the visitor, so after it has rendered all javascripts etc.

To do so I used the example mentioned here:http://techstonia.com/scraping-with-phantomjs-and-python.html

That example works on my server. But the challenge is to also have it work for polymer based SPA sites like the one mentioned. Those are really rendered javascript websites.

My code looks like:

import platform
from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS files have different extensions
# under different operating systems
if platform.system() == 'Windows':
    PHANTOMJS_PATH = './phantomjs.exe'
else:
    PHANTOMJS_PATH = './phantomjs'


# here we'll use pseudo browser PhantomJS,
# but browser can be replaced with browser = webdriver.FireFox(),
# which is good for debugging.
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')
print (browser)

The issue is that is delivers the following result:

<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<meta content="width=device-width, minimum-scale=1.0, initial-scale=1.0, user-scalable=yes" name="viewport">
<title>Single page app using Polymer</title>
<script async="" src="//www.google-analytics.com/analytics.js"></script><script src="/webcomponents.min.js"></script>
<!-- vulcanized version of imported elements --
       see "elements.html" for unvulcanized list of imports. -->
<link href="vulcanized.html" rel="import">
<link href="styles.css" rel="stylesheet" shim-shadowdom="">
</link></link></meta></meta></head>
<body fullbleed="" unresolved="">
<template id="t" is="auto-binding">
<!-- Route controller. -->
<flatiron-director autohash="" route="{{route}}"></flatiron-director>
<!-- Keyboard nav controller. -->
<core-a11y-keys id="keys" keys="up down left right space space+shift" on-keys-pressed="{{keyHandler}}" target="{{parentElement}}"></core-a11y-keys>
<core-scaffold id="scaffold">
<nav>
<core-toolbar>
<span>Single Page Polymer</span>
</core-toolbar>
<core-menu on-core-select="{{menuItemSelected}}" selected="{{route}}" selectedmodel="{{selectedPage}}" valueattr="hash">
<template repeat="{{page, i in pages}}">
<paper-item hash="{{page.hash}}" noink="">
<core-icon icon="label{{route != page.hash ? '-outline' : ''}}"></core-icon>
<a href="#{{page.hash}}">{{page.name}}</a>
</paper-item>
</template>
</core-menu>
</nav>
<core-toolbar flex="" tool="">
<div flex="">{{selectedPage.page.name}}</div>
<core-icon-button icon="refresh"></core-icon-button>
<core-icon-button icon="add"></core-icon-button>
</core-toolbar>
<div center-center="" fit="" horizontal="" layout="">
<core-animated-pages id="pages" on-tap="{{cyclePages}}" selected="{{route}}" transitions="slide-from-right" valueattr="hash">
<template repeat="{{page, i in pages}}">
<section center-center="" hash="{{page.hash}}" layout="" vertical="">
<div>{{page.name}}</div>
</section>
</template>
</core-animated-pages>
</div>
</core-scaffold>
</template>
<script src="app.js"></script>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-43475701-2', 'auto'); // ebidel's
  ga('create', 'UA-39334307-1', 'auto'); // pp.org
  ga('send', 'pageview');
</script>
</body></html>

As you see far from the real result you see when looking with your browser. The questions I have.... What do I do wrong and if possible where to look for the solution.

回答1:

I think you are missing something from the Selenium Webdriver docs. You can get the content of a dynamic page, but you have to make sure that the element you are searching is present and visible on the page:

import platform
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://docs-05-dot-polymer-
project.appspot.com/0.5/articles/demos/spa/final.html')

# Getting content of the first slide
res1 = browser.find_element_by_xpath('//*[@id="pages"]/section[1]/div')

# Save a screenshot so you can see why is failing (if it is)
browser.save_screenshot('screen_test')

# Print the text within the div
print (res1.text)

If you need to get also the text of the other slides, you need to click (using the webdriver) where needs to make visible the second slide, before getting the text from it.