Efficient practice to scrape a page with Client-si

Posted 2019-05-31 00:49

Question:

I want a script that will scrape a certain web page every hour, and will look for a certain string inside that page.

However, when I open that page and use `view-source:`, I cannot see that string in the source. I was told this is because the string comes from an element that is rendered on the client side (JavaScript), so I can see it only when I manually inspect that element, for example with the Chrome console.

Which practice / programming language / environment would be the most efficient way to achieve this, considering that I want to run the script from my web host's server, which has 2.25GB RAM?

Someone suggested that I use PyQt4, but my web host warned me that this would eat up my RAM and hurt server performance. I should note that the script is supposed to be very simple and scrape only a single page, once an hour.

Answer 1:

It seems the problem could be solved with PhantomJS: it behaves like a real browser and runs the page's client-side code, so the rendered content can be extracted.

For PhantomJS with JavaScript, you may check testing-javascript-with-phantomjs

For how to use PhantomJS with Python, please take a look at this

Hope it helps~



Answer 2:

"I cannot see that string in the source"

If you only need to fetch one string from the page, you can have your program do what the page's JavaScript does. If the JS sends an AJAX request (GET or POST), send the same request from pure Python and fetch the missing string that way.
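For the POST case, here is a minimal sketch of building the same kind of request by hand. It assumes the page's script sends a form-encoded body; the function name `build_ajax_post`, the URL `http://example.com/api` and the field `id` are hypothetical placeholders, not taken from any real page. It uses the standard library's `urllib` so the request can be inspected before anything is sent.

```python
import urllib.parse
import urllib.request


def build_ajax_post(url, fields):
    """Build (but do not send) a form-encoded POST like one a page script might issue."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    # jQuery adds this header to same-origin AJAX calls; whether the
    # target server actually checks it is an assumption.
    request.add_header("X-Requested-With", "XMLHttpRequest")
    return request


# Hypothetical endpoint and form field, for illustration only.
req = build_ajax_post("http://example.com/api", {"id": "42"})
print(req.get_method(), req.full_url, req.data)
# To actually send it: urllib.request.urlopen(req, timeout=10).read()
```

The real URL, method and fields are whatever the browser's network tab shows the page sending.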

Suppose the in-page script performs the following (NB: the code might be in pure JS rather than jQuery; see here for an example):

$.ajax({
  url: "test.html",
  context: document.body
}).done(function() {
  $( this ).addClass( "done" );
});

Then in your Python script you request the same 'test.html' file:

import requests

base = 'http://example.com/'
r = requests.get(base + 'test.html')

thus getting the desired data:

print(r.headers['content-type'])
# application/json; charset=utf8
print(r.text)
# {"data":"<string>"...
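Putting the pieces together, the hourly check itself could look roughly like the sketch below. It is only an illustration under a few assumptions: the endpoint returns text or JSON, the names `fetch_text`, `contains` and `check_once` are hypothetical helpers, and the standard library's `urllib` stands in for `requests` so that nothing extra needs to be installed on the host.

```python
import json
import urllib.request


def fetch_text(url, timeout=10):
    """Download the URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def contains(body, needle):
    """True if needle is in the raw body or in any string of the decoded JSON."""
    if needle in body:
        return True
    try:
        data = json.loads(body)
    except ValueError:
        return False  # not JSON, and the raw search already failed
    stack = [data]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            if needle in item:
                return True
        elif isinstance(item, dict):
            stack.extend(item.values())
        elif isinstance(item, list):
            stack.extend(item)
    return False


def check_once(url, needle, fetch=fetch_text):
    """Fetch the endpoint once and report whether the string is present."""
    return contains(fetch(url), needle)


# Example call (URL from the example above, search string is a placeholder):
# print(check_once("http://example.com/test.html", "some string"))
```

Decoding the JSON matters because a JSON body may escape non-ASCII characters (for example `\u00e9`), in which case a plain substring search on the raw text would miss them. Running a short script like this once an hour from the host's scheduler (cron, for instance) keeps no process resident between checks, which suits a memory-constrained server better than a browser engine.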