I want a script that will scrape a certain web page every hour, and will look for a certain string inside that page.
However, when I open that page and use `view-source:`, I cannot see that string in the source. I was told that this is because the string comes from an element that is rendered on the client side (by JavaScript), so I can see it only when I manually inspect that element, for example with the Chrome console.
Which practice, programming language, or environment would be the most efficient way to achieve this, considering that I want to run the script from my web host's server, which has 2.25 GB of RAM?
Someone suggested that I use PyQt4, but my web host warned me that this would eat up my RAM and hurt server performance. I should note that the script is supposed to be very simple: it scrapes only a single page, once an hour.
It seems that the problem could be solved with PhantomJS, since it behaves like a real browser: it executes the page's client-side code, so you can extract the information that code generates.
For PhantomJS with JavaScript, you may check testing-javascript-with-phantomjs.
For how to use PhantomJS with Python, please take a look at this.
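As a rough sketch of the Python route, one option is to launch the `phantomjs` binary from Python only for the duration of each check and read the rendered HTML from its output. This assumes `phantomjs` is installed and on the `PATH`; the names `RENDER_JS`, `build_command`, `rendered_html` and `page_contains`, as well as the 2-second wait for the page's scripts to finish, are my own choices for illustration and not something from the question.

```python
import os
import subprocess
import tempfile

# Small PhantomJS script: open the URL given as the first argument, wait a
# moment for client-side code to run, then print the rendered DOM.
# The 2000 ms delay is an assumption; tune it for the page in question.
RENDER_JS = """
var page = require('webpage').create();
var system = require('system');
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        phantom.exit(1);
    } else {
        window.setTimeout(function () {
            console.log(page.content);
            phantom.exit(0);
        }, 2000);
    }
});
"""


def build_command(script_path, url, binary="phantomjs"):
    """Build the command line: phantomjs <script> <url>."""
    return [binary, script_path, url]


def rendered_html(url, timeout=60):
    """Run PhantomJS once and return the page HTML after JavaScript has run."""
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(RENDER_JS)
        script_path = f.name
    try:
        result = subprocess.run(
            build_command(script_path, url),
            capture_output=True, text=True, timeout=timeout, check=True,
        )
        return result.stdout
    finally:
        os.remove(script_path)  # clean up the temporary script


def page_contains(url, needle):
    """True if the rendered page contains the string we are looking for."""
    return needle in rendered_html(url)
```

For example, `page_contains("http://example.com/", "some string")` would start PhantomJS, render the page and report whether the string is present. The browser process exits after each check, so it only uses memory while the check is running.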
Hope it helps~
> I cannot see that string in the source
If you only need to fetch one string from the page, you can have your program do the same thing the page's JS does.
If the JS sends an AJAX request (GET or POST), you can send the same request from pure Python and fetch the missing string that way.
Suppose the in-page script performs the following (NB: the code might be in pure JS rather than jQuery; see here for an example):
    $.ajax({
      url: "test.html",
      context: document.body
    }).done(function() {
      $( this ).addClass( "done" );
    });
Then in your Python script you request the 'test.html' file:
    import requests
    base = 'http://example.com/'
    r = requests.get(base + 'test.html')
and thus get the desired data:
    print(r.headers['content-type'])
    # 'application/json; charset=utf8'
    print(r.text)
    # '{"data":"<string>"...'
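To turn this into the hourly check from the question, a minimal sketch could look like the following. I've used the standard library's `urllib.request` here instead of `requests` so nothing extra has to be installed on the host; the `requests` call above works the same way. `ENDPOINT` and `NEEDLE` are placeholders, and the function names are made up for illustration.

```python
import json
import urllib.request

ENDPOINT = "http://example.com/test.html"  # placeholder: the URL the page's JS requests
NEEDLE = "<string>"                        # placeholder: the string to look for


def fetch_text(url, timeout=30):
    """Download the response body and decode it using the declared charset."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)


def _search(node, needle):
    """Recursively look for needle in the string values of parsed JSON."""
    if isinstance(node, str):
        return needle in node
    if isinstance(node, dict):
        return any(_search(v, needle) for v in node.values())
    if isinstance(node, list):
        return any(_search(v, needle) for v in node)
    return False


def find_in_payload(body, needle):
    """Search JSON string values; fall back to a plain substring test
    when the body is not valid JSON."""
    try:
        payload = json.loads(body)
    except ValueError:
        return needle in body
    return _search(payload, needle)


def main():
    found = find_in_payload(fetch_text(ENDPOINT), NEEDLE)
    print("found" if found else "not found")
```

Call `main()` at the bottom of the script and schedule it with cron, e.g. `0 * * * * python3 /path/to/check.py` (the path is hypothetical). A script like this makes a single HTTP request and exits, so the 2.25 GB RAM limit should not be a concern.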