I'm trying to get PhantomJS to take an HTML string and render the full page as a browser would (including executing any JavaScript in the page source). I need the resulting HTML as a string. I have seen examples of page.open, which is of no use to me since I already have the page source in my database.
Do I need to use page.open to trigger the JavaScript rendering engine in PhantomJS? Is there any way to do this entirely in memory (i.e., without page.open making a request or reading/writing the HTML source from/to disk)?
I have seen a similar question and answer here but it doesn't quite solve my issue. After running the code below, nothing I do seems to render the javascript in the html source string.
var page = require('webpage').create();
page.setContent('raw html and javascript in this string', 'http://whatever.com');
//everything i've tried from here on doesn't execute the javascript in the string
--------------Update---------------
Tried the following based on the suggestion below but this still does not work. Just returns the raw source that I supplied with no javascript rendered.
var page = require('webpage').create();
page.settings.localToRemoteUrlAccessEnabled = true;
page.settings.webSecurityEnabled = false;
page.onLoadFinished = function(){
    var resultingHtml = page.evaluate(function() {
        return document.documentElement.innerHTML;
    });
    console.log(resultingHtml);
    //console.log(page.content); // this didn't work either
    phantom.exit();
};
page.url = input.Url;
page.content = input.RawHtml;
//page.setContent(input.RawHtml, input.Url); //this didn't work either
The following works:
page.onLoadFinished = function(){
    console.log(page.content); // rendered content
};
page.content = "your source html string";
But keep in mind that if you set the page from a string, the domain will be about:blank. So if the HTML loads resources from other domains, you should run PhantomJS with the --web-security=false and --local-to-remote-url-access=true command-line options:
phantomjs --web-security=false --local-to-remote-url-access=true script.js
Additionally, you may need to wait for the page's JavaScript to finish executing, since it might not be done when PhantomJS reports the load as finished. Use either setTimeout() to wait a fixed amount of time or waitFor() to wait for a specific condition on the page. More robust ways to wait for a full page are given in this question: phantomjs not waiting for “full” page load
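To make the waitFor() idea above concrete, here is a minimal sketch of a polling helper. It is written for this post rather than copied from the PhantomJS examples, so the signature is an assumption, and the test page and its `done` element are hypothetical. It checks the condition once immediately and then on an interval until it passes or the timeout elapses.

```javascript
// Poll testFx until it returns true or timeoutMs elapses.
// The first check runs immediately; later checks run every intervalMs.
function waitFor(testFx, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();

    // Returns true once one of the two callbacks has been invoked.
    function check() {
        if (testFx()) {
            onReady();
            return true;
        }
        if (Date.now() - start >= timeoutMs) {
            onTimeout();
            return true;
        }
        return false;
    }

    if (check()) {
        return;
    }
    var timer = setInterval(function () {
        if (check()) {
            clearInterval(timer);
        }
    }, intervalMs);
}

// PhantomJS-only part; skipped when the `phantom` global is absent.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // Hypothetical page whose script adds an element with id "done" after 200 ms.
    page.content = '<html><body><script>' +
        'setTimeout(function () {' +
        '  var d = document.createElement("div");' +
        '  d.id = "done";' +
        '  document.body.appendChild(d);' +
        '}, 200);' +
        '</script></body></html>';

    waitFor(
        function () {
            // The condition is evaluated inside the page on every poll.
            return page.evaluate(function () {
                return document.getElementById('done') !== null;
            });
        },
        function () {
            console.log(page.content);
            phantom.exit();
        },
        function () {
            console.log('Timed out waiting for #done');
            phantom.exit(1);
        },
        5000, // give up after 5 seconds (arbitrary)
        100   // poll every 100 ms (arbitrary)
    );
}
```

This only helps when you know some condition that signals the page is done, which is exactly the limitation to keep in mind for arbitrary pages.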
The setTimeout made it work even though I'm not excited to wait a set amount of time for each page. The waitFor approach that is discussed here doesn't work since I have no idea what elements each page might have.
var system = require('system');
var page = require('webpage').create();
page.setContent(input.RawHtml, input.Url);
window.setTimeout(function () {
    console.log(page.content);
    phantom.exit();
}, input.WaitToRenderTimeInMilliseconds);
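If a fixed delay per page feels wasteful, one alternative (not from the posts above, just a heuristic sketch) is to watch network activity and dump the page once nothing has been in flight for a short quiet period, with an upper bound as a safety net. The helper name `createRequestTracker`, the 500 ms and 10 s values, and the placeholder HTML and URL are all assumptions for illustration. Note that this will not notice scripts that only use timers and make no requests.

```javascript
// Keeps a set of in-flight request ids and the time of the last network event.
// Times are passed in explicitly so the logic does not depend on the clock.
function createRequestTracker(now) {
    var pending = {};
    var pendingCount = 0;
    var lastActivity = now;

    return {
        requested: function (id, at) {
            if (!pending[id]) {
                pending[id] = true;
                pendingCount += 1;
            }
            lastActivity = at;
        },
        finished: function (id, at) {
            if (pending[id]) {
                delete pending[id];
                pendingCount -= 1;
            }
            lastActivity = at;
        },
        // Idle means: nothing in flight and no network event for quietMs.
        isIdle: function (quietMs, at) {
            return pendingCount === 0 && at - lastActivity >= quietMs;
        }
    };
}

// PhantomJS-only part; skipped when the `phantom` global is absent.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var startedAt = Date.now();
    var tracker = createRequestTracker(startedAt);
    var quietMs = 500;  // assumed quiet period
    var maxMs = 10000;  // assumed upper bound on the total wait

    page.onResourceRequested = function (requestData) {
        tracker.requested(requestData.id, Date.now());
    };
    page.onResourceReceived = function (response) {
        // Each resource reports a 'start' and an 'end' stage.
        if (response.stage === 'end') {
            tracker.finished(response.id, Date.now());
        }
    };
    page.onResourceError = function (resourceError) {
        tracker.finished(resourceError.id, Date.now());
    };

    // Placeholder HTML and base URL; substitute your own values.
    page.setContent('<html><body><p>raw html here</p></body></html>', 'http://example.com/');

    var timer = setInterval(function () {
        var now = Date.now();
        if (tracker.isIdle(quietMs, now) || now - startedAt >= maxMs) {
            clearInterval(timer);
            console.log(page.content);
            phantom.exit();
        }
    }, 100);
}
```

For pages that load nothing extra, this returns after roughly the quiet period instead of the full fixed wait.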
Maybe not the answer you want, but using PhantomJsCloud.com you can do it easily. Here's an example: http://api.phantomjscloud.com/api/browser/v2/a-demo-key-with-low-quota-per-ip-address/?request={url:%22http://example.com%22,content:%22%3Ch1%3ENew%20Content!%3C/h1%3E%22,renderType:%22png%22,scripts:{domReady:[%22var%20hiDiv=document.createElement%28%27div%27%29;hiDiv.innerHTML=%27Hello%20World!%27;document.body.appendChild%28hiDiv%29;window._pjscMeta.scriptOutput={Goodbye:%27World%27};%22]},outputAsJson:false} In that request, "New Content!" is the content that replaces the original page content, and "Hello World!" is added to the page by a script.
If you want to do this with plain PhantomJS, you'll need to use the injectJs or includeJs functions after the page content has loaded.
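As a rough local equivalent of the cloud example, the sketch below sets the content from a string and then adds a "Hello World!" element once loading has finished. It uses page.evaluate instead of injectJs or includeJs, because those need a script file or URL; `appendGreeting` is a hypothetical helper name, and `extra.js` in the comment is a hypothetical file. It follows the onLoadFinished pattern from the answer above, so the same caveats about timing apply.

```javascript
// Runs a function inside the page that appends a <div> containing `text`.
// Returns the number of element children in <body> afterwards.
function appendGreeting(page, text) {
    return page.evaluate(function (t) {
        var div = document.createElement('div');
        div.textContent = t;
        document.body.appendChild(div);
        return document.body.children.length;
    }, text);
}

// PhantomJS-only part; skipped when the `phantom` global is absent.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onLoadFinished = function () {
        appendGreeting(page, 'Hello World!');
        // page.injectJs('extra.js') could be called here instead;
        // 'extra.js' is a hypothetical local script file.
        console.log(page.content);
        phantom.exit();
    };
    // Setting content from a string triggers the load without page.open.
    page.content = '<html><body><h1>New Content!</h1></body></html>';
}
```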