PhantomJS change webpage content before evaluating

Posted 2019-03-14 07:57

Question:

I'd like to either remove an HTML element or simply remove the first N characters of a webpage before evaluating/rendering it.

Is there any way to do that?

Answer 1:

The approach depends on the scenario. I will only outline the steps for each combination of answers to the following questions.

  1. Is the piece of JS called onload (ol) or is the script block immediately evaluated (ie)?
  2. Is it an inline script (is) or is the script loaded separately (src attribute) (ls)?
  3. Does the script block also contain some code that should not be removed (nr) or can it be removed completely (rc)?

1. Script is loaded separately (ls) & code can be removed completely (rc)

Register a page.onResourceRequested handler and call request.abort() when the requested URL matches the script you want to block.
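A minimal sketch of such a handler follows. The helper name installBlocker and the URL pattern are hypothetical. In PhantomJS the callback receives the request metadata as its first argument and the abortable request object as its second.

```javascript
// Hypothetical helper: abort every request whose URL matches `pattern`.
// `pattern` should be a non-global RegExp so that test() stays stateless.
function installBlocker(page, pattern) {
    var blocked = [];
    page.onResourceRequested = function (requestData, networkRequest) {
        if (pattern.test(requestData.url)) {
            blocked.push(requestData.url);
            networkRequest.abort();
        }
    };
    return blocked; // filled in as requests are aborted
}

// Usage inside a PhantomJS script (not executed here):
//   var page = require('webpage').create();
//   installBlocker(page, /unwanted\.js/);
//   page.open('http://www.example.com', function () { phantom.exit(); });
```

The handler has to be installed before page.open is called, otherwise the script may already have been requested.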

2. Script is loaded separately (ls) & contains other code too (nr)

This only works when the script blocks that follow do not depend on the code that should not be removed (which is unlikely). This is most likely necessary for click events that are registered in the DOM.

In this case, cancel the request as in 1., download the script through an XHR, remove the unwanted parts and add the remaining code to the DOM as a script block. For this to work you need to disable web security with --web-security=false, because otherwise resources that are not on the same domain cannot be requested.
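The download-filter-inject step could be sketched as follows. The function name loadFilteredScript and its parameters are hypothetical. It is meant to run in the page context, so it relies on the XMLHttpRequest and document globals of the page, and the pattern is passed as a string because arguments to page.evaluate have to be JSON-serializable.

```javascript
// Hypothetical in-page helper: fetch a script, cut out the unwanted
// part and execute the rest by appending a script element.
function loadFilteredScript(scriptUrl, patternSource) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', scriptUrl, false); // synchronous for simplicity
    xhr.send();
    var code = xhr.responseText.replace(new RegExp(patternSource, 'g'), '');
    var script = document.createElement('script');
    script.text = code;
    // Appending the element runs the filtered code.
    (document.head || document.documentElement).appendChild(script);
    return code.length;
}

// Usage from the PhantomJS side (not executed here), after the original
// request has been aborted as in 1.:
//   page.evaluate(loadFilteredScript,
//       'http://cdn.example.org/lib.js', 'tracker\\(\\);');
```

Because the filtered code runs later than the original script would have, anything that expected it to be present at parse time can still break.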

3. Script is loaded with the DOM (is) & JS executed through onload (ol) & can be removed completely (rc)

This is probably very error-prone. Start an interval with setInterval(function(){}, 5) from a page.onInitialized callback. Inside the interval, check whether window.onload (or whatever else you can get your hands on) is set in the page context. If window.onload.toString().match(/something/) confirms that it is the function you want to get rid of, remove it.

This can be done directly and completely inside the page context (inside page.evaluate).
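A rough sketch of that polling approach follows. The names guardOnload and installOnloadGuard are hypothetical, and the pattern is again passed as a string. The guard checks once immediately and then every 5 ms until it has removed a matching handler.

```javascript
// Runs in the page context; it must be self-contained because PhantomJS
// serializes the function before evaluating it.
function guardOnload(patternSource) {
    var pattern = new RegExp(patternSource);
    function tryRemove() {
        if (typeof window.onload === 'function' &&
                pattern.test(window.onload.toString())) {
            window.onload = null; // drop the unwanted handler
            return true;
        }
        return false;
    }
    if (tryRemove()) { return; }
    var timer = setInterval(function () {
        if (tryRemove()) { clearInterval(timer); }
    }, 5);
}

// Hypothetical wiring on the PhantomJS side.
function installOnloadGuard(page, patternSource) {
    page.onInitialized = function () {
        page.evaluate(guardOnload, patternSource);
    };
}
```

Handlers registered through addEventListener are not visible in window.onload, so this sketch only covers pages that assign the property directly. If no matching handler ever appears, the interval keeps running until the page goes away.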

4. Script is loaded with the DOM (is) & JS executed through onload (ol) & contains other code too (nr)

Begin like in 3., but instead of removing window.onload, you can do

eval("window.onload = " + window.onload.toString().replace(/something/,''))
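Wrapped in a function for the page context, the same idea might look like the sketch below. The name rewriteOnload is hypothetical. Note that the rebuilt function no longer has access to variables from the closure in which the original handler was defined, so this only works for handlers that rely on globals.

```javascript
// Hypothetical in-page helper: replace window.onload with a copy of
// itself from which the text matched by `patternSource` is removed.
function rewriteOnload(patternSource) {
    if (typeof window.onload !== 'function') { return false; }
    var source = window.onload.toString()
        .replace(new RegExp(patternSource), '');
    // Parentheses make eval treat the text as a function expression.
    window.onload = eval('(' + source + ')');
    return true;
}

// Usage from the PhantomJS side (not executed here):
//   page.evaluate(rewriteOnload, 'something');
```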

5. Script is loaded with the DOM (is) & the script block immediately evaluated (ie)

You can load the page source through an XHR, replace the text and assign the adjusted content to the page. The result is essentially a filled about:blank page. For this to work you need to disable web security, because otherwise no resource can be requested if it is not on the same domain: --web-security=false or --local-to-remote-url-access=true. This would also work for 3. and 4.

There is still one problem, though. Pages don't use full URLs most of the time, so when a script or element refers to stuff.php, PhantomJS cannot request it. Once page.content is set, the page URL is essentially about:blank and all requests with incomplete URLs point to file:///.... Obviously there are no such files, so those references must be replaced with their full URL counterparts.
There are three types of such URLs:

  • //example.com/resource.php (variable protocol)
  • /resource.php (variable protocol and domain)
  • resource.php (variable protocol, domain and path to resource)
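One way to expand these three forms is sketched below. The helper names absolutize and absolutizeHtml are hypothetical, and the attribute regex is a deliberate simplification: it only covers quoted src and href attributes and does not resolve ../ segments or URLs inside CSS.

```javascript
// Hypothetical helper: expand one reference against the original page URL.
function absolutize(ref, baseUrl) {
    var m = /^([a-z][a-z0-9+.\-]*:)\/\/([^\/?#]+)([^?#]*)/i.exec(baseUrl);
    if (!m) { return ref; }
    var protocol = m[1], host = m[2], path = m[3] || '/';
    // Already absolute (http:, data:, ...) or a pure fragment: leave alone.
    if (/^[a-z][a-z0-9+.\-]*:/i.test(ref) || ref.charAt(0) === '#') { return ref; }
    // //example.com/resource.php -> add the protocol
    if (ref.indexOf('//') === 0) { return protocol + ref; }
    // /resource.php -> add protocol and domain
    if (ref.charAt(0) === '/') { return protocol + '//' + host + ref; }
    // resource.php -> add protocol, domain and directory of the page
    var dir = path.substring(0, path.lastIndexOf('/') + 1) || '/';
    return protocol + '//' + host + dir + ref;
}

// Hypothetical helper: rewrite quoted src/href attributes in raw markup.
function absolutizeHtml(html, baseUrl) {
    return html.replace(/\b(src|href)\s*=\s*(["'])(.*?)\2/gi,
        function (all, attr, quote, ref) {
            return attr + '=' + quote + absolutize(ref, baseUrl) + quote;
        });
}
```

Applied to the fetched source before it is assigned to page.content, for example as absolutizeHtml(content, url), this should keep most resource requests pointing at the original server.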

Complete example:

var page = require('webpage').create(),
    url = 'http://www.example.com';

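page.open(url, function(status) {
    if (status !== 'success') {
        console.log('Unable to access network');
        phantom.exit(1); // exit here too, otherwise PhantomJS keeps running
    } else {
        // Fetch the raw page source with a synchronous XHR from inside the page.
        var content = page.evaluate(function(url){
            var xhr = new XMLHttpRequest();
            xhr.open("GET", url, false);
            xhr.send();
            return xhr.responseText;
        }, url);
        page.render("test_example.png"); // page as originally loaded
        // Replace the page with the edited markup.
        page.content = content.replace(/xample/g,"asy");
        page.render("test_easy.png"); // page built from the edited markup
        console.log("url "+page.url); // about:blank
        phantom.exit();
    }
});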

You might want to look into proper manipulation techniques apart from the simple string replace.



Tags: dom phantomjs