I'd like to either remove an HTML element or simply remove the first N characters of a webpage before it is evaluated/rendered.
Is there any way to do that?
It depends on multiple scenarios. I will only outline the steps for each combination of the answers to the following questions.
1. Script is loaded separately (ls) & code can be removed completely (rc)
Register to the `onResourceRequested` listener and `request.abort()` depending on the matched URL.

2. Script is loaded separately (ls) & contains other code too (nr)
This only works when the code that stays does not depend on the code that is removed (which is unlikely). This is most likely necessary for click events that are registered in the DOM.
In this case cancel the request as in 1., download the script through an XHR, remove the unwanted parts and add the remaining code block to the DOM. For this to work, you would need to disable web security, because otherwise no resource can be requested if it is not on the same domain:
`--web-security=false`.

3. Script is loaded with the DOM (is) & JS executed through `onload` (ol) & can be removed completely (rc)
This is probably very error-prone. You would start an interval with `setInterval(function(){}, 5)` from a `page.onInitialized` callback. Inside the interval you would need to check whether `window.onload` (or something else you can get your hands on) is set in the page context. You remove it if it is indeed the function that you wanted to remove, which you can check with `window.onload.toString().match(/something/)`.
This can be done directly and completely inside the page context (inside `page.evaluate`).

4. Script is loaded with the DOM (is) & JS executed through `onload` (ol) & contains other code too (nr)
Begin like in 3., but instead of removing `window.onload`, you would replace it with an adjusted version that leaves out the unwanted code.

5. Script is loaded with the DOM (is) & the script block immediately evaluated (ie)
You can load the page as an XHR, replace the text and apply the adjusted content to the page. This will essentially be a filled `about:blank` page. For this to work, you would need to disable web security, because otherwise no resource can be requested if it is not on the same domain: `--web-security=false` or `--local-to-remote-url-access=true`. This would also work for 3. and 4.

There is still one problem though. Pages don't use full URLs most of the time. So when a script or element refers to `stuff.php`, PhantomJS cannot request it. When `page.content` is set, the page URL is essentially `about:blank` and all requests with incomplete URLs point to `file:///...`. Obviously there are no such files. Those resources must be replaced with their full URL counterparts.

There are three types of such URLs:
- `//example.com/resource.php`: variable protocol
- `/resource.php`: variable protocol and domain
- `resource.php`: variable protocol, domain and path to resource

Complete example:
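What follows is only a minimal sketch of the URL-rewriting step, not a full script. The helper name `absolutizeUrls` and the regular expressions are assumptions made for illustration. It only looks at quoted `src` and `href` attributes and handles the three URL types listed above with a plain string replace.

```javascript
// Hypothetical helper (not part of PhantomJS): rewrites incomplete URLs in
// quoted src/href attributes so that they no longer resolve to file:///...
// pageUrl is the URL the markup was originally downloaded from.
function absolutizeUrls(html, pageUrl) {
    var m = /^(https?:)\/\/([^\/?#]+)([^?#]*)/.exec(pageUrl);
    if (!m) {
        return html; // not an http(s) URL, leave the markup untouched
    }
    var protocol = m[1];                 // e.g. "http:"
    var origin = protocol + "//" + m[2]; // e.g. "http://example.com"
    var path = m[3] || "/";
    // Directory of the page, including the trailing slash.
    var dir = origin + path.substring(0, path.lastIndexOf("/") + 1);

    return html.replace(/\b(src|href)=(["'])([^"']*)\2/g, function (all, attr, quote, url) {
        var fixed;
        if (url === "" || url.charAt(0) === "#" || /^[a-z][a-z0-9+.-]*:/i.test(url)) {
            fixed = url;            // empty, fragment, or already has a scheme
        } else if (url.indexOf("//") === 0) {
            fixed = protocol + url; // variable protocol
        } else if (url.charAt(0) === "/") {
            fixed = origin + url;   // variable protocol and domain
        } else {
            fixed = dir + url;      // variable protocol, domain and path
        }
        return attr + "=" + quote + fixed + quote;
    });
}
```

In a PhantomJS script, the markup fetched through the XHR would be passed through this function (together with any other text replacement) before being assigned to `page.content`. URLs inside CSS, inline scripts or unquoted attributes are not covered by this sketch.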
You might want to look into proper manipulation techniques apart from the simple string replace.
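For scenario 1, the request filtering could look roughly like the sketch below. The patterns, the helper name `shouldAbort` and the URL `http://example.com/` are placeholders, not taken from any real page. The PhantomJS wiring is guarded so that the matching logic can also be tried outside of PhantomJS.

```javascript
// Hypothetical patterns; replace them with whatever identifies the unwanted script.
var blockedPatterns = [/\/ads\/.*\.js(\?|$)/, /tracker\.js/];

// Returns true when the URL matches at least one of the patterns.
function shouldAbort(url, patterns) {
    return patterns.some(function (re) {
        return re.test(url);
    });
}

// Wiring; only executed when the script runs under PhantomJS.
if (typeof phantom !== "undefined") {
    var page = require("webpage").create();

    page.onResourceRequested = function (requestData, networkRequest) {
        if (shouldAbort(requestData.url, blockedPatterns)) {
            networkRequest.abort(); // the script is never downloaded or executed
        }
    };

    page.open("http://example.com/", function (status) {
        console.log("status: " + status);
        phantom.exit();
    });
}
```

Keeping the URL check in a separate function makes it easy to test the patterns without loading a page.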