The data on the webpage is displayed dynamically, and watching for every change in the HTML and extracting the data from it would be a daunting task that also forces me to use very unreliable XPaths. So I would like to extract the data from the XHR packets instead.

I hope to be able to extract information from XHR packets as well as generate XHR packets to send to the server. The extracting part is more important to me, because sending information can be handled easily by automatically triggering HTML elements with CasperJS.

I'm attaching a screenshot of what I mean. The text in the Response tab is the data I need to process afterwards. (This XHR response has been received from the server.)
This is not easily possible, because the resource.received event handler only provides metadata like url, headers or status, but not the actual data.
The underlying PhantomJS event handler acts the same way.

Stateless AJAX Request
If the AJAX call is stateless, you may simply repeat the request yourself.

You may want to add the event listener to resource.requested. That way you don't need to wait for the call to complete. You can also do this right inside of the control flow, as shown below (source: A: CasperJS waitForResource: how to get the resource i've waited for).
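A minimal sketch of that approach; the page URL, the resource pattern and the endpoint are placeholders, and the request is re-issued synchronously from the page context via CasperJS's __utils__.sendAJAX:

```js
var casper = require('casper').create();

casper.start('http://example.com/page-with-ajax'); // hypothetical page

casper.waitForResource(/\/api\/data/, function() {
    // Repeat the stateless request ourselves and grab the raw response text
    var response = this.evaluate(function(url) {
        return __utils__.sendAJAX(url, 'GET', null, false); // synchronous GET
    }, 'http://example.com/api/data'); // hypothetical endpoint
    this.echo(response);
});

casper.run();
```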
Stateful AJAX Request
If it is not stateless, you would need to replace the implementation of XMLHttpRequest. You will need to inject your own implementation of the onreadystatechange handler, collect the information in the page's window object and later collect it in another evaluate call.

You may want to look at the XHR faker in sinon.js, or use a complete proxy for XMLHttpRequest (I modeled mine after method 3 from How can I create a XMLHttpRequest wrapper/proxy?), along the lines of the sketch below.
If you want to capture the AJAX calls from the very beginning, you need to add this to one of the first event handlers, or evaluate(replaceXHR) when you need it. The control flow would look like this:
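A possible control flow, assuming the replaceXHR function from the sketch above is in the same script, plus a hypothetical page URL and trigger selector:

```js
var casper = require('casper').create();

casper.start('http://example.com/page-with-ajax', function() {
    // Install the proxy; requests fired during the initial page load are not captured
    this.evaluate(replaceXHR);
});

casper.then(function() {
    // Trigger the AJAX call, e.g. by clicking the element that starts it (hypothetical selector)
    this.click('#load-data');
});

casper.waitFor(function check() {
    // Wait until the wrapped onreadystatechange has stored the response
    return this.evaluate(function() {
        return !!window.myAwesomeResponse;
    });
}, function then() {
    this.echo(this.evaluate(function() {
        return window.myAwesomeResponse;
    }));
});

casper.run();
```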
As described above, I create a proxy for XMLHttpRequest so that every time it is used on the page, I can do something with it. The page that you scrape uses the xhr.onreadystatechange callback to receive data. The proxying is done by defining a specific setter function which writes the received data to window.myAwesomeResponse in the page context. The only thing you need to do is retrieve this text.

JSONP Request
Writing a proxy for JSONP is even easier if you know the prefix, i.e. the function that is called with the loaded JSON, e.g. insert({"data":["Some","JSON","here"],"id":"asdasda"}). You can overwrite insert in the page context after the page is loaded, or before the request is received (if the function is registered just before the request is invoked).
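For example, if the page registers a global insert callback for its JSONP requests (as in the example above), a sketch of the overwrite could look like this; the page URL is a placeholder:

```js
var casper = require('casper').create();

casper.start('http://example.com/page-with-jsonp', function() { // hypothetical URL
    this.evaluate(function() {
        var originalInsert = window.insert;
        // Wrap the page's JSONP callback and keep the payload for later retrieval
        window.insert = function(json) {
            window.myAwesomeResponse = JSON.stringify(json);
            if (originalInsert) {
                originalInsert.apply(this, arguments);
            }
        };
    });
});

casper.waitFor(function() {
    return this.evaluate(function() { return !!window.myAwesomeResponse; });
}, function() {
    this.echo(this.evaluate(function() { return window.myAwesomeResponse; }));
});

casper.run();
```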
Additionally, you can also directly download the content and manipulate it later. I use a small script to retrieve a JSON file and save it locally.
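A minimal sketch of that approach, assuming a hypothetical page, endpoint URL and output file name; it fetches the JSON synchronously from the page context with __utils__.sendAJAX and writes it to disk with PhantomJS's fs module:

```js
var casper = require('casper').create();
var fs = require('fs');

casper.start('http://example.com/', function() { // hypothetical page
    // Fetch the JSON from the page context...
    var json = this.evaluate(function(url) {
        return __utils__.sendAJAX(url, 'GET', null, false);
    }, 'http://example.com/api/data.json'); // hypothetical endpoint

    // ...and write it to a local file
    fs.write('data.json', json, 'w');
});

casper.run();
```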
I may be late to the party, but this answer may help someone who runs into this problem later.

I had to start with PhantomJS, then moved to CasperJS, but finally settled on SlimerJS. SlimerJS mimics the PhantomJS API, is compatible with CasperJS, and can send you back the response body through the same onResourceReceived callback, in the response.body property.
Reference: https://docs.slimerjs.org/current/api/webpage.html#webpage-onresourcereceived
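A short sketch with SlimerJS's webpage module; the URL is a placeholder, and, if I recall the SlimerJS API correctly, response.body is only filled for content types matching one of the regexes in webpage.captureContent:

```js
var page = require('webpage').create();

// Assumption: the data of interest is served as JSON, so only keep JSON bodies
page.captureContent = [/json/];

page.onResourceReceived = function(response) {
    if (response.stage === 'end' && response.body) {
        console.log('Body of ' + response.url + ':\n' + response.body);
    }
};

page.open('http://example.com/page-with-ajax', function() { // hypothetical URL
    phantom.exit();
});
```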