I am attempting to scrape the below website:
If you click the small "Export Data" button at the top-right of the table, a JavaScript function runs and my browser downloads the data as a .csv file. I'd like to write a PhantomJS script that does this automatically. Any ideas?
The button is coded in the HTML like this:
<a id="LB_cmdCSV" href="javascript:__doPostBack('LB$cmdCSV','')">Export Data</a></div>
I also found this function in the HTML source code:
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
    theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>
I'm very new to PhantomJS/Javascript and could use some pointers here. I think I've found all the info I need to do this automatically (correct me if I'm wrong), but just not sure where to start on coding it. Thanks for any help.
EDIT - This is what my script looks like right now:
var page = new WebPage();
url = 'http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2011&month=0&season1=2011&ind=0&team=0&rost=0& players=0';
page.open(encodeURI(url), function (status){
    if (status !== "success") {
        console.log("Unable to access website");
    } else {
        page.evaluate(function() {
            __doPostBack('LB$cmdCSV', '');
        });
    }
    phantom.exit(0);
});
Couldn't you just run the code
__doPostBack('LeaderBoard1$cmdCSV','');
within the context of the web page? Something like this:
I haven't tested this code within PhantomJS, but theoretically it should work since running the __doPostBack method from Google Chrome's developer console worked. If in doubt about running JavaScript code in PhantomJS, Google Chrome's developer console is a great way to test out the code as it runs on WebKit like PhantomJS. I hope this helps.
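One thing to watch when calling __doPostBack from page.evaluate is that the form submit is asynchronous, so calling phantom.exit() straight afterwards ends the script before the server has replied. Here is a minimal, untested sketch that waits for the reply instead. It assumes the server answers the post-back with a CSV attachment; isCsvAttachment is a hypothetical helper, the 15 second timeout is arbitrary, and the URL is the one from the question with the stray space removed. As far as I know PhantomJS does not save downloads by itself, so this only confirms that the post-back was answered.

```javascript
// Hypothetical helper: does this onResourceReceived response look like a CSV download?
function isCsvAttachment(response) {
    var headers = response.headers || [];
    for (var i = 0; i < headers.length; i++) {
        var name = String(headers[i].name).toLowerCase();
        var value = String(headers[i].value).toLowerCase();
        if (name === 'content-disposition' && value.indexOf('attachment') !== -1) {
            return true;
        }
        if (name === 'content-type' && value.indexOf('csv') !== -1) {
            return true;
        }
    }
    return false;
}

// The part below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    // URL from the question, with the stray space before "players" removed.
    var url = 'http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2011&month=0&season1=2011&ind=0&team=0&rost=0&players=0';

    // Exit only once the server has answered the post-back.
    page.onResourceReceived = function (response) {
        if (isCsvAttachment(response)) {
            console.log('Post-back answered by: ' + response.url);
            phantom.exit(0);
        }
    };

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access website');
            phantom.exit(1);
            return;
        }
        // Runs inside the page, where __doPostBack is defined.
        page.evaluate(function () {
            __doPostBack('LeaderBoard1$cmdCSV', '');
        });
        // Assumed safety net: give up if nothing CSV-like arrives in 15 s.
        setTimeout(function () { phantom.exit(2); }, 15000);
    });
}
```

The event target here is the one from this answer; the question's HTML shows 'LB$cmdCSV', so use whichever the live page actually contains.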
What has worked very well for me is simulating mouse clicks on the desired element.
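A rough sketch of the click approach in PhantomJS, under a few assumptions: the link id LB_cmdCSV is taken from the question's HTML, centerOf is a hypothetical helper, the page URL is passed as the first command-line argument, and the enlarged viewport and 5 second wait are guesses rather than known-good values.

```javascript
// Hypothetical helper: centre point of a bounding rectangle.
function centerOf(rect) {
    return {
        x: rect.left + rect.width / 2,
        y: rect.top + rect.height / 2
    };
}

// The part below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = require('system').args[1]; // page URL given on the command line

    // Assumption: a large viewport makes it more likely the link is inside it.
    page.viewportSize = { width: 1280, height: 2000 };

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access website');
            phantom.exit(1);
            return;
        }
        // Look up the link's position inside the page context.
        var rect = page.evaluate(function () {
            var el = document.getElementById('LB_cmdCSV');
            if (!el) { return null; }
            var r = el.getBoundingClientRect();
            return { left: r.left, top: r.top, width: r.width, height: r.height };
        });
        if (!rect) {
            console.log('Export link not found');
            phantom.exit(1);
            return;
        }
        // Send a native mouse click at the centre of the link.
        var point = centerOf(rect);
        page.sendEvent('click', point.x, point.y);
        // Assumed delay so the post-back has time to complete.
        setTimeout(function () { phantom.exit(0); }, 5000);
    });
}
```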
It's an ASP.NET-powered website, so this is going to be a tad trickier than most: you will have to use cURL commands to mimic POSTing the entire form, including the viewstate and eventvalidation strings, back to the server. It would probably be easier just to lift the data straight out of the page you have.
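To make the "POST the whole form back" idea concrete, here is a small sketch that builds the request body from the page's HTML. Both function names are made up for illustration, the regex-based parsing is a shortcut that assumes simple quoted attributes without HTML entities, and the server may also expect the session cookies from the first request.

```javascript
// Hypothetical helper: collect every <input type="hidden"> as name -> value.
// This picks up __VIEWSTATE and __EVENTVALIDATION on a typical ASP.NET page.
function extractHiddenFields(html) {
    var fields = {};
    var inputs = html.match(/<input\b[^>]*>/gi) || [];
    for (var i = 0; i < inputs.length; i++) {
        var tag = inputs[i];
        if (!/type\s*=\s*["']hidden["']/i.test(tag)) {
            continue;
        }
        var name = /\bname\s*=\s*["']([^"']*)["']/i.exec(tag);
        var value = /\bvalue\s*=\s*["']([^"']*)["']/i.exec(tag);
        if (name) {
            fields[name[1]] = value ? value[1] : '';
        }
    }
    return fields;
}

// Hypothetical helper: URL-encoded body equivalent to what __doPostBack submits.
function buildPostBackBody(html, eventTarget, eventArgument) {
    var fields = extractHiddenFields(html);
    fields.__EVENTTARGET = eventTarget;
    fields.__EVENTARGUMENT = eventArgument || '';
    return Object.keys(fields).map(function (key) {
        return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
    }).join('&');
}
```

The resulting string can be handed to curl via --data, or to PhantomJS as page.open(url, 'POST', body, callback).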
I'm using Ruby on Rails and Watir Webdriver (https://github.com/watir/watir-webdriver).
I found that ASP.NET, when handling "doPostBack", identifies the browser from the User-Agent sent by the client. With PhantomJS the user agent is reported as something like "Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/534.34 (KHTML, like Gecko) Safari/534.34 PhantomJS/1.9.1".
Therefore it is necessary to change the client's user agent before accessing the page. In Rails I did something like:
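For anyone scripting PhantomJS directly rather than through Watir, the same idea can be sketched with page.settings.userAgent, which has to be set before the page is opened. This is an illustration only: the user-agent string is an arbitrary desktop Chrome value picked for the example, useDesktopUserAgent is a hypothetical helper, and the page URL is expected as the first command-line argument.

```javascript
// Assumed example value; any mainstream desktop browser string should do.
var DESKTOP_USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36';

// Hypothetical helper: override the user agent on a PhantomJS page object.
function useDesktopUserAgent(page) {
    page.settings.userAgent = DESKTOP_USER_AGENT;
    return page;
}

// The part below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = useDesktopUserAgent(require('webpage').create());
    page.open(require('system').args[1], function (status) {
        // Report what the page itself sees, to confirm the override took effect.
        var seen = page.evaluate(function () {
            return navigator.userAgent;
        });
        console.log(status + ': page sees user agent ' + seen);
        phantom.exit(status === 'success' ? 0 : 1);
    });
}
```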