I use curl in PHP and httplib2 in Python to fetch URLs.
However, some pages use JavaScript (Ajax) to retrieve their data after the page has loaded, and then overwrite a specific section of the page.
So, is there any command-line utility that can handle JavaScript?
To see what I mean, go to monster.com and try searching for a job.
You'll see that the list of jobs is fetched via Ajax afterward. So if I try to pull in the jobs for my keyword search, I get the page with no jobs.
But in a browser it works.
Get Firebug and look up the URL of that Ajax request. You can then use curl with that URL.
There are two ways to handle this. Either write your screen scraper on top of a full browser-based client such as WebKit, or go to the actual page, find out what the Ajax request is doing, and make that request directly. You then need to parse the results, of course. Use Firebug to help you out.
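A rough Python sketch of the second approach: once Firebug shows which URL the page's Ajax call hits, you can request it directly and parse what comes back. The endpoint `http://example.com/ajax/jobs`, the `q` and `page` parameters, and the JSON shape with a `jobs` list are all made up for illustration; substitute whatever Firebug shows for the real site.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with the URL seen in Firebug's Net panel.
AJAX_ENDPOINT = "http://example.com/ajax/jobs"


def build_ajax_url(keyword, page=1, endpoint=AJAX_ENDPOINT):
    """Build the URL the page's JavaScript would request (parameter names are assumed)."""
    query = urllib.parse.urlencode({"q": keyword, "page": page})
    return endpoint + "?" + query


def parse_jobs(body):
    """Extract job titles from a JSON body shaped like {"jobs": [{"title": ...}]} (assumed shape)."""
    data = json.loads(body)
    return [job["title"] for job in data.get("jobs", [])]


def fetch_jobs(keyword):
    """Request the Ajax URL directly and parse the response."""
    request = urllib.request.Request(
        build_ajax_url(keyword),
        # Many jQuery-based pages send this header; some servers may expect it.
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_jobs(response.read().decode("utf-8"))
```

If the real endpoint returns an HTML fragment instead of JSON, only `parse_jobs` would need to change.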
Check out this post for more info on the subject. The upvoted answer suggests using a test tool to drive a real browser.
What's a good tool to screen-scrape with Javascript support?
I think env.js can handle <script> elements. It runs in the Rhino JavaScript interpreter and has its own XMLHttpRequest object, so you should be able to at least run the scripts manually (select all the <script> tags, get the .js file, and call eval) if it doesn't run them automatically. Be careful about running scripts you don't trust, though, since they can use any Java classes.
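The "select all the <script> tags" step can be sketched in Python with the standard library. This only gathers the script URLs and inline sources; it does not execute anything, so you would still hand the results to an interpreter such as Rhino. The names `ScriptCollector` and `collect_scripts` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptCollector(HTMLParser):
    """Collect external script URLs and inline script sources in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.external = []  # absolute URLs from <script src="...">
        self.inline = []    # source text of inline <script> blocks
        self._in_inline = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL
            self.external.append(urljoin(self.base_url, src))
        else:
            self._in_inline = True
            self.inline.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry
        if self._in_inline:
            self.inline[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False


def collect_scripts(html, base_url):
    """Return (external_urls, inline_sources) for the given page HTML."""
    collector = ScriptCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.external, collector.inline
```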
I haven't played with it since John Resig's first version, so I don't know much about how to use it, but there's a discussion group on Google Groups.
Maybe you could try and use features of HtmlUnit in your own utility?
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
It is typically used for testing purposes or to retrieve information from web sites.
Use LiveHttpHeaders, a Firefox plug-in, to see all the details of the request, and then use cURL with that URL.
LiveHttpHeaders shows the request method (POST or GET), the headers, the body, and so on.
It also shows the POST or GET parameters alongside the headers.
I think this may help you.
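If you would rather replay the captured request from Python than from cURL, a minimal sketch could look like this. The helper name `build_replay_request` and the example URL, parameters, and headers are placeholders; copy the real values from what LiveHttpHeaders shows.

```python
import urllib.parse
import urllib.request


def build_replay_request(url, method="GET", params=None, headers=None):
    """Rebuild a request captured in the browser so it can be sent from a script."""
    encoded = urllib.parse.urlencode(params or {})
    headers = headers or {}
    if method.upper() == "POST":
        # POST parameters travel form-encoded in the request body
        return urllib.request.Request(
            url, data=encoded.encode("ascii"), headers=headers, method="POST"
        )
    # GET parameters are appended to the query string
    if encoded:
        url = url + ("&" if "?" in url else "?") + encoded
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values standing in for a captured request
request = build_replay_request(
    "http://example.com/search",
    method="POST",
    params={"q": "php"},
    headers={"Referer": "http://example.com/"},
)
```

Passing the result to `urllib.request.urlopen(request)` would actually send it.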
You can use PhantomJS:
http://phantomjs.org
For example:
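var page = require("webpage").create();
page.open("http://monster.com", function (status) {
    if (status !== "success") {
        console.log("Failed to load the page");
        phantom.exit(1);
        return;
    }
    // evaluate() runs its function sandboxed inside the page, so it can use
    // the page's own scripts but cannot see the phantom object.
    var result = page.evaluate(function () {
        /* your JavaScript code here, e.g. read the DOM that Ajax filled in */
        return document.title;
    });
    console.log(result);
    phantom.exit(0);
});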