I am using PhantomJS and CasperJS for screen scraping. The issue I am facing is that they use too much CPU, which makes me think this approach might not be scalable. Are there any ways to reduce CPU usage? Some that I can think of are:
1) Disable image loading
2) Disable JavaScript loading
I also want to know whether Python is lighter (in terms of CPU usage) than PhantomJS for scraping.
After five and a half years I doubt you are still having this issue, but if anyone else stumbles across this problem, here's the solution: after finishing scraping, quit the browser by calling browser.quit(), where browser is the name of the variable you assigned it to.
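A minimal sketch of the cleanup this answer describes. The `make_browser` factory is a hypothetical stand-in for however you create your browser (e.g. a Selenium WebDriver constructor); any object exposing `get()`, `page_source`, and `quit()` works. The point is the `try`/`finally`, which guarantees the browser process is shut down and stops consuming CPU even if the scrape raises an exception:

```python
def scrape(url, make_browser):
    """Fetch a page and always release the browser afterwards."""
    browser = make_browser()
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # kill the browser process even on error
```

Without the `finally`, a crash mid-scrape leaves an orphaned browser process running and eating CPU, which is exactly the symptom described in the question.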
Why CasperJS / PhantomJS only? Are you scraping websites that load content with JavaScript? Any tool that doesn't run a full WebKit browser will be more lightweight than one that does.
As mentioned in the comments, you can use wget or curl on Linux systems to dump web pages to files or stdout. There are many libraries that can handle and parse raw HTML, such as cheerio for Node.js.
Still want some form of scripting? Since you mentioned Python, there is a tool called Mechanize that does just that without running WebKit. It's not as powerful as CasperJS / PhantomJS, but it lets you do many of the same things (filling out forms, clicking links, etc.) with a much smaller footprint.
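A rough sketch of what that looks like with the third-party mechanize package (`pip install mechanize`). The URL and field names (`username`, `password`) are hypothetical placeholders; you would match them to the actual form on the page you're scraping:

```python
def submit_first_form(url, username, password):
    """Open a page, fill the first form, and submit it, without a browser."""
    # Imported inside the function so this sketch can be loaded
    # even where mechanize isn't installed.
    import mechanize

    br = mechanize.Browser()
    br.open(url)
    br.select_form(nr=0)          # pick the first form on the page
    br["username"] = username     # hypothetical field names
    br["password"] = password
    response = br.submit()
    return response.read()
```

Because mechanize only does HTTP and HTML form handling, it won't execute any JavaScript on the page, which is exactly why it uses so much less CPU than PhantomJS.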