可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
Are there any libraries or frameworks that provide the functionality of a browser, but do not need to actually render physically onto the screen?
I want to automate navigation on web pages (Mechanize does this, for example), but I want the full browser experience, including Javascript. Thus, I'd like to have a virtual browser of some sort, that I can use to "click on links" programmatically, have DOM elements and JS scripts render within it, and manipulate these elements.
Solution preferably in Python, but I can manage others.
回答1:
PhantomJS and PyPhantomJS are what I use for tasks like these.
What it is, is a headless WebKit based browser which is fully controllable via JavaScript. There's a C++ implementation (PhantomJS) and a Python one (PyPhantomJS). I prefer the Python one though, because it has a plugin system which allows you to add functionality to the core without actually modifying any code, unlike the C++ one. :)
回答2:
There is an absolute ton of free software technology now available: take your pick at http://wiki.python.org/moin/WebBrowserProgramming but if you have specific questions join pyjamas-dev on google groups and i'll be happy to give further details, there. brief answer: you can run pywebkitgtk "headless", or you can use xulrunner (via python-hulahop) again using pygtk without actually doing "browserwidget.show()", and there's also pykhtml. also you could use python COM to connect to MSHTML.DLL.
these are all "cheat" methods: using python bindings to a graphical web browser engine without actually firing up the graphical bit. if you really wanted to put some serious hard-core programming in, you could create a "port" of webkit which was not connected to a GUI toolkit: as an experienced webkit programmer i'd put it as around... 2 weeks of full-time effort to make such a "headless" version of webkit.
l.
回答3:
Looks like http://watin.sourceforge.net/ might be a good way to go.
If you don't have to go pure Python, you could do IronPython since it's a C# project.
回答4:
take a look at this little doosy on ajaxian
http://ajaxian.com/archives/server-side-rendering-with-yui-on-node-js
It also talks about Aptana Jaxer which I think runs on a headless firefox so is basically the Mozilla browser engine in all it's glory.
回答5:
There is Kapow. Its pure Java and costs money:
http://kapowtech.com/
And there is Lixto: Its Eclipse based and uses Mozilla Gecko as rendering engine (unless they already changed it to WebKit, as they said they'll do years ago). Its very nice and also costs money:
http://www.lixto.com/?page_id=50
They are both graphical tools where you define the site navigation and what should be extracted by point and click. But you can also write xpath and regular expressions and even JavaScript that runs in the sites context.
I used them both in the lectures web data extraction and applied web data extraction at the technical university Vienna (Lixto is written by the Professor who held the lecture).
回答6:
HTMLUnit in Java is very good. I think it's only the Java implementations of headless browsers that manage to provide Javascript support.
MaxQ, I read about here, sounds like it might be interesting: "written in Java, generates Jython scripts"
回答7: