Are there command line or library tools for render

Posted 2019-03-09 20:46

Page scraping on the Internet seems to have hit something of a wall for me, as more and more sites depend on JavaScript to render portions of the page.

It seems to me that with so many open source layout and JavaScript engines released (like WebKit, Gecko and Chromium + V8), someone must have made a tool for downloading a page and rendering its JavaScript without having to run an actual browser. However, I'm not turning up what I'm looking for with my searches. I've found tools like Selenium RC, but they depend on a running browser. I'm interested in any tool or library which can do one (or both) of the following:

  1. A program that can be run from the command line (*nix) which, given the source of a page, returns the page's source as rendered by some JS engine.

  2. Integrated support in a particular language that allows one to (easily) pass the source of a page to it and returns the page's source as rendered by some JS engine.

I think #1 is preferable in a general sense, but #2 would be more useful if the tool exists in the language I want to work in. Also, I'm not concerned with the particular JS engine - any relatively modern one will do. What is out there?

8 answers
聊天终结者
Reply #2 · 2019-03-09 20:53

It takes very little code to have a WebView render a page without displaying anything, but it has to be a GUI application. Such an application can still take command-line arguments and hide its window. Using WebKit directly, it might be possible to do this in a command-line tool.

Apart from the complicated DOM access in Objective-C, WebKit can also inject JavaScript, and together with jQuery that makes for a nice scraping solution. I don't know of any universal application doing that, though.

姐就是有狂的资本
Reply #3 · 2019-03-09 20:57

Since JavaScript can manipulate a web page's document object model (DOM) extensively, accurately scraping the content of an arbitrary page requires not only a JavaScript engine but also a complete and accurate DOM representation of the page. That's something you'll only get if you have a real browser engine instantiated. It is possible to use an embedded, non-displayed WebKit or Gecko engine for this: load the page, wait a suitable delay to allow for script execution, then dump the DOM contents in HTML form.
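As a rough illustration of the "load, wait, dump the DOM" approach, here is a minimal Python sketch that hands a page's source to a headless Chromium-family browser and captures the serialized DOM. It is not from this thread and makes several assumptions: that such a browser is installed under one of the listed binary names, and that its `--headless`, `--dump-dom` and `--virtual-time-budget` switches are available in your build. The function names are made up for the example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Assumption: binary names vary by distribution; adjust for your system.
CANDIDATE_BROWSERS = ("chromium", "chromium-browser", "google-chrome")


def find_browser():
    """Return the path of the first candidate browser on PATH, or None."""
    for name in CANDIDATE_BROWSERS:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_dump_dom_command(browser, url, budget_ms=5000):
    """Build the argv for a headless run that prints the rendered DOM to stdout."""
    return [
        browser,
        "--headless",                          # no visible window
        f"--virtual-time-budget={budget_ms}",  # the "loading delay" for scripts
        "--dump-dom",                          # serialize the DOM as HTML
        url,
    ]


def render_source(html, browser=None, budget_ms=5000, timeout_s=60):
    """Run the scripts in an HTML string and return the resulting HTML."""
    browser = browser or find_browser()
    if browser is None:
        raise FileNotFoundError("no Chromium-family browser found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        # Write the source to a temporary file so the browser can load it.
        page = Path(tmp) / "page.html"
        page.write_text(html, encoding="utf-8")
        cmd = build_dump_dom_command(browser, page.as_uri(), budget_ms)
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    return result.stdout
```

Calling `render_source("<body><script>document.body.append('hi')</script></body>")` should return markup that includes the text the script appended, provided a suitable browser is found. Because the source is loaded from a temporary `file://` URL, relative links to scripts and stylesheets will not resolve; passing the page's real URL to `build_dump_dom_command` avoids that.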

Lonely孤独者°
Reply #4 · 2019-03-09 20:57

I think there's example code for Qt that uses the included WebKit to render a page to a pixmap. Going from there to a full CLI utility is just a matter of defining your needs.

Of course, for most screen-scraping needs you want the text, not a pixmap. If that's what you want, better check Rhino.

地球回转人心会变
Reply #5 · 2019-03-09 21:02

There is the Cobra Engine for Java (http://lobobrowser.org/cobra.jsp), which handles JavaScript (it also has a renderer, but that is optional). I've never used it, but I have heard nice things said about it.

Viruses.
Reply #6 · 2019-03-09 21:05

You can look at HtmlUnit. Its main purpose is automated web testing, but I think it may let you get the rendered page.

走好不送
Reply #7 · 2019-03-09 21:10

We used Rhino some time ago to do some automated testing from Java. It seems it'll do the job for you :)
