I need a workflow like below:
// load a page in the browser window
// the browser is live, meaning users can interact with it
browser.load("http://www.google.com");
// return the HTML of the initially loaded page
String page = browser.getHTML();
// after some time
// user might have navigated to a new page, get HTML again
String newpage = browser.getHTML();
I am surprised to see how hard this is to do with Java GUIs such as JavaFX (http://lexandera.com/2009/01/extracting-html-from-a-webview/) and Swing.
Is there some simple way to get this functionality in Java?
Depending on details I don't know about your project, this is either genius or moronic, but you could use a real browser instead and instrument it with Selenium WebDriver. I'm only suggesting this because it appears from the other answer that you are going down a difficult path.
There's another question about extracting HTML with WebDriver here. It's about using Python, but WebDriver has a Java API as well.
Here is a contrived example using JavaFX that prints the HTML content to System.out; it should not be too complicated to adapt it into a getHtml() method. (I have tested it with JavaFX 8, but it should work with JavaFX 2 too.) The code will print the HTML content every time a new page is loaded.
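The original listing is not reproduced here, so the following is only a minimal sketch of the same idea, assuming JavaFX 8 or later is on the classpath. The class name HtmlPrintingBrowser and the window size are made up for illustration, and instead of a printDocument helper it reads the markup through WebEngine.executeScript, which is a different technique from the one described in this answer.

```java
import javafx.application.Application;
import javafx.concurrent.Worker;
import javafx.scene.Scene;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

// Hypothetical class name; a sketch, not the answer's original code.
public class HtmlPrintingBrowser extends Application {

    @Override
    public void start(Stage stage) {
        WebView view = new WebView();
        WebEngine engine = view.getEngine();

        // The load worker changes state for every page load, including
        // pages the user navigates to by clicking links in the WebView.
        engine.getLoadWorker().stateProperty().addListener((obs, oldState, newState) -> {
            if (newState == Worker.State.SUCCEEDED) {
                System.out.println(getHtml(engine));
            }
        });

        engine.load("http://www.google.com");
        stage.setScene(new Scene(view, 1024, 768)); // arbitrary window size
        stage.show();
    }

    // Returns the markup of the currently loaded page.
    // Must be called on the JavaFX Application Thread.
    static String getHtml(WebEngine engine) {
        return (String) engine.executeScript("document.documentElement.outerHTML");
    }

    public static void main(String[] args) {
        launch(args);
    }
}
```

Calling getHtml before the worker reaches SUCCEEDED is likely to give you an empty or partial document, which is why the sketch only reads the markup from inside the listener.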
Note: I have borrowed the printDocument code from this answer.
You may want to look at djproject. But you may well find JavaFX easier to use.
Below you will find a SimpleBrowser component, which is a Pane containing a WebView.
Source code at gist.
Sample usage:
browser.getHTML() is put inside a Runnable because one needs to wait for a web page to download and render. Invoking this method before the page has loaded returns an empty page, so wrapping the call in a Runnable is a simple way I came up with to wait for the page to load.
Demo Browser:
There is not a simple solution. In fact, there might not even be a solution at all short of building your own browser.
The key issue is interaction. If you want to display content only, then JEditorPane and many third-party libraries make that a more attainable goal. If you really need a user interacting with a webpage, then either:
On the returning-the-HTML side of things, it sounds like you are trying to capture history or refresh the page. In either case, it sounds like you are using the wrong technology. Either modify the original site, or add some JavaScript in the browser with Greasemonkey or something similar.
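For the display-only case mentioned above, here is a rough sketch of what JEditorPane gives you. The class name EditorPaneHtml and the method roundTrip are made up for illustration, and the markup is passed in as a string rather than loaded with setPage(url) so that the example runs without a network connection. Keep in mind that JEditorPane's HTML support is old (roughly HTML 3.2, no JavaScript), so it will not render most modern sites properly.

```java
import javax.swing.JEditorPane;

// Hypothetical class name; a display-only sketch using plain Swing.
public class EditorPaneHtml {

    // Loads markup into a JEditorPane and reads it back as the pane sees it.
    // In a real GUI this should run on the Swing event dispatch thread.
    static String roundTrip(String html) {
        JEditorPane pane = new JEditorPane();
        pane.setEditable(false);          // display only, no editing
        pane.setContentType("text/html"); // switch the pane to its HTML editor kit
        pane.setText(html);
        return pane.getText();            // HTML as re-serialized by the editor kit
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<html><body><p>Hello</p></body></html>"));
    }
}
```

The string you get back is the pane's own serialization of the document, so it will usually not match the input byte for byte.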