I would like to be able to fetch a web page's HTML and save it to a String so I can do some processing on it. Also, how could I handle various types of compression?
How would I go about doing that in Java?
Bill's answer is very good, but you may want to do some things with the request, like compression or user agents. The following code shows how you can add support for various types of compression to your requests.
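A minimal sketch, assuming java.net.HttpURLConnection: it advertises gzip/deflate support and wraps the response stream in the matching decompressor (the URL and charset are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.Inflater;
    import java.util.zip.InflaterInputStream;

    public class CompressedFetch {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://www.example.com"); // placeholder URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Tell the server which compression schemes we can decode.
            conn.setRequestProperty("Accept-Encoding", "gzip, deflate");

            InputStream in = conn.getInputStream();
            String encoding = conn.getContentEncoding();
            // Wrap the raw stream in the decompressor matching the response.
            if ("gzip".equalsIgnoreCase(encoding)) {
                in = new GZIPInputStream(in);
            } else if ("deflate".equalsIgnoreCase(encoding)) {
                in = new InflaterInputStream(in, new Inflater(true));
            }

            // Read the decompressed body into a single String.
            StringBuilder html = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            System.out.println(html);
        }
    }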
To also set the user agent, add the following code:
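Continuing the sketch above, this goes before getInputStream() is called (the user-agent string itself is just a placeholder):

    // Must be set before the connection is opened for reading.
    conn.setRequestProperty("User-Agent",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"); // placeholder value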
Jetty has an HTTP client which can be used to download a web page.
The example prints the contents of a simple web page.
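A minimal sketch along those lines, assuming the Jetty 9+ client API (the URL is a placeholder; plain HTTP avoids the extra TLS setup some Jetty versions need):

    import org.eclipse.jetty.client.HttpClient;
    import org.eclipse.jetty.client.api.ContentResponse;

    public class JettyFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = new HttpClient();
            client.start();
            try {
                // Issue a blocking GET and read the whole body as a String.
                ContentResponse response = client.GET("http://www.example.com");
                System.out.println(response.getContentAsString());
            } finally {
                client.stop();
            }
        }
    }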
In my Reading a web page in Java tutorial, I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.
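As a taste, the plain URL variant from that list boils down to something like this sketch (the URL and charset are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class UrlRead {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://www.example.com"); // placeholder URL
            StringBuilder html = new StringBuilder();
            // Read the page line by line into a single String.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            System.out.println(html);
        }
    }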
Here is an example of downloading the HTML of an HTTPS web page and saving it to a file. In the following example, the HTML is saved to c:\temp\filename.html. Enjoy!
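A minimal sketch of that idea, assuming java.nio.file.Files for the copy (the URL is a placeholder; the target path is the one mentioned above):

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class SaveHtml {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://www.example.com"); // placeholder URL
            // Stream the response body straight into the target file.
            try (InputStream in = url.openStream()) {
                Files.copy(in, Paths.get("c:\\temp\\filename.html"),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }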
On a Unix/Linux box you could just run 'wget', but this is not really an option if you're writing a cross-platform client. Of course, this assumes that you don't really want to do much with the data between the point of downloading it and it hitting the disk.
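If you do go that route, here is a sketch of shelling out to wget from Java, assuming wget is on the PATH (the URL and output file name are placeholders):

    public class WgetFetch {
        public static void main(String[] args) throws Exception {
            // -O names the output file; inheritIO() forwards wget's progress output.
            Process process = new ProcessBuilder(
                    "wget", "-O", "page.html", "https://www.example.com")
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new RuntimeException("wget failed with exit code " + exitCode);
            }
        }
    }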
I'd use a decent HTML parser like Jsoup. It's then as easy as:
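(A minimal sketch; the URL is a placeholder.)

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class JsoupFetch {
        public static void main(String[] args) throws Exception {
            // One call fetches the page and parses it into a Document.
            Document document = Jsoup.connect("https://www.example.com").get();
            System.out.println(document.title());
        }
    }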
It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, like HTML traversal and manipulation by CSS selectors, the way jQuery does. You only have to grab the page as a Document, not as a String. You really don't want to run basic String methods or even regex on HTML to process it.