How do you Programmatically Download a Webpage in

2018-12-31 14:10发布

I would like to be able to fetch a web page's html and save it to a String, so I can do some processing on it. Also, how could I handle various types of compression.

How would I go about doing that using Java?

11条回答
泛滥B
2楼-- · 2018-12-31 14:40

Here's some tested code using Java's URL class. I'd recommend do a better job than I do here of handling the exceptions or passing them up the call stack, though.

public static void main(String[] args) {
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;

    try {
        url = new URL("http://stackoverflow.com/");
        is = url.openStream();  // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));

        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
         mue.printStackTrace();
    } catch (IOException ioe) {
         ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
}
查看更多
柔情千种
3楼-- · 2018-12-31 14:40

All the above mentioned approaches do not download the web page text as it looks in the browser. these days a lot of data is loaded into browsers through scripts in html pages. none of above mentioned techniques supports scripts, they just downloads the html text only. HTMLUNIT supports the javascripts. so if you are looking to download the web page text as it looks in the browser then you should use HTMLUNIT.

查看更多
余生请多指教
4楼-- · 2018-12-31 14:41

I used the actual answer to this post (url) and writing the output into a file.

package test;

import java.net.*;
import java.io.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
    try {
        URL oracle = new URL("http://www.fetagracollege.org");
        BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));

        String fileName = "D:\\a_01\\output.txt";

        PrintWriter writer = new PrintWriter(fileName, "UTF-8");
        OutputStream outputStream = new FileOutputStream(fileName);
        String inputLine;

        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            writer.println(inputLine);
        }
        in.close();
        } catch(Exception e) {

        }

    }
}
查看更多
看风景的人
5楼-- · 2018-12-31 14:44

Get help from this class it get code and filter some information.

public class MainActivity extends AppCompatActivity {

    EditText url;
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate( savedInstanceState );
        setContentView( R.layout.activity_main );

        url = ((EditText)findViewById( R.id.editText));
        DownloadCode obj = new DownloadCode();

        try {
            String des=" ";

            String tag1= "<div class=\"description\">";
            String l = obj.execute( "http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty" ).get();

            url.setText( l );
            url.setText( " " );

            String[] t1 = l.split(tag1);
            String[] t2 = t1[0].split( "</div>" );
            url.setText( t2[0] );

        }
        catch (Exception e)
        {
            Toast.makeText( this,e.toString(),Toast.LENGTH_SHORT ).show();
        }

    }
                                        // input, extrafunctionrunparallel, output
    class DownloadCode extends AsyncTask<String,Void,String>
    {
        @Override
        protected String doInBackground(String... WebAddress) // string of webAddress separate by ','
        {
            String htmlcontent = " ";
            try {
                URL url = new URL( WebAddress[0] );
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                int data;
                InputStreamReader reader = new InputStreamReader( input );

                data = reader.read();

                while (data != -1)
                {
                    char content = (char) data;
                    htmlcontent+=content;
                    data = reader.read();
                }
            }
            catch (Exception e)
            {
                Log.i("Status : ",e.toString());
            }
            return htmlcontent;
        }
    }
}
查看更多
皆成旧梦
6楼-- · 2018-12-31 14:49

Try using the jsoup library.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;


public class ParseHTML {

    public static void main(String args[]) throws IOException{
        Document doc = Jsoup.connect("https://www.wikipedia.org/").get();
        String text = doc.body().text();

        System.out.print(text);
    }
}

You can download the jsoup library here.

查看更多
人气声优
7楼-- · 2018-12-31 14:51

Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give very much control.

Personally I'd go with the Apache HTTPClient library.
Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components

查看更多
登录 后发表回答