Jsoup not downloading entire page

Posted 2019-07-04 02:45

Question:

The webpage is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm

I want to extract all the <tr class="tr_normal"> elements using Jsoup.

The code I am using is:

Document doc = Jsoup.connect(url).get();
Elements es = doc.getElementsByClass("tr_normal");
System.out.println(es.size());

But the size it reports (1350) is smaller than the actual number of rows (1452). I saved the page to my computer, deleted some <tr> elements, and ran the same code against that copy; the count was then correct. Is the page so large that jsoup can't read all of the elements?

What is going on? Thanks!

Answer 1:

The problem is in jsoup's internal HTTP connection handling, not in the selector engine. Specifically, the cause is the default maxBodySize of Jsoup.Connection: the response body is cut off at that limit, so the rows beyond it never reach the parser. Raise the limit with maxBodySize(...), or pass 0 for unlimited size.

Alternatively, you can fetch the page with a separate HTTP client such as Apache HttpClient (http://hc.apache.org/) and pass the stream to Jsoup.parse. I have kept the HttpClient code in the sample for comparison.

Output of the program:

  • load from file= 1452
  • load from http client= 1452
  • load from jsoup connect= 1350
  • load from jsoup connect using maxBodySize= 1452

    package test;
    
    import java.io.IOException;
    import java.io.InputStream;
    
    import org.apache.http.HttpResponse;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.HttpClientBuilder;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    public class TestJsoup {
    
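        /**
         * Counts the tr_normal rows found when the same page is loaded four ways:
         * from a local copy, via HttpClient, via Jsoup.connect with the default
         * body limit, and via Jsoup.connect with a raised maxBodySize.
         *
         * @param args unused
         * @throws IOException if the page or the local copy cannot be read
         */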
        public static void main(String[] args) throws IOException {
            Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF8", "");
            Elements es = doc.getElementsByClass("tr_normal");
            System.out.println("load from file= " + es.size());
    
            doc = Jsoup.parse(loadContentByHttpClient(), "UTF8", "");
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from http client= " + es.size());
    
            String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                    + "/stockcode/eisdeqty_pf.htm";
            doc = Jsoup.connect(url).get();
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from jsoup connect= " + es.size());
    
            // Roughly 2 MB (the default is 1 MB); pass 0 for unlimited size.
            int maxBodySize = 2048000;
            doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
            es = doc.getElementsByClass("tr_normal");
            System.out.println("load from jsoup connect using maxBodySize= " + es.size());
        }
    
        public static InputStream loadContentByHttpClient()
                throws ClientProtocolException, IOException {
            String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                    + "/stockcode/eisdeqty_pf.htm";
            HttpClient client = HttpClientBuilder.create().build();
            HttpGet request = new HttpGet(url);
            HttpResponse response = client.execute(request);
            return response.getEntity().getContent();
        }
    
        public static InputStream loadContentFromClasspath()
                throws ClientProtocolException, IOException {
            return TestJsoup.class.getClassLoader().getResourceAsStream(
                    "eisdeqty_pf.htm");
        }
    
    }
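
To see why a body size limit lowers the element count, here is a minimal sketch that uses only the JDK. It is not jsoup's implementation. The class BodyCapDemo, its helper methods, the fake page, and the 20000-byte cap are all made up for illustration. It reads the same bytes once with a cap and once without, then counts the rows that arrived.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates a capped download, it does not call jsoup.
public class BodyCapDemo {

    // Builds a fake page with the given number of tr_normal rows (made-up data).
    static byte[] buildPage(int rows) {
        StringBuilder sb = new StringBuilder("<table>\n");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>x</td></tr>\n");
        }
        sb.append("</table>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Reads at most maxBytes from the stream; 0 means "no limit".
    static byte[] readCapped(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (maxBytes == 0 || out.size() < maxBytes) {
                int want = (maxBytes == 0)
                        ? buf.length
                        : Math.min(buf.length, maxBytes - out.size());
                int n = in.read(buf, 0, want);
                if (n == -1) {
                    break; // end of stream
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Counts complete class="tr_normal" attributes in the bytes that arrived.
    static int countRows(byte[] body) {
        String text = new String(body, StandardCharsets.UTF_8);
        String needle = "class=\"tr_normal\"";
        int count = 0;
        int from = text.indexOf(needle);
        while (from != -1) {
            count++;
            from = text.indexOf(needle, from + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] page = buildPage(1000);
        int capped = countRows(readCapped(new ByteArrayInputStream(page), 20000));
        int full = countRows(readCapped(new ByteArrayInputStream(page), 0));
        System.out.println("capped at 20000 bytes= " + capped);
        System.out.println("no limit= " + full);
    }
}
```

The capped read stops partway through the table, so everything after the cut is simply missing from the count. That matches the 1350 versus 1452 result above: the parser handles whatever it receives, and the fix is to let the whole body through.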