Jsoup not downloading entire page

The webpage is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm

I want to extract all the <tr class="tr_normal"> elements using Jsoup.

The code I am using is:

Document doc = Jsoup.connect(url).get();
Elements es = doc.getElementsByClass("tr_normal");
System.out.println(es.size());

But the reported size (1350) is smaller than the actual count (1452). I copied the page to my computer and deleted some <tr> elements; running the same code on that copy gives the correct count. It looks as if there are too many elements for jsoup to read them all?

So what is happening? Thanks!

Tags: java html http web jsoup

1 answer

我欲成王，谁敢阻挡

Post #2 · 2019-07-04 02:55

There is nothing wrong with the selector engine. The issue is the default maxBodySize of Jsoup's Connection: once the response body exceeds that limit (1 MB in the jsoup version used here), the rest is cut off, so the rows past the cutoff never reach the parser. Raise the limit with maxBodySize(...), or pass 0 for unlimited size.

An earlier version of this answer blamed jsoup's internal HTTP connection handling and recommended replacing it with Apache HttpClient (http://hc.apache.org/). That diagnosis was wrong, but the HttpClient code is kept below as a sample.

Output of the program:

load from file= 1452
load from http client= 1452
load from jsoup connect= 1350

load from jsoup connect using maxBodySize= 1452

package test;

import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestJsoup {

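    /**
     * Counts the tr_normal rows four ways: from a local copy of the page,
     * via Apache HttpClient, via Jsoup.connect with the default body limit,
     * and via Jsoup.connect with a raised maxBodySize.
     *
     * @param args unused
     * @throws IOException if the page cannot be read
     */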
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF-8", "");
        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("load from file= " + es.size());

        doc = Jsoup.parse(loadContentByHttpClient(), "UTF-8", "");
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from http client= " + es.size());

        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        doc = Jsoup.connect(url).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect= " + es.size());

        // About 2 MB; the default in the jsoup version used here is 1 MB.
        // Pass 0 for unlimited size.
        int maxBodySize = 2048000;
        doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect using maxBodySize= " + es.size());
    }

    public static InputStream loadContentByHttpClient()
            throws ClientProtocolException, IOException {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);
        return response.getEntity().getContent();
    }

    // Reads the saved copy of the page from the classpath; no HTTP involved.
    public static InputStream loadContentFromClasspath() {
        return TestJsoup.class.getClassLoader().getResourceAsStream(
                "eisdeqty_pf.htm");
    }

}
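
To make the effect easier to see without hitting the network, here is a minimal, standard-library-only sketch of what a body size cap does to a row count. It is not jsoup's actual implementation: the class BodyCapDemo and the methods readCapped, countOccurrences and countRows are hypothetical names invented for this illustration, and the HTML is generated in memory. The only thing borrowed from the answer above is the convention that a cap of 0 means unlimited.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates a body size cap, it does not call jsoup.
public class BodyCapDemo {

    // Reads at most maxBytes from the stream; 0 means no limit.
    static byte[] readCapped(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[64];
        int remaining = maxBytes;
        while (true) {
            int want = (maxBytes == 0) ? buf.length : Math.min(buf.length, remaining);
            if (want == 0) {
                break; // cap reached: the rest of the body is silently dropped
            }
            int n = in.read(buf, 0, want);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    // Counts non-overlapping occurrences of needle in text.
    static int countOccurrences(String text, String needle) {
        int count = 0;
        int i = text.indexOf(needle);
        while (i != -1) {
            count++;
            i = text.indexOf(needle, i + needle.length());
        }
        return count;
    }

    // Builds a table with the given number of tr_normal rows, reads it back
    // through the cap, and counts the rows that survived.
    static int countRows(int rows, int maxBytes) {
        StringBuilder html = new StringBuilder("<table>");
        for (int i = 0; i < rows; i++) {
            html.append("<tr class=\"tr_normal\"><td>").append(i).append("</td></tr>");
        }
        html.append("</table>");
        byte[] body = html.toString().getBytes(StandardCharsets.UTF_8);
        try {
            byte[] kept = readCapped(new ByteArrayInputStream(body), maxBytes);
            String text = new String(kept, StandardCharsets.UTF_8);
            return countOccurrences(text, "class=\"tr_normal\"");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("capped at 1000 bytes = " + countRows(100, 1000));
        System.out.println("unlimited (0) = " + countRows(100, 0));
    }
}
```

The capped run reports fewer rows than were generated and raises no error, which matches the symptom in the question: the parser only ever sees the bytes that arrived before the cutoff.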
