The webpage is: http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm
I want to extract all the <tr class="tr_normal">
elements using Jsoup.
The code I am using is:
```java
Document doc = Jsoup.connect(url).get();
Elements es = doc.getElementsByClass("tr_normal");
System.out.println(es.size());
```
But the size it prints (1350) is smaller than the actual number of rows on the page (1452).
I copied this page to my computer and deleted some <tr>
elements. Then I ran the same code and the count was correct. It looks as if there are too many elements for Jsoup to read them all?
So what's happened? Thanks!
The problem is in Jsoup's internal HTTP connection handling; nothing is wrong with the selector engine. I didn't dig into it deeply, but a library's own, proprietary way of handling HTTP connections can cause issues like this, so my original recommendation was to replace it with Apache HttpClient (http://hc.apache.org/) and hand the downloaded HTML to Jsoup only for parsing. If you can't add HttpClient as a dependency, you might want to check how the Jsoup source code handles the HTTP connection.

Update: the real issue is the default maxBodySize of Jsoup's Connection, which truncates the body of this large page before it is parsed. Please refer to the updated code below; I still keep the HttpClient code as a sample.

Output of the program: load from jsoup connect using maxBodySize = 1452
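Here is a minimal sketch of the fix, assuming a reasonably recent Jsoup version. By default Jsoup's Connection reads at most 1 MB of the response body, so the tail of this large page is silently dropped before parsing; maxBodySize(0) removes that limit. The class name and the timeout value are just illustrative choices, not something the page requires.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class StockCodeRows {
    public static void main(String[] args) throws Exception {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm";

        // maxBodySize(0) = unlimited body; the default 1 MB cap was truncating the page.
        Document doc = Jsoup.connect(url)
                .maxBodySize(0)
                .timeout(60_000) // generous timeout for the large download (illustrative value)
                .get();

        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("load from jsoup connect using maxBodySize = " + es.size());
    }
}
```

This version reports all 1452 rows, matching the output quoted above.

And the HttpClient variant kept as a sample, sketched here assuming Apache HttpClient 4.x on the classpath: it downloads the full body itself and hands the HTML to Jsoup only for parsing, so no Jsoup body-size limit applies.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class StockCodeRowsHttpClient {
    public static void main(String[] args) throws Exception {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm";

        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            // HttpClient returns the complete response body, with no size cap applied here.
            String html = EntityUtils.toString(response.getEntity(), "UTF-8");

            // Parse the already-downloaded HTML; only Jsoup's parser/selector engine is used.
            Document doc = Jsoup.parse(html, url);
            Elements es = doc.getElementsByClass("tr_normal");
            System.out.println("load from httpclient = " + es.size());
        }
    }
}
```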