Extracting information from a page with Jsoup

Posted 2019-08-02 08:42

I'm trying to extract information from this page with the Jsoup library, but I cannot grab the text that follows a JavaScript element.
I inspected each of the td elements on the page with Opera Dragonfly, and here is the result:

<td class="t_port">
      <script type="text/javascript">
      //<![CDATA[
        document.write(Socks^GrubMe^51959);
      //]]>
      </script>
     "1080
                "
    </td>

When I use the view-source function of any browser, it returns the same lines of code but without "1080", which is the information I'm looking for. I get the same result when I fetch this page with Jsoup. The JavaScript code in the other cells is more or less similar, like:

document.write(SmallBlind^NineBeforeZero^64881);

or

document.write(ProxyMoxy^DexterProxy^29182);

or something similar

 document.write(Defender^Agile^57721);


Given how this service works, I suppose the JavaScript code hides the information I need and loads it later dynamically, by editing the DOM and adding values like "1080". Any suggestions on how to grab this info?

P.S: Here is my code:

Document doc = Jsoup.connect(socks4URL).post();
Elements ips = doc.select("table.proxytbl td.t_ip");
for (Element e : ips) {
    System.out.println("e is " + e.text());
}
Elements ports = doc.select("table.proxytbl td.t_port");
for (Element e : ports) {
    System.out.println("port is " + e);
}

1 answer
冷血范
#2 · 2019-08-02 08:55

First

I suppose the site uses this technique precisely to discourage people like you from scraping its information. Having said that, I just assume you understand this and give up.

Second

This site does not load the port info via Ajax. It simply defines some global variables in a script tag and uses the bitwise XOR operator (^) to calculate the port number. To understand what is going on, you need to understand the XOR operator and find the little script that is included inline in the source (hint: the script tag inside the div with id="incontent"). Here is what I got, but the script may be dynamically generated, so it might differ from call to call:

<script type="text/javascript">
//<![CDATA[
  BigProxy = 13097;BigGoodProxy = 42249^BigProxy;GrubMe = BigGoodProxy^BigProxy;Defender = 16593^BigGoodProxy;Polymorth = 32164^60129;Xorg = Defender^BigProxy;DexterProxy = Defender^Defender;SmallBlind = 56306^22478;Agile = 7797^61126;Socks = BigProxy^SmallBlind;DontGrubMe = BigProxy^45134;Xinemara = 64225^38807;HttpSocks = Socks^BigGoodProxy;BigBlind = GrubMe^41530;NineBeforeZero = 8868^38743;SmallProxy = HttpSocks^Socks;ProxyMoxy = Polymorth^41915;
//]]>
</script>
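
To see why chained XOR is easy to undo, here is a minimal Java sketch that reproduces a few of the definitions above by hand. The class and constant names (XorDemo, BIG_PROXY, and so on) are only illustrative; the numbers are taken from the script I got, so they will differ if the site generates a new one.

```java
class XorDemo {
    // Values copied from the script above; they may change from call to call.
    static final int BIG_PROXY = 13097;
    static final int BIG_GOOD_PROXY = 42249 ^ BIG_PROXY;

    // XOR-ing with the same value twice cancels out: (42249 ^ x) ^ x == 42249.
    static final int GRUB_ME = BIG_GOOD_PROXY ^ BIG_PROXY;

    // Any value XOR-ed with itself is 0.
    static final int DEFENDER = 16593 ^ BIG_GOOD_PROXY;
    static final int DEXTER_PROXY = DEFENDER ^ DEFENDER;

    public static void main(String[] args) {
        System.out.println("GrubMe = " + GRUB_ME);           // 42249
        System.out.println("DexterProxy = " + DEXTER_PROXY); // 0
    }
}
```

Java's ^ on int behaves like JavaScript's ^ for values in this range, so the same expressions can be evaluated directly.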

Now you can parse the data and recreate variables with the same values. Just parse the port field and interpret the little XOR calculation. For example:

document.write(SmallBlind^BigProxy^47917);

According to the above script, SmallBlind = 35900 and BigProxy = 13097 (after evaluation!),

so the calculation is 35900 ^ 13097 ^ 47917 = 1080.
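
The steps above could be sketched in plain Java roughly as follows. This is only an illustration under some assumptions: the script contains nothing but assignments of the form name = a^b^...; each variable is defined before it is used, and the port cell contains a single document.write(...) call. The class and method names (PortDecoder, parseVariables, evalXor, decodePort) are made up for this example. Getting the script text and the port cell out of the page (for example with Jsoup) is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PortDecoder {
    // One assignment such as "Socks = BigProxy^SmallBlind;"
    private static final Pattern ASSIGNMENT =
            Pattern.compile("(\\w+)\\s*=\\s*([\\w^\\s]+?)\\s*;");
    // The argument of "document.write(...)"
    private static final Pattern WRITE_CALL =
            Pattern.compile("document\\.write\\(([^)]*)\\)");

    // Evaluates "a^b^c", where each operand is a number or a known variable.
    static int evalXor(String expr, Map<String, Integer> vars) {
        int result = 0;
        for (String token : expr.split("\\^")) {
            String t = token.trim();
            if (t.matches("\\d+")) {
                result ^= Integer.parseInt(t);
            } else if (vars.containsKey(t)) {
                result ^= vars.get(t);
            } else {
                throw new IllegalArgumentException("Unknown variable: " + t);
            }
        }
        return result;
    }

    // Recreates the global variables in the order the script defines them.
    static Map<String, Integer> parseVariables(String script) {
        Map<String, Integer> vars = new LinkedHashMap<>();
        Matcher m = ASSIGNMENT.matcher(script);
        while (m.find()) {
            vars.put(m.group(1), evalXor(m.group(2), vars));
        }
        return vars;
    }

    // Extracts the expression from document.write(...) and evaluates it.
    static int decodePort(String portScript, Map<String, Integer> vars) {
        Matcher m = WRITE_CALL.matcher(portScript);
        if (!m.find()) {
            throw new IllegalArgumentException("No document.write call found");
        }
        return evalXor(m.group(1), vars);
    }

    public static void main(String[] args) {
        // Shortened copy of the script above, enough for the example.
        String script = "BigProxy = 13097;SmallBlind = 56306^22478;";
        Map<String, Integer> vars = parseVariables(script);
        int port = decodePort("document.write(SmallBlind^BigProxy^47917);", vars);
        System.out.println("port is " + port); // port is 1080
    }
}
```

Because the variable script may change on every request, the variables and the port cells should be taken from the same fetched document.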

Third

Just subscribe to one of the many services that send you ready-to-use SOCKS proxy lists, if you need them so badly :)
