JSoup not showing all the html in Java (td and tr

Posted 2020-03-31 05:53

Question:

I'm having trouble getting all the html code under the tags. Here is my current code:

Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155/what-is-the-fastest-way-to-scrape-html-webpage-in-android").get();
Elements desc = document.select("tr");

System.out.println(desc.toString());

It's for that question, and I'm trying to get the text from the question's description. But I'm not getting certain tr or td tags, such as the ones for the question. Here is the td tag I'm trying to get:

<td class="postcell">

Under that tag is the actual post. Now when I print out what I'm actually getting, I'm getting a ton of empty td tags and some comments, but not the actual post.

 <tr id="comment-37956942" class="comment ">
 <td>
 <table>
 <tbody>
 <tr>
  <td class=" comment-score"> &nbsp;&nbsp; </td>
  <td> &nbsp; </td>
  </tr>
</tbody>
</table> </td>
 <td class="comment-text">
<div style="display: block;" class="comment-body">
 <span class="comment-copy">You shouldn't parse HTML with regexes: <a   href="http://blog.codinghorror.com/parsing-html-the-cthulhu-way/" rel="nofollow">blog.codinghorror.com/parsing-html-the-cthulhu-way</a></span> –&nbsp;
 ﹕    <a href="/users/25612/motob%c3%b3i" title="469 reputation" class="comment-user">motobói</a>

And it keeps on going with empty td and tr tags. I can't find the actual question. Anyone know why this is happening?

Essentially, I just want the text from the question's post, and I don't know how to get it, so it would be nice if someone could show me how to get the text.

Answer 1:

Jsoup is a parser, not a browser: it cannot execute JavaScript, so it never sees HTML that scripts generate after the page loads. When the content you need is produced that way, you typically have to retrieve it through a headless browser that includes a JavaScript engine. A popular library for this is Selenium WebDriver.

To determine whether the content you are trying to parse is generated on the server (static content) or on the client (dynamic content generated by JavaScript), do the following:

  1. Visit the page you want to parse
  2. Press Ctrl + U

These steps open a new tab showing the page source, which is the content that jsoup receives. If the content you need is not there, it is generated by JavaScript.
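The same check can be scripted. The following is a minimal sketch, assuming Java 11 or later (for java.net.http) and that the `postcell` class name from the question is a reasonable marker for the content you want. The class and method names are hypothetical. It downloads the raw response without running any scripts and reports whether the marker appears in it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RawSourceCheck {

    // True if the marker (e.g. a class name) occurs in the server-sent HTML.
    public static boolean containsMarker(String rawHtml, String marker) {
        return rawHtml != null && marker != null && rawHtml.contains(marker);
    }

    // Fetches the page body as text; no JavaScript is executed.
    public static String fetchRawHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // follow http -> https redirects
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String url = "http://stackoverflow.com/questions/2971155/what-is-the-fastest-way-to-scrape-html-webpage-in-android";
        String html = fetchRawHtml(url);
        System.out.println(containsMarker(html, "postcell")
                ? "Marker found: the content is in the static HTML."
                : "Marker missing: the content is probably generated by JavaScript.");
    }
}
```

A miss can also mean the server answered this client differently than it answers your browser, so compare the result against what Ctrl + U shows.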

Follow the steps and search for the content. If it's there but jsoup still cannot find it, the site most likely treats your request as coming from a bot or a mobile device. Try setting the userAgent to that of a desktop browser and see what happens.

Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155/what-is-the-fastest-way-to-scrape-html-webpage-in-android").userAgent("USER_AGENT_HERE").get();
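Once the post is in the HTML that jsoup receives, a narrower selector gets at its text directly. This is a minimal sketch, assuming the `td.postcell` element from the question still wraps the post body (the page markup may have changed since) and jsoup 1.11.1 or later for `selectFirst`. The class name `PostText`, the method name `extractPostText`, and the user agent string are placeholders, not part of jsoup.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PostText {

    // Returns the visible text of the first td.postcell, or "" if there is none.
    public static String extractPostText(Document document) {
        Element postCell = document.selectFirst("td.postcell");
        return postCell == null ? "" : postCell.text();
    }

    public static void main(String[] args) throws IOException {
        Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155/what-is-the-fastest-way-to-scrape-html-webpage-in-android")
                .userAgent("USER_AGENT_HERE") // placeholder: use your desktop browser's user agent string
                .get();
        System.out.println(extractPostText(document));
    }
}
```

Selecting `td.postcell` instead of every `tr` avoids wading through the comment rows, and `text()` strips the tags so only the post's wording is printed.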

Most importantly, when a site exposes an API for extracting information programmatically, it's better to just use that. Stack Overflow has an API available.
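As a rough sketch of the API route, the example below requests the same question from the Stack Exchange API using only the Java 11+ standard library. The API version (2.3), the built-in `withbody` filter, and the gzip handling are assumptions to verify against the current API documentation. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class QuestionApi {

    // Builds the request URL for one question ID (assumed API version: 2.3).
    public static String questionUrl(long questionId) {
        return "https://api.stackexchange.com/2.3/questions/" + questionId
                + "?site=stackoverflow&filter=withbody";
    }

    // Downloads the JSON response as text, decompressing it when the server gzips it.
    public static String fetchJson(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept-Encoding", "gzip")
                .GET()
                .build();
        HttpResponse<InputStream> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofInputStream());
        boolean gzipped = response.headers().firstValue("Content-Encoding")
                .map(value -> value.equalsIgnoreCase("gzip"))
                .orElse(false);
        try (InputStream in = gzipped ? new GZIPInputStream(response.body()) : response.body()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // 2971155 is the question ID taken from the URL in the question.
        System.out.println(fetchJson(questionUrl(2971155L)));
    }
}
```

If those assumptions hold, the post arrives as HTML in the `body` field of the first entry under `items`. A JSON library can pull that field out, and `Jsoup.parse(body).text()` then gives the plain text.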