How to extract content from a <div> tag in Java

Posted 2019-04-08 21:43

Question:

I have a serious problem. I would like to extract the content from a tag such as:

<div class="main-content">
    <div class="sub-content">Sub content here</div>
      Main content here </div>

The output I would expect is:

Sub content here
Main content here

I've tried using regex, but the result isn't very good. Using:

Pattern.compile("<div>(\\S+)</div>");

returns everything before the first </div> tag.
So, could anyone help me, please?
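
To make the regex problem concrete, here is a minimal sketch using only java.util.regex. The class name DivRegexDemo, the helper captures and the second, looser pattern are illustrative assumptions and not part of the original question. It suggests two things: the literal "<div>" cannot match a tag that carries attributes, and even a more tolerant pattern stops at the first </div>, so nested divs are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivRegexDemo {

    // Returns capture group 1 of every match of the regex in the input.
    static List<String> captures(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<div class=\"main-content\">"
                + "<div class=\"sub-content\">Sub content here</div>"
                + "Main content here </div>";

        // The literal "<div>" never matches "<div class=...>", and \S+
        // would not match text containing spaces anyway: no matches at all.
        System.out.println(captures("<div>(\\S+)</div>", html));

        // A looser (hypothetical) pattern matches once, from the outer opening
        // tag to the first </div>; "Main content here" is never captured.
        System.out.println(captures("<div[^>]*>(.*?)</div>", html));
    }
}
```

The first call prints an empty list and the second prints a single entry that still contains the inner opening tag, which is roughly the "everything before the first </div>" behaviour described above.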

Answer 1:

I'd recommend avoiding regex for parsing HTML. You can easily do what you ask by using Jsoup:

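import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DivText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        Elements divs = document.select("div");
        // ownText() returns only the text directly inside each div, not its children's text
        for (Element div : divs) {
            System.out.println(div.ownText());
        }
    }
}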

In response to comment: if you want to put the content of the div elements into an array of Strings you can simply do:

    String[] divsTexts = new String[divs.size()];
    for (int i = 0; i < divs.size(); i++) {
        divsTexts[i] = divs.get(i).ownText();
    }

In response to comment: if you have nested elements and you want to get the own text of each element, then you can use the jQuery-style multiple selector syntax. Here's an example:

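import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NestedText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">" +
                "<p>a paragraph <b>with some bold text</b></p>" +
                "Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // "div, p, b" selects every element that matches any of the three selectors
        Elements elements = document.select("div, p, b");
        for (Element element : elements) {
            System.out.println(element.ownText());
        }
    }
}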

The code above will parse the following HTML:

<html>
<head />
<body>
<div class="main-content">
<div class="sub-content">
<p>a paragraph <b>with some bold text</b></p>
Sub content here</div>
Main content here</div>
</body>
</html>

and print the following output:

Main content here
Sub content here
a paragraph
with some bold text


Answer 2:

<div class="main-content" id="mainCon">
    <div class="sub-content" id="subCon">Sub content here</div>
 Main content here </div>

Given this markup, you can get the result you mentioned as follows (note that this is browser JavaScript rather than Java).

Use document.getElementById("mainCon").innerHTML. It returns "Main content here" together with the markup of the nested sub div, so you still have to parse that part yourself.

Similarly, for the sub div you can use the same snippet, i.e. document.getElementById("subCon").innerHTML
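
The parsing step that this answer leaves open could be sketched in Java as well. The example below is only an illustration under a strong assumption: the fragment must be well-formed XML, because it uses the JDK's built-in XML DOM parser rather than an HTML parser. The class name DivTextWalker and the helpers collectText and textsOf are hypothetical. It walks the tree and collects the non-empty text nodes in document order, which happens to be the order the question asked for.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DivTextWalker {

    // Depth-first walk: adds each trimmed, non-empty text node in document order.
    static void collectText(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Parses a well-formed XML fragment and returns its text nodes.
    static List<String> textsOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collectText(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            // Assumption: malformed input is reported as an unchecked exception.
            throw new IllegalArgumentException("fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String fragment = "<div class=\"main-content\" id=\"mainCon\">\n"
                + "    <div class=\"sub-content\" id=\"subCon\">Sub content here</div>\n"
                + " Main content here </div>";
        for (String text : textsOf(fragment)) {
            System.out.println(text);
        }
    }
}
```

For the fragment above this prints "Sub content here" followed by "Main content here". Real-world HTML is often not well-formed XML, so a tolerant HTML parser such as Jsoup (as in Answer 1) is likely the safer choice outside of small, controlled inputs.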