I have a serious problem.
I would like to extract the content from tags such as:
<div class="main-content">
<div class="sub-content">Sub content here</div>
Main content here </div>
The output I would expect is:
Sub content here
Main content here
I've tried using a regex, but the results weren't good.
Using:
Pattern.compile("<div>(\\S+)</div>");
returns everything before the first </div> tag.
So, could anyone help me, please?
I'd recommend avoiding regex for parsing HTML. You can easily do what you ask by using Jsoup:
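import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DivText {
    public static void main(String[] args) {
        String html = "<html><head/><body><div class=\"main-content\">" +
                "<div class=\"sub-content\">Sub content here</div>" +
                "Main content here </div></body></html>";
        Document document = Jsoup.parse(html);
        // ownText() returns only the text directly inside each div, not its children's.
        Elements divs = document.select("div");
        for (Element div : divs) {
            System.out.println(div.ownText());
        }
    }
}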
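To see why the regex from the question struggles, here is a small sketch that uses only java.util.regex. The class name RegexDivDemo, the helper firstGroup, and the second, loosened pattern are illustrative assumptions, not something from the question. It suggests that the original pattern cannot match at all, because a literal <div> never matches a tag that carries attributes, and that a loosened pattern pairs the outer opening tag with the inner closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDivDemo {
    // Sample markup from the question, joined onto one line.
    static final String HTML =
            "<div class=\"main-content\">" +
            "<div class=\"sub-content\">Sub content here</div>" +
            "Main content here </div>";

    // Hypothetical helper: first capture group of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Pattern from the question: "<div>" is literal, so tags with attributes never match.
        System.out.println(firstGroup("<div>(\\S+)</div>", HTML));
        // prints: null

        // Loosened pattern (an assumption for illustration): the match starts at the
        // outer div and stops at the first closing tag, which belongs to the inner div.
        System.out.println(firstGroup("<div[^>]*>(.*?)</div>", HTML));
        // prints: <div class="sub-content">Sub content here
    }
}
```

A regex has no notion of which closing tag belongs to which opening tag, which is why a parser is the safer choice for nested elements.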
In response to comment: if you want to put the content of the div elements into an array of Strings, you can simply do:
String[] divsTexts = new String[divs.size()];
for (int i = 0; i < divs.size(); i++) {
    divsTexts[i] = divs.get(i).ownText();
}
In response to comment: if you have nested elements and you want to get the own text of each element, then you can use the jQuery-style multiple selector syntax. Here's an example:
public static void main(String[] args) {
    String html = "<html><head/><body><div class=\"main-content\">" +
            "<div class=\"sub-content\">" +
            "<p>a paragraph <b>with some bold text</b></p>" +
            "Sub content here</div>" +
            "Main content here </div></body></html>";
    Document document = Jsoup.parse(html);
    Elements divs = document.select("div, p, b");
    for (Element div : divs) {
        System.out.println(div.ownText());
    }
}
The code above will parse the following HTML:
<html>
<head />
<body>
<div class="main-content">
<div class="sub-content">
<p>a paragraph <b>with some bold text</b></p>
Sub content here</div>
Main content here</div>
</body>
</html>
and print the following output:
Main content here
Sub content here
a paragraph
with some bold text
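For readers who want to see what "own text" means without pulling in a library, here is a rough sketch using the JDK's built-in XML DOM. It is not how Jsoup implements ownText(), and it only works when the markup happens to be well-formed XML, which real-world HTML often is not. The class name OwnTextSketch and both method names are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OwnTextSketch {
    // Concatenates only the text nodes that are direct children of the given node.
    static String ownText(Node element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Parses well-formed markup and returns the own text of every div, in document order.
    static List<String> divOwnTexts(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList divs = doc.getElementsByTagName("div");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                texts.add(ownText(divs.item(i)));
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div class=\"main-content\">"
                + "<div class=\"sub-content\"><p>a paragraph <b>with some bold text</b></p>"
                + "Sub content here</div>"
                + "Main content here </div>";
        System.out.println(divOwnTexts(xml));
        // prints: [Main content here, Sub content here]
    }
}
```

The text inside the p and b elements is skipped for both divs because it belongs to descendants, not to the divs themselves.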
<div class="main-content" id="mainCon">
<div class="sub-content" id="subCon">Sub content here</div>
Main content here </div>
Given this markup, if you want to get the result you mentioned, use document.getElementById("mainCon").innerHTML in browser JavaScript.
It returns "Main content here" together with the markup of the sub div, so you still have to parse that part out.
Similarly, for the sub div you can use the same snippet, i.e. document.getElementById("subCon").innerHTML