I am using Jsoup to parse the given HTML content. After Jsoup.parse(), the HTML output has html, head and body tags added around the input. I just want to leave these out.
Sample Input:
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
Java code:
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLParse {
    public static void main(String args[]) throws IOException {
        try {
            File input = new File("/ab.html");
            String html = FileUtils.readFileToString(input, null);
            Document doc = Jsoup.parseBodyFragment(html);
            doc.outputSettings().prettyPrint(false);
            System.out.println(doc.html());
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Actual output:
<html><head></head><body><p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
</body></html>
Expected Output:
<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>
Please help.
You can try using the XML parser, but this doesn't always work because HTML is not always XML; it often has unterminated tags like <img> and <br>. It's better to stick with the HTML parser. You can rely on there being <html>, <head>, and <body> tags, and they are easy to discard. Just get your fragment of HTML by selecting the body tag and asking for its HTML. To get the expected output, print the inner HTML of the body element rather than that of the whole document.
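A minimal sketch of this approach, assuming the fragment is passed in as a string rather than read from /ab.html. The class name BodyFragmentExample and the helper innerBodyHtml are hypothetical; parseBodyFragment(), body() and html() are the Jsoup calls involved.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BodyFragmentExample {

    // Hypothetical helper: parse with the regular HTML parser, then return
    // only the markup that ended up inside <body>.
    static String innerBodyHtml(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        // Same setting as in the question: do not re-indent the output.
        doc.outputSettings().prettyPrint(false);
        // html() on an element returns its inner HTML, so the
        // <html>/<head>/<body> shell is left out.
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
        System.out.println(innerBodyHtml(html));
    }
}
```

Because the HTML parser is still used, unterminated tags such as <br> are handled the usual way; only the wrapper is dropped at output time.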
The cause:
parseBodyFragment(), like all the other parse() methods, uses an HTML parser by default, and an HTML parser always adds the HTML shell (<html>…</html>, <head>…</head>, etc.).
The solution:
Just don't use an HTML parser; use an XML parser instead ;-)
Replace that single line and your problem is solved.
Example:
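A minimal sketch of the XML-parser approach, assuming the line being replaced is the Jsoup.parseBodyFragment(html) call from the question and that the input is a string instead of a file. The class name XmlParserExample and the helper parseAsXml are hypothetical; Jsoup.parse(String, String, Parser) and Parser.xmlParser() are existing Jsoup calls, and the empty string is the base URI.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class XmlParserExample {

    // Hypothetical helper: parse the fragment with Jsoup's XML parser,
    // which does not wrap the input in <html>, <head> and <body>.
    static String parseAsXml(String fragment) {
        // Replaces: Document doc = Jsoup.parseBodyFragment(html);
        Document doc = Jsoup.parse(fragment, "", Parser.xmlParser());
        // Same setting as in the question: do not re-indent the output.
        doc.outputSettings().prettyPrint(false);
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<p><b>This <i>is</i></b> <i>my sentence</i> of text.</p>";
        System.out.println(parseAsXml(html));
    }
}
```

This should be fine for well-formed input like the sample. Markup that is not valid XML, such as an unclosed <br>, may be serialized differently from what the HTML parser would produce.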
Output: