What is the proper way to remove ONLY HTML tags (preserving all custom/unknown tags) with Jsoup (NOT regex)?
Expected input:
<html>
<customTag>
<div> dsgfdgdgf </div>
</customTag>
<123456789/>
<123>
<html123/>
</html>
Expected output:
<customTag>
dsgfdgdgf
</customTag>
<123456789/>
<123>
<html123/>
I tried using Cleaner with Whitelist.none(), but it also removes the custom tags.
I also tried:
String str = Jsoup.parse(html).text();
But that removes the custom tags as well.
This answer doesn't work for me, because the number of possible custom tags is unbounded.
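You might want to try something like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Tags to strip; every unwanted HTML tag has to be listed here explicitly.
String[] tags = new String[]{"html", "div"};
Document thing = Jsoup.parse("<html><customTag><div>dsgfdgdgf</div></customTag><123456789/><123><html123/></html>");
for (String tag : tags) {
    for (Element elem : thing.getElementsByTag(tag)) {
        // Move the element's children up into its parent, then drop the emptied element.
        elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
        elem.remove();
    }
}
System.out.println(thing.getElementsByTag("body").html());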
Please note that <123456789/> and <123> don't conform to the XML standard (a tag name cannot start with a digit), so they get escaped in the output. Another downside is that you have to list every tag you want removed explicitly (that is, all HTML tags), and it may be slow. I haven't measured how fast this runs.
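As a rough sketch of the same idea in a single pass, the following walks the parsed body once and unwraps every element whose name appears in a set of known HTML tags, using jsoup's unwrap() to move the children up into the parent. The class name HtmlTagStripper, the method stripKnownTags, and the short HTML_TAGS list are hypothetical and only illustrative; a real list would have to cover every HTML tag you want gone. Also keep in mind that jsoup's HTML parser lowercases tag names by default, so customTag comes back as customtag.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class HtmlTagStripper {
    // Assumption: a deliberately short, illustrative list of known HTML tag names.
    // Extend it with every HTML tag that should be stripped.
    private static final Set<String> HTML_TAGS = new HashSet<>(Arrays.asList(
            "div", "span", "p", "a", "b", "i", "ul", "ol", "li", "table", "tr", "td"));

    // Parses the input as HTML and returns the body markup with the listed tags unwrapped.
    static String stripKnownTags(String input) {
        Document doc = Jsoup.parse(input);
        // select("*") returns a snapshot list, so modifying the tree while looping is safe.
        for (Element elem : doc.body().select("*")) {
            // tagName() is lowercase here because the HTML parser normalizes names.
            if (HTML_TAGS.contains(elem.tagName())) {
                // Replace the element with its own children; unknown tags are left alone.
                elem.unwrap();
            }
        }
        return doc.body().html();
    }
}
```

Calling HtmlTagStripper.stripKnownTags("<customTag><div>dsgfdgdgf</div></customTag>") should keep the customtag element and its text while dropping the div. Reading the result from doc.body() means the html, head and body wrappers that the parser adds never show up in the output, so they do not need to be in the list.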
Kind regards,
MiSt