Jsoup remove ONLY html tags

Posted 2019-09-14 04:25

Question:

What is the proper way to remove ONLY HTML tags (preserving all custom/unknown tags) with Jsoup (NOT regex)?

Expected input:

<html>
  <customTag>
    <div> dsgfdgdgf </div>
  </customTag>
  <123456789/>
  <123>
  <html123/>
</html>

Expected output:

  <customTag>
     dsgfdgdgf
  </customTag>
  <123456789/>
  <123>
  <html123/>

I tried using Cleaner with Whitelist.none(), but that removes the custom tags as well.

I also tried:

String str = Jsoup.parse(html).text();

But that removes the custom tags as well.

This answer doesn't work for me, because the number of possible custom tags is unlimited.

Answer 1:

You might want to try something like this:

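import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Tags to strip; their children are kept and moved up one level.
String[] tags = {"html", "div"};
Document doc = Jsoup.parse("<html><customTag><div>dsgfdgdgf</div></customTag><123456789/><123><html123/></html>");
for (String tag : tags) {
    // getElementsByTag returns a snapshot list, so changing the tree inside the loop is safe.
    for (Element elem : doc.getElementsByTag(tag)) {
        // Move the children into the parent at this element's position, then drop the element.
        elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
        elem.remove();
    }
}
System.out.println(doc.getElementsByTag("body").html());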

Please note that <123456789/> and <123> are not valid tag names under the XML standard (a name cannot start with a digit), so they are treated as text and get escaped. Another downside is that you have to list explicitly every tag you want removed (that is, all HTML tags), and it may be slow; I haven't measured how fast this runs.
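One possible way to avoid listing every HTML tag by hand is to ask Jsoup itself whether a tag is in its built-in table of known tags, and unwrap only those elements. The sketch below is an assumption about how that could look, not something measured or confirmed in this thread: the class name KnownTagStripper and the method stripKnownHtmlTags are hypothetical, while Tag.isKnownTag() and Node.unwrap() are existing Jsoup calls. "Known" here means whatever Jsoup's own tag table contains, which may not match every definition of "HTML tag".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, for illustration only.
class KnownTagStripper {

    // Unwraps every element under <body> whose tag Jsoup recognises as a
    // known tag; elements with unknown (custom) tags are left in place.
    static String stripKnownHtmlTags(String html) {
        Document doc = Jsoup.parse(html);
        Element body = doc.body();
        // getAllElements() returns a snapshot list that includes body itself,
        // so the tree can be modified inside the loop.
        for (Element el : body.getAllElements()) {
            if (el != body && el.tag().isKnownTag()) {
                el.unwrap(); // remove the element but keep its children
            }
        }
        return body.html();
    }

    public static void main(String[] args) {
        String html = "<html><customTag><div> dsgfdgdgf </div></customTag><html123/></html>";
        System.out.println(stripKnownHtmlTags(html));
    }
}
```

With the default HTML parser, tag names are normalised to lower case, so customTag comes back as customtag. Names such as <123> are still treated as text and escaped, because they never become elements in the first place.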

Regards, MiSt