Jsoup remove ONLY html tags

Published 2019-09-14 04:25

Question:

What is the proper way to remove ONLY HTML tags (preserving all custom/unknown tags) with jsoup (NOT regex)?

Expected input:

<html>
  <customTag>
    <div> dsgfdgdgf </div>
  </customTag>
  <123456789/>
  <123>
  <html123/>
</html>

Expected output:

  <customTag>
     dsgfdgdgf
  </customTag>
  <123456789/>
  <123>
  <html123/>

I tried to use Cleaner with Whitelist.none(), but it also removes the custom tags.

Also I tried:

String str = Jsoup.parse(html).text();

But it also removes the custom tags.

This answer isn't an option for me, because the number of possible custom tags is unbounded.

Answer 1:

You might want to try something like this:

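import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Tags to strip; every other tag, including custom ones, is left alone.
String[] tags = new String[]{"html", "div"};
Document thing = Jsoup.parse("<html><customTag><div>dsgfdgdgf</div></customTag><123456789/><123><html123/></html>");
for (String tag : tags) {
    for (Element elem : thing.getElementsByTag(tag)) {
        // Move the element's children up into its parent, then drop the emptied element.
        elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
        elem.remove();
    }
}
System.out.println(thing.getElementsByTag("body").html());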

Please note that <123456789/> and <123> don't conform to the XML standard (a tag name cannot start with a digit), so they are treated as text and get escaped. Another downside is that you have to explicitly write down every tag you don't like (that is, all HTML tags), and it may be slow. I haven't measured how fast this runs.
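
The hand-written tag list is the part that needs the most upkeep. A minimal sketch of one way to keep it in a single place is shown here, using only the Java standard library. The class name HtmlTagNames, the method isStandardHtmlTag, and the deliberately partial list of names are assumptions for illustration; none of them are part of jsoup.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of jsoup): decides whether a tag name is a
// standard HTML element that should be stripped.
final class HtmlTagNames {

    // Partial list for illustration only; extend it with the remaining
    // HTML element names you want removed.
    private static final Set<String> STANDARD_TAGS = new HashSet<>(Arrays.asList(
            "html", "head", "body", "div", "span", "p", "a", "ul", "ol", "li",
            "table", "tr", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6"));

    private HtmlTagNames() {
    }

    // Exact, case-insensitive match, so "html123" does not count as "html".
    static boolean isStandardHtmlTag(String tagName) {
        return tagName != null
                && STANDARD_TAGS.contains(tagName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Print the decision for a few names taken from the question.
        for (String name : new String[]{"div", "DIV", "customTag", "html123"}) {
            System.out.println(name + " -> " + isStandardHtmlTag(name));
        }
    }
}
```

In the loop above, a check like this could stand in for the fixed tags array: walk the elements under the body and move the children up only for elements whose tagName() passes isStandardHtmlTag, so customTag and html123 are left alone. The lookup is a hash set membership test, so the cost of the check itself does not grow with the length of the list.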

MFG MiSt
