Remove HTML tags from a String - Answers, page 3

Is there a good way to remove HTML from a Java string? A simple regex like

 replaceAll("\\<.*?>","")

will work, but entities such as &amp; won't be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
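To make the two problems concrete, here is a minimal sketch of the regex approach using only java.util.regex. The class name NaiveTagStripper and the method strip are hypothetical names chosen for illustration; the sample inputs are made up.

```java
import java.util.regex.Pattern;

final class NaiveTagStripper {
    // The lazy pattern from the question: '<', as few characters as possible, then '>'.
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Removes everything the pattern matches; nothing is decoded.
    static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Tags are removed as intended.
        System.out.println(strip("<p>Hello <b>world</b></p>"));
        // Entities are left untouched rather than converted to characters.
        System.out.println(strip("Fish &amp; chips"));
        // Plain text that happens to sit between '<' and '>' is swallowed.
        System.out.println(strip("if a < b and c > d then stop"));
    }
}
```

The third call loses the text " b and c ", which is exactly the non-HTML content the question worries about.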

Tags: java html parsing

27 answers

旧时光的记忆

#2 · 2018-12-31 01:52

Use Html.fromHtml.

The supported HTML tags are:

<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>,
<div align="…">, <em>, <font size="…" color="…" face="…">,
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>,
<i>, <p>, <small>,
<strike>, <strong>, <sub>, <sup>, <tt>, <u>

According to Android's official documentation, any <img> tags in the HTML are displayed as a generic replacement image, which your program can then go through and replace with real images.

An overload of Html.fromHtml takes an Html.ImageGetter and an Html.TagHandler as arguments, in addition to the text to parse.

Example

String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";

Then

Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Output

This is about me text that the user can put into their profile

一个人的天荒地老

#3 · 2018-12-31 01:52

This should work. Use this:

  text.replaceAll("<.*?>", " ")   // replaces every HTML tag with a space

and this:

  text.replaceAll("&.*?;", "")    // removes every entity that starts with "&" and ends with ";", such as &nbsp;, &amp; and &gt;
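Note that deleting entities outright also drops the characters they stand for (an &amp; in the input simply vanishes). As a rough sketch of an alternative, the snippet below strips tags the same way but decodes a handful of common entities in a single pass. The class RegexHtmlStripper, the method toText, and the tiny entity table are all assumptions made for illustration; real HTML defines far more entities, and mapping &nbsp; to an ordinary space is a simplification.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexHtmlStripper {
    private static final Pattern TAG = Pattern.compile("<.*?>");
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    // Deliberately small, illustrative table (hypothetical, not exhaustive).
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("nbsp", " "); // simplification: ordinary space instead of U+00A0
    }

    static String toText(String html) {
        // Step 1: replace every tag with a space, as in the answer above.
        String noTags = TAG.matcher(html).replaceAll(" ");

        // Step 2: decode known entities in one pass, so "&amp;lt;" becomes "&lt;", not "<".
        Matcher m = ENTITY.matcher(noTags);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = ENTITIES.get(m.group(1));
            // Unknown entities are kept verbatim instead of being deleted.
            String replacement = (decoded != null) ? decoded : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);

        // Step 3: collapse the runs of whitespace left behind by removed tags.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

For example, RegexHtmlStripper.toText("<b>Tom</b> &amp; Jerry") returns "Tom & Jerry". The pitfalls of matching tags with a regex still apply; only the entity handling differs.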

素衣白纱

#4 · 2018-12-31 01:52

To get formatted plain HTML text, you can do this:

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

String BR_ESCAPED = "&lt;br/&gt;";
// select() returns an Elements collection, not a single Element
Elements el=Jsoup.parse(html).select("body");
// append an escaped <br/> marker after each line break, paragraph and heading
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
// collapse three or more consecutive breaks into two
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formatted plain text instead, replace <br/> with \n and change the last line to:

nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");
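The whitespace-collapsing regex can be tried on its own, without jsoup. The sketch below assumes plain-text output, so runs of three or more line breaks (each optionally preceded by whitespace) are reduced to exactly two newlines. The class BlankLineCollapser and the method collapse are hypothetical names used only for this illustration.

```java
final class BlankLineCollapser {
    // Three or more "(optional whitespace + newline)" groups become one blank line.
    static String collapse(String text) {
        return text.replaceAll("(\\s*\n){3,}", "\n\n");
    }

    public static void main(String[] args) {
        // Four consecutive newlines are reduced to two.
        System.out.println(collapse("a\n\n\n\nb"));
        // Exactly two newlines are below the {3,} threshold and stay as they are.
        System.out.println(collapse("a\n\nb"));
    }
}
```

Because the quantifier is {3,}, a single blank line between paragraphs is preserved and only longer gaps are shortened.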

琉璃瓶的回忆

#5 · 2018-12-31 01:53

It sounds like you want to go from HTML to plain text.
If that is the case, look at www.htmlparser.org. Here is an example that strips all the tags out of the HTML file found at a URL.
It makes use of org.htmlparser.beans.StringBean.

import org.htmlparser.beans.StringBean;

public static String getUrlContentsAsText(String url) {
    StringBean stringBean = new StringBean();
    // fetch and parse the page at the given URL
    stringBean.setURL(url);
    // return the page content with all tags removed
    return stringBean.getStrings();
}

临风纵饮

#6 · 2018-12-31 01:54

If you're writing for Android you can do this...

android.text.Html.fromHtml(instruction).toString()

爱死公子算了

#7 · 2018-12-31 01:55

Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());
