Remove HTML tags from a String

2018-12-31 01:38发布

Is there a good way to remove HTML from a Java string? A simple regex like

 replaceAll("\\<.*?>","") 

will work, but things like &amp; wont be converted correctly and non-HTML between the two angle brackets will be removed (i.e. the .*? in the regex will disappear).

27条回答
何处买醉
2楼-- · 2018-12-31 01:57

One could also use Apache Tika for this purpose. By default it preserves whitespaces from the stripped html, which may be desired in certain situations:

InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata())
System.out.println(htmlContentHandler.getBodyText().trim())
查看更多
查无此人
3楼-- · 2018-12-31 01:59

Remove HTML tags from string. Somewhere we need to parse some string which is received by some responses like Httpresponse from the server.

So we need to parse it.

Here I will show how to remove html tags from string.

    // sample text with tags

    string str = "<html><head>sdfkashf sdf</head><body>sdfasdf</body></html>";



    // regex which match tags

    System.Text.RegularExpressions.Regex rx = new System.Text.RegularExpressions.Regex("<[^>]*>");



    // replace all matches with empty strin

    str = rx.Replace(str, "");



    //now str contains string without html tags
查看更多
有味是清欢
4楼-- · 2018-12-31 01:59

One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with "\n".

String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
    html = html.replace(tag, NEW_LINE_MARK+tag);
}

String text = Jsoup.parse(html).text();

text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");
查看更多
与风俱净
5楼-- · 2018-12-31 02:02

Alternatively, one can use HtmlCleaner:

private CharSequence removeHtmlFrom(String html) {
    return new HtmlCleaner().clean(html).getText();
}
查看更多
其实,你不懂
6楼-- · 2018-12-31 02:03

The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

  • It removes line breaks from the text
  • It converts text &lt;script&gt; into <script>

If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:

// breaks multi-level of escaping, preventing &amp;lt;script&amp;gt; to be rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; to be rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);

Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.

And here is a bunch of test cases (input to output):

{"regular string", "regular string"},
{"<a href=\"link\">A link</a>", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}

If you find a way to make it better, please let me know.

查看更多
笑指拈花
7楼-- · 2018-12-31 02:04

On Android, try this:

String result = Html.fromHtml(html).toString();
查看更多
登录 后发表回答