Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\<.*?>","")
will work, but things like &amp;
won't be converted correctly, and non-HTML between two angle brackets will be removed (i.e. whatever the .*?
in the regex matches will disappear).
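Both drawbacks of the regex are easy to reproduce with the standard library alone. The sketch below is only an illustration; the class and method names (RegexStripDemo, stripTags) are made up for this example and the inputs are arbitrary.

```java
final class RegexStripDemo {
    // The regex from the question: drops everything from '<' up to the next '>'.
    static String stripTags(String input) {
        return input.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        // The tags are removed, but the entity &amp; is left undecoded.
        System.out.println(stripTags("<b>Tom &amp; Jerry</b>"));
        // Not HTML at all, yet "< b and c >" is treated as a tag and deleted.
        System.out.println(stripTags("if a < b and c > d"));
    }
}
```

The first call leaves the literal text &amp; in the result, and the second one silently loses the comparison between the two angle brackets.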
One could also use Apache Tika for this purpose. By default it preserves whitespace from the stripped HTML, which may be desired in certain situations:
Removing HTML tags from a string: sometimes we need to parse a string received in a response from the server, such as an HttpResponse.
Here I will show how to remove the HTML tags from such a string.
One way to retain new-line info with JSoup is to precede every tag that implies a new line with some dummy string, run JSoup, and then replace the dummy string with "\n".
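A minimal sketch of that placeholder trick, under a few assumptions: the marker string, the class name LineBreakPreserver and the choice of <br> and </p> as the "new line tags" are all hypothetical, and the actual tag stripper is passed in as a function so the example compiles without JSoup on the classpath.

```java
import java.util.function.UnaryOperator;

final class LineBreakPreserver {
    // Hypothetical marker; it must not occur anywhere in the real input.
    private static final String MARKER = "__NEWLINE__";

    static String stripKeepingLineBreaks(String html, UnaryOperator<String> stripper) {
        // Put the marker in front of every <br>, <br/> and </p> tag (case-insensitive).
        String marked = html.replaceAll("(?i)(<br\\s*/?>|</p>)", MARKER + "$1");
        // Run the tag stripper; the marker is plain text, so it survives.
        String text = stripper.apply(marked);
        // Turn each surviving marker into a real line break.
        return text.replace(MARKER, "\n");
    }

    public static void main(String[] args) {
        // Stand-in stripper for the demo. With JSoup available one would pass
        // s -> Jsoup.parse(s).text() instead (assumption: jsoup is on the classpath).
        UnaryOperator<String> regexStripper = s -> s.replaceAll("<[^>]*>", "");
        System.out.print(stripKeepingLineBreaks("<p>one<br>two</p><p>three</p>", regexStripper));
    }
}
```

Taking the stripper as a parameter keeps the placeholder logic independent of the library that does the stripping, so the same wrapper could be tried with other strippers as well.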
Alternatively, one can use HtmlCleaner:
The accepted answer of simply calling
Jsoup.parse(html).text()
has two potential issues (with JSoup 1.7.3): it removes line breaks from the text, and it converts the escaped text &lt;script&gt; into <script>.
If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:
Note that the last step is there because I need to use the output as plain text. If you only need HTML output, you should be able to remove it.
And here is a bunch of test cases (input to output):
If you find a way to make it better, please let me know.
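To illustrate the XSS concern raised in this answer without any library, here is a small sketch. It is not the JSoup and StringEscapeUtils solution described above; both helper methods are hypothetical and handle only a handful of entities, purely to show why decoded text has to be escaped again before it is written back into an HTML page.

```java
final class EntityRoundTrip {
    // Hypothetical helper: decodes only the three entities needed for this demo.
    // &amp; is decoded last so that "&amp;lt;" becomes "&lt;" and not "<".
    static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
    }

    // Hypothetical helper: escapes the characters that are special in HTML text.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String harmless = "&lt;script&gt;alert(1)&lt;/script&gt;";
        // Decoding turns the harmless escaped text into a live tag.
        String decoded = decodeBasicEntities(harmless);
        System.out.println(decoded);
        // Escaping again makes it safe to embed in an HTML page.
        System.out.println(escapeHtml(decoded));
    }
}
```

The decoded form is fine for plain-text use, but it should go through something like escapeHtml before being placed into HTML again.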
On Android, try this: