Is there a good way to remove HTML from a Java string? A simple regex like
replaceAll("\\<.*?>","")
will work, but entities like &amp;
won't be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever the .*?
in the regex matches will disappear).
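To make the two failure modes concrete, here is a small sketch of the naive approach. The class and method names are made up for illustration; only the regex comes from the question.

```java
public class NaiveStrip {
    // Removes everything between a '<' and the next '>', tag or not.
    public static String strip(String s) {
        return s.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        // The tags are removed, but the entity is left undecoded.
        System.out.println(strip("Tom &amp; <b>Jerry</b>")); // Tom &amp; Jerry
        // Plain text that happens to sit between '<' and '>' is lost.
        System.out.println(strip("if a < b and c > d"));     // if a  d
    }
}
```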
You might want to replace <br/> and </p> tags with newlines before stripping the HTML, to prevent it becoming an illegible mess, as Tim suggests.
The only way I can think of to remove HTML tags while leaving non-HTML text between angle brackets would be to check against a list of HTML tags. Something along these lines...
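The answer's original snippet is not preserved here. As a rough sketch of the idea, assuming a short, hypothetical list of tag names (the TAGS constant and the class name are illustrative, not from the answer), one could first turn <br/> and </p> into newlines and then remove only tags whose names are on the list:

```java
import java.util.regex.Pattern;

public class KnownTagStripper {
    // Hypothetical, deliberately short list of tag names; extend as needed.
    private static final String TAGS = "p|br|b|i|u|a|div|span|ul|ol|li|h[1-6]";

    // <br>, <br/>, <br /> and </p> become line breaks.
    private static final Pattern LINE_BREAKS =
            Pattern.compile("(?i)<br\\s*/?>|</p\\s*>");

    // Opening or closing tags whose name is on the list, with any attributes.
    private static final Pattern KNOWN_TAG =
            Pattern.compile("(?i)</?(?:" + TAGS + ")\\b[^>]*>");

    public static String strip(String html) {
        String text = LINE_BREAKS.matcher(html).replaceAll("\n");
        return KNOWN_TAG.matcher(text).replaceAll("");
    }
}
```

With this sketch, strip("<b>if x < y > z</b><br/>done") yields "if x < y > z" followed by a newline and "done": the comparison survives because "< y >" does not start with a listed tag name. Entities are still not decoded at this point.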
Then HTML-decode special characters such as &amp;. The result should not be considered sanitized.
Use an HTML parser instead of regex. This is dead simple with Jsoup.
Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.
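A minimal sketch of both Jsoup uses, assuming the jsoup library is on the classpath. The class and method names are made up for illustration. Recent jsoup releases name the list class Safelist, while older releases call it Whitelist, so adjust the import to your version.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupStrip {
    // Parses the HTML and returns only its text; entities are decoded.
    public static String toText(String html) {
        return Jsoup.parse(html).text();
    }

    // Keeps only <b>, <i> and <u>; all other tags are removed.
    // Returns an HTML fragment, not plain text.
    public static String keepBasicFormatting(String html) {
        return Jsoup.clean(html, new Safelist().addTags("b", "i", "u"));
    }
}
```

For example, toText("<p>Tom &amp; <b>Jerry</b></p>") returns "Tom & Jerry".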
Another way is to use the com.google.gdata.util.common.html.HtmlToText class.
This is not bulletproof code, though: when I run it on Wikipedia entries, I also get style information in the output. However, I believe it would be effective for small, simple jobs.
Here's a slightly more fleshed-out update that tries to handle some formatting for breaks and lists. I used Amaya's output as a guide.
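The code for this update is not preserved here. As a rough sketch of what break and list handling could look like, assuming the JDK's built-in Swing HTML parser (javax.swing.text.html), the callback below emits a newline for <br> and <p> and a "* " bullet for each <li>. The class name and the bullet style are illustrative choices, not the author's.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Html2TextSketch extends HTMLEditorKit.ParserCallback {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        out.append(data);
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.LI) {
            out.append("\n* ");   // one bullet per list item
        } else if (tag == HTML.Tag.P) {
            out.append("\n");     // paragraphs start on a new line
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.BR) {
            out.append("\n");     // <br> is an empty element, so it arrives here
        }
    }

    public static String convert(String html) {
        Html2TextSketch callback = new Html2TextSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from a StringReader.
            throw new UncheckedIOException(e);
        }
        return callback.out.toString().trim();
    }
}
```

Nested lists, numbered lists and tables are not handled; a fuller version would need to track the current list type and depth.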
You can simply write a method that chains multiple replaceAll() calls.
Use this link for the most common replacements you need: http://tunes.org/wiki/html_20special_20characters_20and_20symbols.html
It is simple but effective. I use this method first to remove the junk, but not the very first line, i.e. replaceAll("\\<.*?>", ""), and later I use specific keywords to search for indexes and then use the .substring(start, end) method to strip away unnecessary stuff. This is more robust, and you can pinpoint exactly what you need in the entire HTML page.
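The method itself is not preserved here. A minimal sketch of the chained replaceAll() approach might look like the following; the class name and the handful of entities are illustrative, and a real version would cover more of the entity table.

```java
public class SimpleHtmlCleaner {
    public static String clean(String html) {
        return html
                .replaceAll("<[^>]*>", "")   // drop anything that looks like a tag
                .replaceAll("&nbsp;", " ")
                .replaceAll("&lt;", "<")
                .replaceAll("&gt;", ">")
                .replaceAll("&quot;", "\"")
                // Decode &amp; last, so that "&amp;lt;" becomes "&lt;" rather than "<".
                .replaceAll("&amp;", "&");
    }
}
```

For example, clean("<p>Tom &amp; Jerry&nbsp;&lt;3</p>") returns "Tom & Jerry <3". Like any regex-based approach, this also removes non-HTML text that sits between angle brackets.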
HTML escaping is really hard to do right. I'd definitely suggest using library code for this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.