Is there a good way to remove HTML from a Java string? A simple regex like replaceAll("\\<.*?>","") will work, but things like &amp; won't be converted correctly, and non-HTML between the two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
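To make those two limitations concrete, here is a small sketch that wraps the regex from the question in a helper. The class and method names (RegexStripDemo, stripTags) are made up for illustration; only the replaceAll call comes from the question.

```java
// Hypothetical demo class; only the replaceAll call is taken from the question.
final class RegexStripDemo {

    // Removes everything from a '<' up to the nearest following '>' (reluctant match).
    // Note that '.' does not match line breaks by default, so a tag that spans
    // several lines is left untouched.
    static String stripTags(String input) {
        return input.replaceAll("\\<.*?>", "");
    }
}
```

With this helper, stripTags("<b>hey!</b>") yields hey!, but stripTags("a < b or b > c") also swallows the text between the two comparison signs, and stripTags("Tom &amp; Jerry") leaves the entity undecoded.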
I think that the simplest way to filter the HTML tags is:
My 5 cents:
If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:
but you will run into issues if the user enters something malformed, like <bhey!</b>.
You can also check out JTidy, which will parse "dirty" HTML input and should give you a way to remove the tags, keeping the text.
The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to make sure to encode any remaining HTML special characters to keep your output safe.
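As a rough illustration of the escaping step, here is a minimal sketch that encodes the HTML special characters in a string. The class and method names (HtmlEscaper, escape) are hypothetical, and the set of characters covered is an assumption about what "HTML special characters" should include; a maintained library may be a better choice in production.

```java
// Hypothetical helper, not taken from any library mentioned in this thread.
final class HtmlEscaper {

    // Replaces the characters that have special meaning in HTML with entities.
    // Each input character is handled exactly once, so an '&' that belongs to
    // an entity we just emitted is never escaped a second time.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break; // optional, needed inside attribute values
                case '\'': out.append("&#39;");  break; // optional, needed inside attribute values
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because nothing is removed, malformed input such as <bhey!</b> is harmless here: it simply comes out as visible text.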
The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".
So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):
Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.
Ref: Remove HTML tags from a file to extract only the TEXT
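A minimal sketch of the HTMLEditorKit approach might look like the following. It relies on the Swing HTML parser that ships with the JDK (ParserDelegator and HTMLEditorKit.ParserCallback); the class name Html2Text and the static extract helper are assumptions made for this example, not necessarily what the referenced post uses.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical class name; collects the text nodes reported by the Swing HTML parser.
final class Html2Text extends HTMLEditorKit.ParserCallback {

    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data between tags.
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data);
    }

    // Parses the given HTML string and returns the collected text.
    static String extract(String html) {
        Html2Text callback = new Html2Text();
        try (Reader in = new StringReader(html)) {
            // The third argument tells the parser to ignore any charset directive.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }
}
```

Calling Html2Text.extract("<b>hey!</b>") should return the text without the surrounding tags. Text runs are appended with no separator, so adjacent blocks may run together unless you add your own spacing in handleText.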
I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:
instead of this: