Remove HTML tags from a String

Posted 2018-12-31 01:38

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "")

will work, but things like &amp; won't be converted correctly, and any non-HTML text between two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).
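
To make the two problems concrete, here is a minimal sketch of the naive approach. The class name RegexStripDemo and the method naiveStrip are made up for illustration, and the pattern <.*?> is only assumed to be the kind of regex meant here.

```java
final class RegexStripDemo {

    // Naive approach (assumed pattern): delete everything from a '<' to the next '>'.
    static String naiveStrip(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // The tags are removed, but the entity is left undecoded.
        System.out.println(naiveStrip("<p>Tom &amp; Jerry</p>"));  // Tom &amp; Jerry

        // Plain text containing '<' and '>' is eaten as if it were a tag
        // (note the two spaces left behind).
        System.out.println(naiveStrip("a < b or b > c"));          // a  c
    }
}
```

Both outputs are "wrong" for a reader who expects plain text, which is why the answers below either decode entities separately or hand the job to a real parser.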

Reply #2 · 2018-12-31 01:38

I think that the simplest way to filter out the HTML tags is:

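import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches anything that looks like a tag: '<', one or more characters, '>'
private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}

Reply #3 · 2018-12-31 01:38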

My 5 cents:

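// Replaces every "&amp;" with "&" by splitting and re-joining.
// Caveat: split() drops trailing empty strings, so a trailing "&amp;" is not
// handled correctly; yourString.replace("&amp;", "&") avoids that.
String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {

    for (int i = 0; i < temp.length; i++) {
        tmp += temp[i] + "&";
    }
    // Drop the extra "&" appended after the last piece
    yourString = tmp.substring(0, tmp.length() - 1);
}

Reply #4 · 2018-12-31 01:39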

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-than signs and HTML-encode ampersands (and optionally quotes), and you're fine. A modification to your code to implement the second option would be:


but you will run into issues if the user enters something malformed, like <bhey!</b>.

You can also check out JTidy, which parses "dirty" HTML input and should give you a way to remove the tags while keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you still need to encode any remaining HTML special characters to keep your output safe.
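
The two options described in this answer, and the advice to encode whatever is left after stripping, might look roughly like this. It is only a sketch: the class and method names (StripThenEscape, escapeHtml, stripTags) are invented for illustration, and the pattern <[^>]*> is an assumed stand-in for the "replace method", not necessarily the one the answer used.

```java
final class StripThenEscape {

    // Option 1: keep the markup visible by escaping it.
    // '&' is replaced first so the entities added afterwards are not double-escaped.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Option 2: drop anything that looks like a tag (assumed pattern).
    static String stripTags(String s) {
        return s.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>hey!</b>"));  // &lt;b&gt;hey!&lt;/b&gt;
        System.out.println(stripTags("<b>hey!</b>"));   // hey!

        // Malformed input: the whole string matches as one "tag", so the text is lost.
        System.out.println(stripTags("<bhey!</b>").isEmpty());  // true

        // Stripping alone is not enough: a stray '<' survives, so escape afterwards.
        System.out.println(escapeHtml(stripTags("<b>x</b> < y")));  // x &lt; y
    }
}
```

The last call shows the order that matters for safe output: strip first, then escape what remains.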

Reply #5 · 2018-12-31 01:39

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 * @author dan
 * @see <a href="">TagSoup</a>
 */
public class Html2Text2 implements ContentHandler {
    private StringBuffer sb;

    public Html2Text2() {
    }

    public void parse(String str) throws IOException, SAXException {
        XMLReader reader = new Parser();
        // Register this object so the SAX callbacks below receive the events
        reader.setContentHandler(this);
        sb = new StringBuffer();
        reader.parse(new InputSource(new StringReader(str)));
    }

    public String getText() {
        return sb.toString();
    }

    // Text content: copy the reported slice of the buffer
    public void characters(char[] ch, int start, int length)
        throws SAXException {
        for (int idx = 0; idx < length; idx++) {
            sb.append(ch[idx + start]);
        }
    }

    public void ignorableWhitespace(char[] ch, int start, int length)
        throws SAXException {
        sb.append(ch, start, length);
    }

    // The methods below do not contribute to the text
    public void endDocument() throws SAXException {
    }

    public void endElement(String uri, String localName, String qName)
        throws SAXException {
    }

    public void endPrefixMapping(String prefix) throws SAXException {
    }

    public void processingInstruction(String target, String data)
        throws SAXException {
    }

    public void setDocumentLocator(Locator locator) {
    }

    public void skippedEntity(String name) throws SAXException {
    }

    public void startDocument() throws SAXException {
    }

    public void startElement(String uri, String localName, String qName,
        Attributes atts) throws SAXException {
    }

    public void startPrefixMapping(String prefix, String uri)
        throws SAXException {
    }
}

Reply #6 · 2018-12-31 01:41

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    // Called by the parser for each run of text between tags
    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Ref: Remove HTML tags from a file to extract only the TEXT
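
Since the question is about a String rather than a file, the same parser can presumably be fed from a StringReader. The following is a sketch only: SwingHtmlText and toText are hypothetical names, and the exact whitespace in the result depends on how the Swing parser splits the text, so no output is claimed here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class SwingHtmlText {

    // Collects the text chunks reported by the Swing HTML parser.
    static String toText(String html) throws IOException {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data);
            }
        };
        // true = ignore any charset directive in the document
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(toText("<p>Hello <b>world</b></p>"));
    }
}
```

This needs the java.desktop module, which may be a consideration on trimmed-down server runtimes.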

Reply #7 · 2018-12-31 01:43

I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:

noHTMLString = noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

html = html.replaceAll("&nbsp;","");
html = html.replaceAll("&amp;", "");
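
A quick sketch of what the single-regex version does, including a case where it removes more than intended. EntityRegexDemo and dropEntities are names made up for this illustration; the pattern is the one from the answer with the unnecessary backslashes omitted.

```java
final class EntityRegexDemo {

    // One regex in place of a separate replaceAll call per entity.
    static String dropEntities(String s) {
        return s.replaceAll("&.*?;", "");
    }

    public static void main(String[] args) {
        // Entities are deleted, not decoded.
        System.out.println(dropEntities("Tom&nbsp;&amp;&nbsp;Jerry"));  // TomJerry

        // A bare '&' followed later by a ';' takes ordinary text with it.
        System.out.println(dropEntities("R&D is fun; really"));         // R really
    }
}
```

So the shortcut is fine when the goal is to drop entities from known-clean input, but it is not a substitute for decoding them.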