Remove HTML tags from a String

Posted 2018-12-31 01:38

Is there a good way to remove HTML from a Java string? A simple regex like

 replaceAll("\\<.*?>","") 

will work, but entities like &amp; won't be converted, and any non-HTML text that happens to sit between two angle brackets will be removed (i.e. whatever the .*? matches is deleted).
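Both limitations are easy to reproduce. The following is a minimal sketch; the class name RegexStripDemo and the sample strings are made up for illustration and are not part of the original question.

```java
public class RegexStripDemo {
    // Strips anything that looks like a tag, using the regex from the question.
    static String strip(String input) {
        return input.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // The tags go away, but the entity is left undecoded: Tom &amp; Jerry
        System.out.println(strip("<p>Tom &amp; Jerry</p>"));
        // Plain text between '<' and '>' is deleted as if it were a tag: "a  c"
        System.out.println(strip("a < b or b > c"));
    }
}
```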

27 answers
深知你不懂我心
#2 · 2018-12-31 01:38

I think the simplest way to filter out HTML tags is:

private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}
孤独寂梦人
#3 · 2018-12-31 01:38

My 5 cents:

// Decode the &amp; entity back to a literal ampersand.
// String.replace handles every occurrence, including a trailing one,
// which a split-and-rejoin loop mishandles because split() drops
// trailing empty strings.
yourString = yourString.replace("&amp;", "&");
浅入江南
#4 · 2018-12-31 01:39

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you're fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>","")

but you will run into issues if the user enters something malformed, like <bhey!</b>.
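To make the malformed case concrete, here is a small sketch (the class name TagRegexDemo is hypothetical). Because [^>]* happily matches a second '<', the whole malformed string is treated as one tag and the text is lost.

```java
public class TagRegexDemo {
    // Removes every run that starts with '<' and ends at the next '>'.
    static String strip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Well-formed input keeps its text: hey!
        System.out.println(strip("<b>hey!</b>"));
        // Malformed input: "<bhey!</b>" matches as a single "tag",
        // so the result is the empty string.
        System.out.println("[" + strip("<bhey!</b>") + "]");
    }
}
```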

You can also check out JTidy which will parse "dirty" html input, and should give you a way to remove the tags, keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to encode any remaining HTML special characters to keep your output safe.
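The escaping step can be sketched as follows. This is a minimal illustration, not a complete sanitizer; the class and method names (HtmlEscapeDemo, escape) are assumptions, and a maintained library would normally be preferable for production use. The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.

```java
public class HtmlEscapeDemo {
    // Encodes the characters that are special in HTML text and attribute values.
    static String escape(String input) {
        return input
                .replace("&", "&amp;")   // must run first, see note above
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;"); // optional, needed inside attributes
    }

    public static void main(String[] args) {
        // prints: &lt;b&gt;hey!&lt;/b&gt; &amp; &quot;hi&quot;
        System.out.println(escape("<b>hey!</b> & \"hi\""));
    }
}
```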

几人难应
#5 · 2018-12-31 01:39

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So, I used TagSoup instead. Here's a shot that worked for my test case (and a couple of others):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> 
 */
public class Html2Text2 implements ContentHandler {
    private StringBuffer sb;

    public Html2Text2() {
    }

    public void parse(String str) throws IOException, SAXException {
        XMLReader reader = new Parser();
        reader.setContentHandler(this);
        sb = new StringBuffer();
        reader.parse(new InputSource(new StringReader(str)));
    }

    public String getText() {
        return sb.toString();
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Append only the slice the parser reports, not the whole buffer
        sb.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // Same slice handling as characters(); appending all of ch
        // would copy unrelated buffer content
        sb.append(ch, start, length);
    }

    // The methods below do not contribute to the text
    @Override
    public void endDocument() throws SAXException {
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
    }

    @Override
    public void processingInstruction(String target, String data)
            throws SAXException {
    }

    @Override
    public void setDocumentLocator(Locator locator) {
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
    }

    @Override
    public void startDocument() throws SAXException {
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
    }

    @Override
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
    }
}
冷夜・残月
#6 · 2018-12-31 01:41

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        // the HTML to convert; try-with-resources closes the reader
        // even if parsing fails
        try (FileReader in = new FileReader("java-new.html")) {
            Html2Text parser = new Html2Text();
            parser.parse(in);
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

ref : Remove HTML tags from a file to extract only the TEXT
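The same parser can be fed a String directly through a StringReader, which is closer to what the question asks for. The sketch below is a variation on the answer, not the answer's own code: the class name SwingStripDemo is hypothetical, and joining text chunks with a single space is an assumption, since the callback receives each text run separately and may not preserve the whitespace between them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingStripDemo {
    // Parses the HTML held in a String and returns the text runs joined by spaces.
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] text, int pos) {
                // Assumption: separate consecutive text runs with one space
                out.append(text).append(' ');
            }
        };
        try {
            // third argument true: ignore any charset directive in the input
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```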

一个人的天荒地老
#7 · 2018-12-31 01:43

I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:

noHTMLString = noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

html = html.replaceAll("&nbsp;", "");
html = html.replaceAll("&amp;", "");
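Note that both variants delete entities rather than convert them, so "Tom &amp; Jerry" loses its ampersand. If the goal is to keep the characters, a handful of common entities can be decoded instead. The following is only a sketch under stated assumptions: the class name EntityDecodeDemo is hypothetical, the entity list is deliberately short, and &nbsp; is mapped to an ordinary space as a simplification. &amp; is decoded last so that input such as "&amp;lt;" is not decoded twice.

```java
public class EntityDecodeDemo {
    // Decodes a few common named entities; anything else is left untouched.
    static String decode(String input) {
        return input
                .replace("&nbsp;", " ")   // simplification: plain space, not U+00A0
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");   // last, to avoid double decoding
    }

    public static void main(String[] args) {
        // prints: Tom & Jerry <3
        System.out.println(decode("Tom &amp; Jerry &lt;3"));
        // prints: &lt;  (the escaped ampersand is decoded exactly once)
        System.out.println(decode("&amp;lt;"));
    }
}
```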