Convert HTML to plain text in Java

I need to convert HTML to plain text. My only formatting requirement is to retain new lines in the plain text. New lines should be produced not only for <br> but also for other tags, e.g. <tr/> and </p> should lead to a new line too.

Sample HTML pages for testing are:

Note that these are only random URLs.

I have tried out various libraries (JSoup, javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.

Example using JSoup:

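import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test; // JUnit 4 assumed

public class JSoupTest {

    @Test
    public void SimpleParse() {
        try {
            Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
            // text() normalizes whitespace, so the line structure is lost here
            System.out.print(doc.text());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}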

Example with HTMLEditorKit:

import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
 StringBuffer s;

 public Html2Text() {}

 public void parse(Reader in) throws IOException {
   s = new StringBuffer();
   ParserDelegator delegator = new ParserDelegator();
   // the third parameter is TRUE to ignore charset directive
   delegator.parse(in, this, Boolean.TRUE);
 }

 public void handleText(char[] text, int pos) {
   s.append(text);
 }

 public String getText() {
   return s.toString();
 }

 public static void main (String[] args) {
   try {
     // the HTML to convert
    URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
    String inputLine;
    String finalContents = "";
    while ((inputLine = reader.readLine()) != null) {
     finalContents += "\n" + inputLine.replace("<br", "\n<br");
    }
    BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
    writer.write(finalContents);
    writer.close();

     FileReader in = new FileReader("samples/testHtml.html");
     Html2Text parser = new Html2Text();
     parser.parse(in);
     in.close();
     System.out.println(parser.getText());
   }
   catch (Exception e) {
     e.printStackTrace();
   }
 }
}

Tags: java parsing plaintext jsoup htmleditorkit

6 answers

Root（大扎）

Reply #2 · 2019-04-05 14:03

Have your parser append text content and newlines to a StringBuilder.

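import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// `html` (used in the parse call at the end) is the HTML input as a String
final StringBuilder sb = new StringBuilder();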
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
    public boolean readyForNewline;

    @Override
    public void handleText(final char[] data, final int pos) {
        String s = new String(data);
        sb.append(s.trim());
        readyForNewline = true;
    }

    @Override
    public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
            sb.append("\n");
            readyForNewline = false;
        }
    }

    @Override
    public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
        handleStartTag(t, a, pos);
    }
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);


傲

Reply #3 · 2019-04-05 14:07

You can use XSLT for this purpose. Take a look at this link which addresses a similar problem.

Hope it is helpful.
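To make the XSLT idea concrete, here is a minimal sketch using only the JDK's javax.xml.transform API. It assumes the input is already well-formed XHTML (the XML parser will reject ordinary tag soup), and the class name XsltHtmlToText, the method convert and the list of line-ending tags are my own choices for illustration, not something taken from the linked post.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class: converts well-formed XHTML to text with XSLT.
class XsltHtmlToText {

    // local-name() is used so the templates match with or without the XHTML namespace.
    private static final String STYLESHEET = String.join("\n",
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">",
        "  <xsl:output method=\"text\"/>",
        "  <!-- whitespace-only text nodes (indentation) are dropped -->",
        "  <xsl:strip-space elements=\"*\"/>",
        "  <!-- skip content that is never visible text -->",
        "  <xsl:template match=\"*[local-name()='head' or local-name()='script' or local-name()='style']\"/>",
        "  <!-- line-ending tags (an assumed list): copy the content, then emit a newline -->",
        "  <xsl:template match=\"*[local-name()='br' or local-name()='p' or local-name()='div' or local-name()='tr' or local-name()='li']\">",
        "    <xsl:apply-templates/>",
        "    <xsl:text>&#10;</xsl:text>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    static String convert(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Thrown for malformed input, among other things
            throw new IllegalStateException("Input is probably not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(convert(
            "<html><body><p>one</p><p>two<br/>three</p></body></html>"));
    }
}
```

All other elements fall through to XSLT's built-in rules, which simply copy their text. Pages that are not well-formed would have to be cleaned up first, and a DOCTYPE that points at an external DTD may make the parser try to fetch it.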


唯我独甜

Reply #4 · 2019-04-05 14:12

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.
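
A rough sketch of the SAX approach, using only the JDK's parser. It assumes the input is already well-formed XHTML, so the JTidy clean-up step is not shown. The class name SaxHtmlToText, the method convert and the set of tags that end a line are hypothetical choices for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class: collects text and appends a newline after selected tags.
class SaxHtmlToText extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The default factory is not namespace-aware, so qName holds the raw tag name.
        switch (qName.toLowerCase(Locale.ROOT)) {
            case "br":
            case "p":
            case "div":
            case "tr":
            case "li":
                out.append('\n'); // assumed list of line-ending tags
                break;
            default:
                break;
        }
    }

    static String convert(String xhtml) {
        SaxHtmlToText handler = new SaxHtmlToText();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (SAXException | ParserConfigurationException | IOException e) {
            // SAXException is what malformed markup produces
            throw new IllegalStateException("Input is probably not well-formed XHTML", e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert(
            "<html><body><p>one</p><p>two<br/>three</p></body></html>"));
    }
}
```

An empty element such as <br/> still fires endElement, which is why a single callback covers both cases. Text inside head, script or style is not filtered in this sketch.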


趁早两清

Reply #5 · 2019-04-05 14:13

Building on your example, with a hint from the "html to plain text?" message:

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup
{
  public void SimpleParse()
  {
    try
    {
      Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
      // Trick for better formatting
      doc.body().wrap("<pre></pre>");
      String text = doc.text();
      // Converting nbsp entities
      text = text.replaceAll("\u00A0", " ");
      System.out.print(text);
    }
    catch (IOException e)
    {
      e.printStackTrace();
    }
  }

  public static void main(String args[])
  {
    TestJsoup tjs = new TestJsoup();
    tjs.SimpleParse();
  }
}


女痞

Reply #6 · 2019-04-05 14:16

I would guess you could use the ParserCallback.

You would need to add code to support the tags that require special handling. There are:

handleStartTag
handleEndTag
handleSimpleTag

callbacks that should allow you to check for the tags you want to monitor and then append a newline character to your buffer.
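
As a sketch of how those callbacks could fit together (not a definitive implementation), the class below appends a newline when a block-level tag closes or when a simple tag such as <br> is seen. The class name CallbackHtmlToText, the helper breaksLine and the choice of tags are assumptions made for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class built on the Swing HTML parser callbacks.
class CallbackHtmlToText extends HTMLEditorKit.ParserCallback {
    private final StringBuilder out = new StringBuilder();

    // Assumed list of tags that should end a line; extend it as needed.
    private static boolean breaksLine(HTML.Tag t) {
        return t == HTML.Tag.P || t == HTML.Tag.DIV || t == HTML.Tag.TR
            || t == HTML.Tag.LI || t == HTML.Tag.BR;
    }

    @Override
    public void handleText(char[] data, int pos) {
        out.append(data);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (breaksLine(t)) {
            newline();
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (breaksLine(t)) {
            newline();
        }
    }

    // Appends a newline unless the buffer is empty or already ends with one.
    private void newline() {
        int n = out.length();
        if (n > 0 && out.charAt(n - 1) != '\n') {
            out.append('\n');
        }
    }

    static String convert(String html) {
        CallbackHtmlToText callback = new CallbackHtmlToText();
        try {
            // third argument: true means the charset directive is ignored
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.out.toString();
    }

    public static void main(String[] args) {
        System.out.print(convert(
            "<p>one</p><p>two<br>three</p><table><tr><td>a</td></tr><tr><td>b</td></tr></table>"));
    }
}
```

Breaking on the end tag rather than the start tag keeps the newline after the paragraph or row it belongs to, and the check in newline() avoids runs of blank lines when several such tags close in a row.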


Bombasti

Reply #7 · 2019-04-05 14:29

JSoup is not compatible with FreeMarker (or any other custom/non-HTML tags). Consider this the purest solution for converting HTML to plain text.

http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726 My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

