I downloaded a Wikipedia dump and now want to remove the wiki markup from the contents of each page. I tried writing regular expressions, but there are too many cases to handle. I found a Python library, but I need a Java library because I want to integrate it into my code.
Thank you.
Do it in two steps: first convert the MediaWiki markup to HTML, then strip the HTML tags to get plain text. textile-j handles the first step; its artifact is no longer hosted on download.java.net, but an archived copy is still available.
Here is the Web Archive link for download.java.net/maven/2/net/java/textile-j/2.2.
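A minimal sketch of that two-step pipeline, assuming the textile-j 2.2 API (MarkupParser, MediaWikiDialect, HtmlDocumentBuilder) plus jsoup for the tag-stripping step; the package names are from memory of that release, so verify them against the archived jar:

```java
import java.io.StringWriter;

import net.java.textilej.parser.MarkupParser;
import net.java.textilej.parser.builder.HtmlDocumentBuilder;
import net.java.textilej.parser.markup.mediawiki.MediaWikiDialect;
import org.jsoup.Jsoup;

public class WikiMarkupStripper {
    public static void main(String[] args) {
        String wikiText = "'''Bold''' text with an [[Internal link|internal link]].";

        // Step 1: MediaWiki markup -> HTML via textile-j's MediaWiki dialect
        StringWriter writer = new StringWriter();
        MarkupParser parser = new MarkupParser(new MediaWikiDialect());
        parser.setBuilder(new HtmlDocumentBuilder(writer));
        parser.parse(wikiText);
        String html = writer.toString();

        // Step 2: HTML -> plain text by stripping the tags with jsoup
        String plainText = Jsoup.parse(html).text();
        System.out.println(plainText);
    }
}
```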
Mylyn WikiText can convert various wiki syntaxes into HTML and other formats, and it supports the MediaWiki syntax that Wikipedia uses. Although Mylyn WikiText is primarily an Eclipse plugin, it is also available as a standalone library.
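A minimal sketch using the standalone Mylyn WikiText parser; the imports below follow the 3.x package layout (older releases used org.eclipse.mylyn.wikitext.core.*), so adjust them to whichever version you pull in:

```java
import org.eclipse.mylyn.wikitext.mediawiki.MediaWikiLanguage;
import org.eclipse.mylyn.wikitext.parser.MarkupParser;

public class MediaWikiToHtml {
    public static void main(String[] args) {
        String wikiText = "== Heading ==\nSome '''bold''' text with a [[link]].";

        // A MarkupParser configured with the MediaWiki language
        // renders Wikipedia-style markup as HTML
        MarkupParser parser = new MarkupParser(new MediaWikiLanguage());
        String html = parser.parseToHtml(wikiText);
        System.out.println(html);
    }
}
```

From there you can strip the HTML tags with any HTML parser if plain text is the end goal.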
Try the MediaWiki text to plain text approach. You will probably have to adapt the PlainTextConverter class to your needs. Combined with the example for converting Wikipedia texts to HTML, you can transclude template contents.
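The PlainTextConverter class mentioned here comes from the Bliki engine (info.bliki); assuming that is the library meant, a minimal sketch looks like the following. The two URL templates passed to WikiModel are placeholders used to resolve image and internal-link targets:

```java
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;

public class BlikiPlainText {
    public static void main(String[] args) throws Exception {
        String wikiText = "This is ''italic'' text with a [[Hello World]] link.";

        // WikiModel takes URL templates for images and internal links
        WikiModel wikiModel = new WikiModel(
                "https://en.wikipedia.org/wiki/${image}",
                "https://en.wikipedia.org/wiki/${title}");

        // render() with a PlainTextConverter drops the markup instead of
        // emitting HTML; subclass PlainTextConverter to tweak the output
        String plainText = wikiModel.render(new PlainTextConverter(), wikiText);
        System.out.println(plainText);
    }
}
```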
If you need plain text, you should use the WikiClean library: https://github.com/lintool/wikiclean.
I had the same problem, and this was the only efficient solution that worked for me in Java.
There are two use cases:
1) When the text is not in XML format, you need to add the XML tags the library expects. If you processed the XML file earlier and now have the content without its XML structure, just wrap it in xmlStartTag and xmlEndTag as in the sketch after this list, and it will be processed.
2) When you are reading the Wikipedia dump file (an XML file) directly, you can just pass each page through and it is processed as-is.
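A sketch of the first use case, assuming WikiClean's builder API and that clean() looks for the page's <text> element around the raw markup; the exact xmlStartTag/xmlEndTag values below are my assumption, so verify them against the WikiClean README:

```java
import org.wikiclean.WikiClean;

public class WikiCleanExample {
    public static void main(String[] args) {
        // Use case 1: raw MediaWiki markup without the surrounding dump XML
        String content = "'''Anarchism''' is a [[political philosophy]] ...";

        // Assumed wrapper: WikiClean scans for the <text> element of a dump
        // page, so bare markup is wrapped in those tags before cleaning
        String xmlStartTag = "<text xml:space=\"preserve\">";
        String xmlEndTag = "</text>";

        WikiClean cleaner = new WikiClean.Builder()
                .withTitle(false)   // drop the page title from the output
                .withFooter(false)  // drop "References", "See also", etc.
                .build();

        String plainText = cleaner.clean(xmlStartTag + content + xmlEndTag);
        System.out.println(plainText);
    }
}
```

For the second use case, you pass the page content read from the dump straight to clean(), since it already carries the XML structure.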