Example using WikipediaTokenizer in Lucene

I want to use WikipediaTokenizer in lucene project - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html But I never used lucene. I just want to convert a wikipedia string into a list of tokens. But, I see that there are only four methods available in this class, end, incrementToken, reset, reset(reader). Can someone point me to an example to use it.

Thank you.

标签： java parsing programming-languages lucene wikipedia

3条回答

甜甜的少女心

2楼-- · 2019-02-20 15:22

In Lucene 3.0, next() method is removed. Now you should use incrementToken to iterate through the tokens and it returns false when you reach the end of the input stream. To obtain the each token, you should use the methods of the AttributeSource class. Depending on the attributes that you want to obtain (term, type, payload etc), you need to add the class type of the corresponding attribute to your tokenizer using addAttribute method.

Following partial code sample is from the test class of the WikipediaTokenizer which you can find if you download the source code of the Lucene.

...
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

while (tf.incrementToken()) {
  String tokText = termAtt.term();
  //System.out.println("Text: " + tokText + " Type: " + token.type());
  String expectedType = (String) tcm.get(tokText);
  assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
  assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType) == true);
  count++;
  if (typeAtt.type().equals(WikipediaTokenizer.ITALICS)  == true){
    numItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS)  == true){
    numBoldItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY)  == true){
    numCategory++;
  }
  else if (typeAtt.type().equals(WikipediaTokenizer.CITATION)  == true){
    numCitation++;
  }
}
...

0人赞添加讨论(0) 举报

做个烂人

3楼-- · 2019-02-20 15:25

WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));

Token token = new Token();

token = tf.next(token);

http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html

Regards

0人赞添加讨论(0) 举报

Root（大扎）

4楼-- · 2019-02-20 15:29

public class WikipediaTokenizerTest { static Logger logger = Logger.getLogger(WikipediaTokenizerTest.class); protected static final String LINK_PHRASES = "click [[link here again]] click [http://lucene.apache.org here again] [[Category:a b c d]]";

public WikipediaTokenizer testSimple() throws Exception {
    String text = "This is a [[Category:foo]]";
    return new WikipediaTokenizer(new StringReader(text));
}
public static void main(String[] args){
    WikipediaTokenizerTest wtt = new WikipediaTokenizerTest();

    try {
        WikipediaTokenizer x = wtt.testSimple();

        logger.info(x.hasAttributes());

        Token token = new Token();
        int count = 0;
        int numItalics = 0;
        int numBoldItalics = 0;
        int numCategory = 0;
        int numCitation = 0;

        while (x.incrementToken() == true) {
            logger.info("seen something");
        }

    } catch(Exception e){
        logger.error("Exception while tokenizing Wiki Text: " + e.getMessage());
    }


}

0人赞添加讨论(0) 举报

Example using WikipediaTokenizer in Lucene

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间