I want to search for special characters in my index.
I escaped all the special characters in the query string, but when I run a query such as + against the index, Lucene builds the query as +().
As a result it searches no fields.
How do I solve this problem? My index contains these special characters.
This may no longer matter to the original asker, but to be able to search for special characters you need to escape them in the query string and use an analyzer that preserves them, at both index time and query time.
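A minimal sketch of the escaping half. In real code you would call Lucene's `QueryParser.escape(String)`, which does the same thing; the hand-rolled class and method names below are my own, and the character set assumes the classic query-parser syntax:

```java
public class QueryEscapeDemo {
    // Characters with special meaning in the classic Lucene query syntax.
    // (QueryParser.escape handles the same set for you.)
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // prefix each special character with a backslash
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("C++"));     // C\+\+
        System.out.println(escape("(1+1):2")); // \(1\+1\)\:2
    }
}
```

Remember that escaping only protects the query syntax; if the analyzer throws the characters away, the escaped query still matches nothing, which is what the answer below addresses.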
If you are using the StandardAnalyzer, it will discard non-alphanumeric characters. Try indexing the same value with a WhitespaceAnalyzer and see if that preserves the characters you need. It might also keep stuff you don't want: that's when you might consider writing your own Analyzer, which basically means creating a TokenStream stack that does exactly the kind of processing you need. For example, the SimpleAnalyzer implements a pipeline that just lower-cases the tokens.
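A toy illustration of what that pipeline does, in plain Java rather than Lucene's actual classes: split on anything that is not a letter, then lower-case each token. This is also exactly why special characters never make it into the index:

```java
import java.util.ArrayList;
import java.util.List;

public class SimplePipelineDemo {
    // Mimics SimpleAnalyzer's behavior (a letter-based, lower-casing
    // tokenizer): split on anything that is not a letter, lower-case the rest.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The '+', '1', '.' and '2' are silently eaten by the tokenizer.
        System.out.println(analyze("Foo+Bar v1.2")); // [foo, bar, v]
    }
}
```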
The StandardAnalyzer does much more: it runs a grammar-based tokenizer and then a chain of filters that, among other things, lower-case the tokens and remove stop words. You can mix and match from these and other components in org.apache.lucene.analysis, or you can write your own specialized TokenStream instances that are wrapped into a processing pipeline by your custom Analyzer. One other thing to look at is what sort of
CharTokenizer
you're using. CharTokenizer is an abstract class that specifies the machinery for tokenizing text strings. It's used by some simpler Analyzers (but not by the StandardAnalyzer). Lucene comes with two subclasses: a LetterTokenizer and a WhitespaceTokenizer. You can create your own tokenizer that keeps the characters you need and breaks on those you don't by implementing the boolean isTokenChar(char c) method.
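A sketch in that spirit, as plain Java rather than an actual CharTokenizer subclass (only isTokenChar comes from the Lucene API; the rest of the names are mine): keep letters, digits, and the special characters you care about, and break on everything else:

```java
import java.util.ArrayList;
import java.util.List;

public class KeepSpecialsTokenizerDemo {
    // The override point CharTokenizer gives you: return true for
    // characters that belong inside a token. Here we also keep '+' and '#'.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '+' || c == '#';
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // '+' and '#' survive tokenization; whitespace still breaks tokens.
        System.out.println(tokenize("C++ and C# code")); // [C++, and, C#, code]
    }
}
```

With real Lucene you would subclass CharTokenizer, override isTokenChar the same way, and use the same predicate-driven Analyzer at both index time and query time so the two sides tokenize identically.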