I need to write a custom filter for the Solr analysis phase. The idea is to first tokenize the input business name on whitespace, then apply a set of filters for lowercasing, pattern replacement, and stop-word removal. After these filters, I want to merge (concatenate) all the tokens into one token and then apply NGramFilterFactory to generate n-grams from that token.
I want to combine all the tokens (generated initially from the business name) for two reasons: tokens shorter than N (the gram size in the NGramFilter) would otherwise be missing from the Solr index, and users might not insert the proper spaces when entering the business name. Please let me know if anything needs clarification.
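To make the intended result concrete, here is a minimal plain-Java sketch of the pipeline described above, with no Lucene or Solr dependency. The class name `ConcatNGramSketch`, the stop-word list, the "strip non-alphanumerics" pattern, and the gram size of 3 are all assumptions chosen for illustration; they are not my actual configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ConcatNGramSketch {

    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "and"));

    // Split on whitespace, lower-case, strip non-alphanumerics (an assumed
    // pattern replacement), drop stop words, then join what is left.
    static String concatenate(String businessName) {
        StringBuilder joined = new StringBuilder();
        for (String token : businessName.trim().split("\\s+")) {
            String cleaned = token.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
            if (!cleaned.isEmpty() && !STOP_WORDS.contains(cleaned)) {
                joined.append(cleaned);
            }
        }
        return joined.toString();
    }

    // All character n-grams of exactly length n, in order of position.
    static List<String> nGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        String joined = concatenate("A B Computers");
        System.out.println(joined);            // abcomputers
        System.out.println(nGrams(joined, 3)); // [abc, bco, com, omp, mpu, put, ute, ter, ers]
    }
}
```

With a gram size of 3, the one-character tokens "a" and "b" would produce no grams on their own, but after concatenation they contribute to "abc" and "bco". That is the effect I am after.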
I made an attempt to write a custom filter for this, but it is not working properly and I am unable to understand its behavior.
When I query the name "apple", it returns n1 results.
When I query the name "computers", it returns n2 results.
When I query the name "apple computers", it returns n3 results.
When I query the name "computers apple", it returns n4 results.
Here n3 < n1, n3 < n2, and n3 != n4.
Here is the code. I am using Solr 4.10.2 and have included the solr-core jars of the same version.
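    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ConcatFilter extends TokenFilter {

        private CharTermAttribute charTermAtt;
        // Accumulates the text of the tokens seen so far.
        private StringBuilder builder = new StringBuilder();

        public ConcatFilter(TokenStream input) {
            super(input);
            charTermAtt = addAttribute(CharTermAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (input.incrementToken()) {
                // Append the current term to the accumulated text.
                int len = charTermAtt.length();
                char[] buffer = charTermAtt.buffer();
                builder.append(buffer, 0, len);
                // Replace the current term with the accumulated text.
                char[] newBuffer = builder.toString().toCharArray();
                int newLength = builder.length();
                charTermAtt.setEmpty();
                charTermAtt.copyBuffer(newBuffer, 0, newLength);
                charTermAtt.setLength(newLength);
                return true;
            } else {
                // Input exhausted: clear the accumulator.
                builder.delete(0, builder.length());
                return false;
            }
        }
    }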