Elastic Search in ASP.NET - using ampersand sign

Posted 2019-06-14 05:17

Question:

I'm new to Elastic Search in ASP.NET, and I have a problem which I'm, so far, unable to resolve.

The documentation does not list & as a special character, yet when I submit a search the ampersand is ignored entirely. For example, if I search for procter & gamble, the & is dropped. This causes real problems for me, because I have companies with names like M&S: when the & is ignored, I get basically everything that has M or S in it. If I try an exact search (M&S), I have the same problem.

My code is:

void Connect()
{            
    node = new Uri(ConfigurationManager.AppSettings["Url"]);
    settings = new ConnectionSettings(node);
    settings.DefaultIndex(ConfigurationManager.AppSettings["defaultIndex"]);
    settings.ThrowExceptions(true);
    client = new ElasticClient(settings);                        
}

private string escapeChars(string inStr) {
    var temp = inStr;
    temp = temp
        .Replace(@"\", @"\\")
        .Replace(@">",string.Empty)
        .Replace(@"<",string.Empty)
        .Replace(@"{",string.Empty)
        .Replace(@"}",string.Empty)
        .Replace(@"[",string.Empty)
        .Replace(@"]",string.Empty)
        .Replace(@"*",string.Empty)
        .Replace(@"?",string.Empty)
        .Replace(@":",string.Empty)
        .Replace(@"/",string.Empty);
    return temp;
}

And then inside one of my functions

Connect();    
ISearchResponse<ElasticSearch_Result> search_result;            
var QString = escapeChars(searchString);                  
search_result = client.Search<ElasticSearch_Result>(s => s
    .From(0)
    .Size(101)
    .Query(q => 
        q.QueryString(b => 
            b.Query(QString)
            //.Analyzer("whitespace")
            .Fields(fs => fs.Field(f => f.CompanyName))                                
        )
    )
    .Highlight(h => h
        .Order("score")
        .TagsSchema("styled")
        .Fields(fs => fs
            .Field(f => f.CompanyName)
        )
    )
);

I've tried including analyzers, but then I've found out that they change the way tokenizers split words. I haven't been able to implement changes to the tokenizer.

I would like to support the following scenario:

Search: M&S Company Foo Bar

Tokens: M&S Company Foo Bar (as a bonus, it would be nice to also get M and S as separate tokens)

I'm using Elasticsearch 5.0.

Any help is more than welcome. Including better documentation than the one found here: https://www.elastic.co/guide/en/elasticsearch/client/net-api/5.x/writing-queries.html.

Answer 1:

By default, a text field is analyzed with the standard analyzer, which applies the standard tokenizer followed by the lowercase token filter. When you index a value into that field, the standard analyzer runs on the value and the resulting tokens are what get indexed.

An example makes this concrete. Suppose the field companyName (type text) receives the value M&S Company Foo Bar when a document is indexed. After the standard analyzer runs, the resulting tokens are:

m
s
company
foo
bar

Notice that & acts as a delimiter, just like whitespace, when the value is split into tokens.
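One way to check this yourself is to run the sample text through the analyze API from NEST. The following is only a sketch: the cluster URL is an assumption, and the call shape is the one I know from the NEST 5.x/6.x clients (newer clients expose it under client.Indices.Analyze). It should print the tokens the standard analyzer produces for the company name.

```csharp
using System;
using Nest;

class AnalyzeDemo
{
    static void Main()
    {
        // Assumption: a local single-node cluster; replace with your own URL.
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"));
        var client = new ElasticClient(settings);

        // Run the built-in standard analyzer over the sample company name.
        var response = client.Analyze(a => a
            .Analyzer("standard")
            .Text("M&S Company Foo Bar"));

        // Print each token the analyzer produced, one per line.
        foreach (var token in response.Tokens)
        {
            Console.WriteLine(token.Token);
        }
    }
}
```

If m and s come back as separate tokens, the & was dropped at analysis time, not by the query syntax.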

When you query this field without specifying an analyzer in the search request, the query text is analyzed with the same analyzer that was applied to the field at index time. So a search for M&S is tokenized into m and s, and the query actually looks for those two tokens rather than for M&S.

To solve this, change the analyzer for the companyName field. Instead of the standard analyzer, define a custom analyzer that uses the whitespace tokenizer and the lowercase filter (to keep the search case-insensitive). The analyzer of an existing field cannot be changed in place, so this means creating a new index with the settings and mapping below and reindexing the documents:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "companyName": {
          "type": "text",
          "analyzer": "whitespace_lowercase",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
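Since the question uses NEST, the same settings and mapping can probably be expressed through the client as well. This is a hedged sketch, not a tested setup: the index name "companies", the cluster URL, and the one-property ElasticSearch_Result class are assumptions standing in for the real ones, and the calls follow the NEST 5.x/6.x fluent API. Here NEST infers the mapping type name from the class, which also sidesteps the "_doc" type name; as far as I know, Elasticsearch versions before 6.2 reject type names that start with an underscore, so on 5.0 the JSON above would need a different type name.

```csharp
using System;
using Nest;

// Assumption: a minimal stand-in for the question's document type.
public class ElasticSearch_Result
{
    public string CompanyName { get; set; }
}

class CreateIndexDemo
{
    static void Main()
    {
        // Assumption: a local single-node cluster; replace with your own URL.
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"));
        var client = new ElasticClient(settings);

        // "companies" is a hypothetical index name.
        var response = client.CreateIndex("companies", c => c
            .Settings(s => s
                .Analysis(a => a
                    .Analyzers(an => an
                        // Custom analyzer: split on whitespace only, then lowercase.
                        .Custom("whitespace_lowercase", ca => ca
                            .Tokenizer("whitespace")
                            .Filters("lowercase")))))
            .Mappings(m => m
                .Map<ElasticSearch_Result>(mm => mm
                    .Properties(p => p
                        // CompanyName is serialized as companyName by default.
                        .Text(t => t
                            .Name(n => n.CompanyName)
                            .Analyzer("whitespace_lowercase")
                            // Unanalyzed sub-field, mirroring "keyword" in the JSON.
                            .Fields(f => f
                                .Keyword(k => k.Name("keyword"))))))));

        // True when the cluster accepted the index creation request.
        Console.WriteLine(response.Acknowledged);
    }
}
```

Existing documents would then have to be reindexed into the new index before searches see the new tokens.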

Now for the above input the tokens generated will be:

m&s
company
foo
bar

This will ensure that when searching for M&S, & is not ignored.
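To confirm the change end to end, you could analyze the sample text with the new analyzer and then rerun the query from the question. Again a sketch under assumptions: "companies" is a hypothetical index that already has the whitespace_lowercase analyzer defined, the URL is a placeholder, the document class is a stand-in, and the calls are the NEST 5.x/6.x forms.

```csharp
using System;
using Nest;

// Assumption: a minimal stand-in for the question's document type.
public class ElasticSearch_Result
{
    public string CompanyName { get; set; }
}

class VerifyDemo
{
    static void Main()
    {
        // Assumption: a local single-node cluster; replace with your own URL.
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"));
        var client = new ElasticClient(settings);

        // Analyze with the custom analyzer defined on the (hypothetical) index.
        var analysis = client.Analyze(a => a
            .Index("companies")
            .Analyzer("whitespace_lowercase")
            .Text("M&S Company Foo Bar"));

        foreach (var token in analysis.Tokens)
        {
            Console.WriteLine(token.Token);
        }

        // Same query_string search as in the question, against the new index.
        var search = client.Search<ElasticSearch_Result>(s => s
            .Index("companies")
            .Query(q => q
                .QueryString(b => b
                    .Query("M&S")
                    .Fields(fs => fs.Field(f => f.CompanyName)))));

        // Number of documents that matched the query.
        Console.WriteLine(search.Total);
    }
}
```

No search-time analyzer is set here, so the query text goes through the field's own analyzer and M&S stays a single token on both sides.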