Elasticsearch custom analyzer for hyphens, undersc

Posted 2019-04-12 08:22

Question:

Admittedly, I'm not that well versed on the analysis part of ES. Here's the index layout:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "my_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "my_filter": {
                    "type": "word_delimiter",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_filter"]
                }
            }
        }
    }
}

You can see that I've attempted to use a custom analyzer for the hostname field. This kind of works when I use this query to find the host named "WIN_1":

{
    "query": {
        "match": {
            "hostname": "WIN_1"
        }
    }
}

The issue is that it also returns any hostname that has a 1 in it. Using the _analyze endpoint, I can see that the numbers are tokenized as well.

{
    "tokens": [
        {
            "token": "win_1",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "win",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "1",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 2
        }
    ]
}
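
To make the over-matching concrete, here is a rough Python sketch of the analysis chain above. It only approximates `word_delimiter` (ASCII letters and digits, nothing else), and `analyze` and `matches` are hypothetical helper names, not Elasticsearch APIs. A `match` query runs the query text through the same analyzer and, by default, ORs the resulting terms:

```python
import re

def analyze(text):
    """Approximate whitespace tokenizer -> lowercase -> word_delimiter
    with preserve_original (assumption: ASCII-only hostnames)."""
    tokens = set()
    for word in text.lower().split():
        tokens.add(word)                                   # preserve_original
        tokens.update(re.findall(r"[a-z]+|[0-9]+", word))  # letter runs and digit runs
    return tokens

def matches(query, hostname):
    """Default match behaviour: one shared term is enough (operator OR)."""
    return bool(analyze(query) & analyze(hostname))

hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]
hits = [h for h in hosts if matches("WIN_1", h)]

print(sorted(analyze("WIN_1")))  # ['1', 'win', 'win_1']
print(hits)                      # ['WIN_8_ENT_1', 'server1', 'ServA-1']
```

All three test hosts come back because each of them yields the term `1`.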

What I'd like to be able to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host or any host with WIN_1 in its name. Below is some test data.

{
    "ipaddress": "192.168.1.253",
    "hostname": "WIN_8_ENT_1"
}
{
    "ipaddress": "10.0.0.1",
    "hostname": "server1"
}
{
    "ipaddress": "172.20.10.36",
    "hostname": "ServA-1"
}

Hopefully someone can point me in the right direction. It could be that my simple query isn't the right approach either. I've pored over the ES docs, but they aren't very good with examples.

Answer 1:

You could change your analysis to use a pattern analyzer that discards the digits and underscores:

{
    "analysis": {
        "analyzer": {
            "word_only": {
                "type": "pattern",
                "pattern": "([^\\p{L}]+)"
            }
        }
    }
}

Using the analyze API:

curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'

returns:

"tokens" : [ {
    "token" : "win",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
}, {
    "token" : "ent",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 2
} ]
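
If you want to check the split without a cluster, this Python sketch mimics the `word_only` analyzer under a couple of assumptions: Python's `re` has no `\p{L}`, so `[\W\d_]` stands in for "not a letter", and the lowercasing mirrors the pattern analyzer's default. The function name is only illustrative.

```python
import re

def word_only(text):
    # Lowercase, then split on runs of non-letters.
    # [\W\d_]+ approximates the analyzer's [^\p{L}]+ pattern.
    return [t for t in re.split(r"[\W\d_]+", text.lower()) if t]

for host in ["WIN_8_ENT_1", "server1", "ServA-1"]:
    print(host, word_only(host))
# WIN_8_ENT_1 ['win', 'ent']
# server1 ['server']
# ServA-1 ['serva']
```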

Your mapping would become:

{
    "event": {
        "properties": {
            "ipaddress": {
                 "type": "string"
             },
             "hostname": {
                 "type": "string",
                 "analyzer": "word_only",
                 "fields": {
                     "raw": {
                         "type": "string",
                         "index": "not_analyzed"
                     }
                 }
             }
         }
    }
}

You can use a multi_match query to get the results you want:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN_1"
       }
   }
}
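
Roughly, the `multi_match` above checks the analyzed `hostname` terms and the exact `hostname.raw` value, and a hit on either field is enough. The Python below is only a sketch of that behaviour with hypothetical helper names. It ignores scoring and uses `[\W\d_]` as a stand-in for `[^\p{L}]`.

```python
import re

def word_only(text):
    # Approximation of the word_only pattern analyzer (lowercase + split on non-letters)
    return [t for t in re.split(r"[\W\d_]+", text.lower()) if t]

def multi_match(query, hostname):
    # hostname: query and field are both analyzed, any shared term is a hit
    analyzed_hit = bool(set(word_only(query)) & set(word_only(hostname)))
    # hostname.raw: not_analyzed, so only the exact, case-sensitive value matches
    raw_hit = query == hostname
    return analyzed_hit or raw_hit

hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]
print([h for h in hosts if multi_match("WIN_1", h)])    # ['WIN_8_ENT_1']
print([h for h in hosts if multi_match("ServA-1", h)])  # ['ServA-1']
```

Note that on the analyzed field "WIN_1" is reduced to the single term `win`, so it matches every host containing that word. The raw field only contributes when the query is the complete hostname.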


Answer 2:

Here's the analyzer and queries I ended up with:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "hostname_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "hostname_filter": {
                    "type": "pattern_capture",
                    "preserve_original": 0,
                    "patterns": [
                        "(\\p{Ll}{3,})"
                    ]
                }
            },
            "analyzer": {
                "hostname_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [  "lowercase", "hostname_filter" ]
                }
            }
        }
    }
}
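
For anyone curious what `hostname_analyzer` emits, here is a small Python approximation. Assumptions: `[a-z]{3,}` stands in for `\p{Ll}{3,}` (ASCII only), and a token with no match is passed through unchanged, which is my understanding of how `pattern_capture` behaves. It's worth confirming against `_analyze` on your own cluster.

```python
import re

def hostname_analyzer(text):
    tokens = []
    for word in text.lower().split():                # whitespace tokenizer + lowercase
        captured = re.findall(r"[a-z]{3,}", word)    # runs of 3 or more lowercase letters
        # Assumption: with no match the original token is kept
        tokens.extend(captured or [word])
    return tokens

for host in ["WIN_8_ENT_1", "server1", "ServA-1", "pc_win_1"]:
    print(host, hostname_analyzer(host))
# WIN_8_ENT_1 ['win', 'ent']
# server1 ['server']
# ServA-1 ['serva']
# pc_win_1 ['win']
```

The `{3,}` quantifier is what drops short fragments such as `pc`.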

Queries:

Find host name starting with:

{
    "query": {
        "prefix": {
            "hostname.raw": "WIN_8"
        }
    }
}
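
One thing to keep in mind with this query: `prefix` is not analyzed, and `hostname.raw` stores the value exactly as indexed, so the prefix has to match the original case. A tiny Python sketch of that behaviour (`prefix_raw` is just an illustrative name):

```python
hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]

def prefix_raw(prefix):
    # hostname.raw is not_analyzed, so comparison is against the stored value as-is
    return [h for h in hosts if h.startswith(prefix)]

print(prefix_raw("WIN_8"))  # ['WIN_8_ENT_1']
print(prefix_raw("win_8"))  # []
```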

Find host name containing:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN"
       }
   }
}

Thanks to Dan for getting me in the right direction.



Answer 3:

When ES 1.4 is released, there will be a new filter called 'keep types' that will allow you to keep only certain token types once the string is tokenized (e.g. words only or numbers only).

Check it out here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter

This may be a more convenient solution for your needs in the future.
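
As a rough idea of what that could look like, here is a settings sketch built in Python and printed as JSON. The filter and analyzer names (`words_only`, `hostname_words`) are made up for illustration, and it assumes the `standard` tokenizer, which tags tokens as `<ALPHANUM>` or `<NUM>`. Run your hostnames through `_analyze` first to see how that tokenizer actually splits them before relying on this.

```python
import json

settings = {
    "analysis": {
        "filter": {
            # keep_types drops every token whose type is not listed
            "words_only": {"type": "keep_types", "types": ["<ALPHANUM>"]}
        },
        "analyzer": {
            "hostname_words": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "words_only"],
            }
        },
    }
}

# Request body for index creation
print(json.dumps({"settings": settings}, indent=4))
```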



Answer 4:

It looks like you want to apply two different types of searches on your hostname field. One for exact matches, and one for a variation of wildcard (maybe in your specific case, a prefix query).

After trying to implement all types of different searches using several different analyzers, I've found it sometimes simpler to add another field to represent each type of search you want to do. Is there a reason you do not want to add another field like the following:

{ "ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1", "system": "WIN" }
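
If you go the extra-field route, the value can be derived before indexing. A minimal Python sketch, assuming "system" means the leading run of letters in the hostname (that convention and the `with_system` helper are my own invention, so adjust to whatever naming scheme your hosts follow):

```python
import re

def with_system(doc):
    """Return a copy of doc with a 'system' field taken from the
    leading letters of the hostname, uppercased."""
    m = re.match(r"[A-Za-z]+", doc["hostname"])
    enriched = dict(doc)
    enriched["system"] = m.group(0).upper() if m else None
    return enriched

doc = {"ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1"}
print(with_system(doc))
# {'ipaddress': '192.168.1.253', 'hostname': 'WIN_8_ENT_1', 'system': 'WIN'}
```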

Otherwise, you could consider writing your own custom filter that does effectively the same thing under the hood. Your filter will read in your hostname field and index the exact keyword and a substring that matches your stemming pattern (e.g. WIN in WIN_8_ENT_1).

I do not think there is any existing analyzer/filter combination that can do what you are looking for, assuming I have understood your requirements correctly.