ElasticSearch: Can we apply both n-gram and language analyzers?

Published 2019-02-27 19:14

Question:

Thanks a lot @Random, I have modified the mapping as follows. For testing I have used "movie" as my type for indexing. Note: I have also added a search_analyzer; I was not getting proper results without it. However, I have the following doubts about using search_analyzer:

1] Can we use a custom search_analyzer together with language analyzers?
2] Am I getting all the results because of the n-gram analyzer I have used, and not because of the english analyzer? (One way to check this with the _analyze API is sketched after the mapping below.)

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_ngram": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                        "ngram_filter"
                    ],
                    "tokenizer": "whitespace"
                },
                "search_analyzer":{
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": "lowercase"
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 1,
                    "max_gram": 25
                }
            }
        }
    },
      "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "en": {
              "type":     "string",
              "analyzer": "english_ngram",
              "search_analyzer": "search_analyzer"
            }
          }
        }
      }
    }
  }
}
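
Regarding doubt 2], one way to see which part of the chain is doing the work (a sketch, assuming the index was created as "movies" with these settings; on some Elasticsearch versions the analyzer and text have to be passed as URL parameters rather than a request body) is to run a sample title through the analyzer and inspect the emitted tokens:

GET http://localhost:9200/movies/_analyze

{
    "analyzer": "english_ngram",
    "text": "movies"
}

If the English chain is active, the tokens should all be substrings of a stem such as "movi", with no full "movie" or "movies" token, so both the stemmer and the n-gram filter contribute to the matches. Running the same request with "analyzer": "search_analyzer" shows what the query side produces.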

Update:

Using the search analyzer is also not working consistently, and I need more help with this. I am updating the question with my findings.

I used the following mapping as suggested (note: this mapping does not use a search analyzer); for simplicity, let's consider only the English analyzer.

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_ngram": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                        "ngram_filter"
                    ],
                    "tokenizer": "standard"
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 25
                }
            }
        }
    }
}

Created the index with the settings above, then indexed a test document:

PUT http://localhost:9200/movies/movie/1

{"title":"$peci@l movie"}

Tried the following query:

GET http://localhost:9200/movies/movie/_search

{
    "query": {
        "multi_match": {
            "query": "$peci mov",
            "fields": ["title"],
            "operator": "and"
        }
    }
}

I got no results for this; am I doing anything wrong? (One way to inspect the indexed tokens is sketched after the list below.) I am trying to get results for:

1] Special characters
2] Partial matches
3] Space separated partial and full words
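
A quick way to inspect what actually went into the index (a sketch; the movies/movie index and type, the document id 1 and the title field are taken from the steps above, and term vectors are computed on the fly from the stored _source) is the term vectors API:

GET http://localhost:9200/movies/movie/1/_termvectors?fields=title

Comparing the terms it lists with the tokens that the _analyze API produces for the query text "$peci mov" shows whether every query token can be found in the index; with "operator": "and" all of them have to match. If the listed terms are plain words rather than n-grams, the title field is not actually using the english_ngram analyzer.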

Thanks again!

Answer 1:

You can create a custom analyzer based on a language analyzer. The only difference is that you add your ngram_filter token filter to the end of the chain. In this case you first get language-stemmed tokens (the default chain), which are then converted to edge n-grams at the end (your filter). You can find the definitions of the built-in language analyzers at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#english-analyzer and use them as a template to override. Here is an example of this change for the english language:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_ngram": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                        "ngram_filter"
                    ],
                    "tokenizer": "standard"
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 25
                }
            }
        }
    }
}
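
The settings above only define the analyzer; it still has to be attached to a field to take effect. Below is a minimal sketch of that step (the movie type and title field are taken from the question; english_search is a hypothetical second analyzer, assumed to be defined like english_ngram but without ngram_filter, so that query terms are stemmed but not n-grammed again):

PUT http://localhost:9200/movies/_mapping/movie

{
    "properties": {
        "title": {
            "type": "string",
            "analyzer": "english_ngram",
            "search_analyzer": "english_search"
        }
    }
}

With edge n-grams only on the index side, a partial query term such as "mov" can match the stored prefixes directly, while full words still go through the same stemming as at index time.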

UPDATE

To support special characters you can try using the whitespace tokenizer instead of standard. In this case these characters will be part of your tokens:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "english_ngram": {
                    "type": "custom",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_stemmer",
                        "ngram_filter"
                    ],
                    "tokenizer": "whitespace"
                }
            },
            "filter": {
                "english_stop": {
                    "type": "stop"
                },
                "english_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                },
                "ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 25
                }
            }
        }
    }
}
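
As a quick check (a sketch, assuming the index is recreated as "movies" with these settings and the title field mapped to english_ngram), the _analyze API shows that the special characters now survive tokenization:

GET http://localhost:9200/movies/_analyze

{
    "analyzer": "english_ngram",
    "text": "$peci@l movie"
}

With the whitespace tokenizer the "$" and "@" stay inside the token, so the edge n-grams begin with "$", "$p", "$pe", and so on, and a query term like "$peci" can match them; with the standard tokenizer those characters are stripped before the n-gram filter ever runs.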