Azure Search: Searching for singular version of a

Posted 2019-03-01 08:56

Question:

I have a question about a peculiar behavior I noticed in my custom analyzer (as well as in the fr.microsoft analyzer). The Analyze API tests below use the “fr.microsoft” analyzer, but I saw exactly the same behavior with my custom analyzer “text_contains_search_custom_analyzer” (which makes sense, since I based it on the fr.microsoft analyzer).

UAT reported that when they search for “femme” (singular), they expect documents with “femmes” (plural) to be found as well. But when I tested with the Analyze API, it appears that the Azure Search service tokenizes a plural into both the plural and the singular token, whereas a singular produces only the singular token. See below for examples.

Is there a way I can allow a user to search for the singular version of a word, but still include the plural version of that word in the search results? Or will I need to use synonyms to overcome this issue?

Request with “femme”:

    { "analyzer": "fr.microsoft", "text": "femme" }

Response from “femme”:

    {
      "@odata.context": "https://EXAMPLESEARCHINSTANCE.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
      "tokens": [
        { "token": "femme", "startOffset": 0, "endOffset": 5, "position": 0 }
      ]
    }

Request with “femmes”:

    { "analyzer": "fr.microsoft", "text": "femmes" }

Response from “femmes”:

    {
      "@odata.context": "https://EXAMPLESEARCHINSTANCE.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
      "tokens": [
        { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 },
        { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }
      ]
    }
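
For anyone who wants to repeat these tests from a script, here is a minimal Python sketch. It assumes a hypothetical index named `my-index` and a placeholder API key, and it infers the `api-version` value from the `V2016_09_01` shown in the response context. The request is only built here, not sent.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, analyzer, text,
                          api_version="2016-09-01"):
    """Build (but do not send) a POST request to the Analyze API."""
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def token_strings(analyze_response):
    """Extract just the token strings from a parsed Analyze API response."""
    return [t["token"] for t in analyze_response["tokens"]]


# Hypothetical index name and key; replace with real values before sending.
req = build_analyze_request("EXAMPLESEARCHINSTANCE", "my-index",
                            "<admin-api-key>", "fr.microsoft", "femmes")
# To actually send it: json.load(urllib.request.urlopen(req))
```

Passing the parsed response to `token_strings` gives a flat list of tokens, which makes it easier to compare what different inputs produce.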

Answer 1:

Just to add to yoape's response, the fr.microsoft analyzer reduces inflected words to their base form. In your case, the word femmes is reduced to its singular form femme. All cases that you described will work:

  1. Searching with the base form of a word if an inflected form was in the document.

    Let's say you're indexing a document with the text "Vive with Femmes".
    The search engine will index the following terms: vif, vivre, vive, femme, femmes.
    If you search with any of these terms e.g., femme, the document will match.

  2. Searching with an inflected form of a word if the base form was in the document.

    Let's say you're indexing a document with the text "Femme fatale".
    The search engine will index the following terms: femme, fatal, fatale.
    If you search with the term femmes, the analyzer will also produce its base form. Your query will become femmes OR femme. Documents with any of these terms will match.

  3. Searching with an inflected form if another inflected form of that word was in the document.

    If you have a document with allez, the terms allez and aller will be indexed.
    If you search for alle, the query becomes alle OR aller. Since both inflected forms are reduced to the same base form, the document will match.

The key takeaway here is that the analyzer processes not only the documents but also the query terms. Terms are normalized according to language-specific rules.
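
To make the three cases concrete, here is a small Python sketch. It is a toy model, not the real fr.microsoft analyzer: the `ANALYZED` lookup table is a hypothetical stand-in whose token lists are simply copied from the examples in this thread. The same `analyze` function runs on both documents and queries, which is why the terms overlap.

```python
from collections import defaultdict

# Toy stand-in for fr.microsoft: token lists copied from the examples above.
# The real analyzer derives these from language rules, not a lookup table.
ANALYZED = {
    "femme": ["femme"],
    "femmes": ["femme", "femmes"],
    "fatale": ["fatal", "fatale"],
    "allez": ["allez", "aller"],
    "alle": ["alle", "aller"],
}


def analyze(text):
    """Lowercase, split on whitespace, and expand each word via the table."""
    terms = []
    for word in text.lower().split():
        terms.extend(ANALYZED.get(word, [word]))
    return terms


def build_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """OR together every term the analyzer produces for the query."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "Femmes", 2: "Femme fatale", 3: "allez"}
index = build_index(docs)
```

With this setup, `search(index, "femme")` and `search(index, "femmes")` both return documents 1 and 2, and `search(index, "alle")` returns document 3 through the shared base form aller.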

I hope that explains it.



Answer 2:

You are using the Analyze API, which runs text analyzers; that is not the same as searching with the Search API.

Text analyzers support the search engine when it builds its indexes, which are what really sit at the bottom of a search engine. To structure a search index, the documents that go into it need to be analyzed, and this is where analyzers come in. They understand different languages and can parse a text and make sense of it, i.e. splitting up words, removing stop words, understanding sentences, and so on. Or, as the docs put it (https://docs.microsoft.com/en-us/rest/api/searchservice/language-support):

Searchable fields undergo analysis that most frequently involves word-breaking, text normalization, and filtering out terms. By default, searchable fields in Azure Search are analyzed with the Apache Lucene Standard analyzer (standard lucene) which breaks text into elements following the "Unicode Text Segmentation" rules. Additionally, the standard analyzer converts all characters to their lower case form.

So what you are seeing is actually perfectly right: the French analyzer breaks down the word you send in and returns the possible tokens from the text. For the first text it cannot find any possible tokens other than 'femme' (I guess there are no other words like 'fem' or 'femm' in French?), but for the second one it can find both 'femme' and 'femmes'.

So, what you are seeing is a natural function of a text analyzer.

Searching for the same text using the Search API, on the other hand, should return documents containing either 'femme' or 'femmes', provided you have set the right analyzer (for instance fr.microsoft) on the searchable field. The default 'standard' analyzer does not handle plurals and other inflections of the same word.
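
As a rough illustration of what setting the analyzer on a field looks like, here is a field definition written as a Python dict and serialized to JSON, in the shape it would take inside an index schema. The field name `description` is hypothetical; `analyzer` is the per-field property that selects the language analyzer.

```python
import json

# Hypothetical field name. The "analyzer" property applies the chosen
# analyzer to this field at both indexing time and query time.
field = {
    "name": "description",
    "type": "Edm.String",
    "searchable": True,
    "analyzer": "fr.microsoft",
}

print(json.dumps(field, indent=2))
```

As far as I know, the analyzer of an existing field cannot be changed in place, so this is normally set when the field is first created.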