I had much success building my own little search with elasticsearch in the background. But there is one thing I couldn't find in the documentation.
I'm indexing the names of musicians and bands. There is one band called "The The" and due to the stop words list this band is never indexed.
I know I can ignore the stop words list completely but this is not what I want since the results searching for other bands like "the who" would explode.
So, is it possible to save "The The" in the index without disabling the stop words entirely?
You can use the synonym filter to convert "The The" into a single token, e.g. "thethe", which won't be removed by the stopwords filter.

First, configure the analyzer:
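A minimal sketch of what such settings might look like, written as a Python dict that would be sent as the JSON body when creating the index. The name "syn" (used here for both the filter and the analyzer) and the short stopword set are made up for illustration, and the analyze function is only a toy model of the lowercase, synonym, stop chain, not Elasticsearch itself:

```python
import json

# Index settings: a synonym filter that collapses "the the" into one token,
# placed BEFORE the stop filter so the stopword list never sees "the the".
# The filter/analyzer name "syn" is a hypothetical choice.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "syn": {"type": "synonym", "synonyms": ["the the => thethe"]}
            },
            "analyzer": {
                "syn": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "syn", "stop"],
                }
            },
        }
    }
}

# Tiny stand-in for the real stopword list (assumption, for the toy model only).
STOPWORDS = {"the", "a", "an", "and", "of"}


def analyze(text):
    """Toy model of the chain above: lowercase -> synonym -> stop."""
    words = text.lower().split()
    merged, i = [], 0
    while i < len(words):
        # Greedy left-to-right match of the two-word rule "the the".
        if words[i:i + 2] == ["the", "the"]:
            merged.append("thethe")
            i += 2
        else:
            merged.append(words[i])
            i += 1
    # The stop filter runs last, so "thethe" survives and a lone "the" does not.
    return [w for w in merged if w not in STOPWORDS]


if __name__ == "__main__":
    print(json.dumps(settings, indent=2))
    print(analyze("The The The Who"))
```

The order of the filters is the important part: the synonym rule has to run before the stop filter, otherwise both "the" tokens are already gone by the time the rule could match.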
Then test it with the string "The The The Who".
"The The" has been tokenized as "thethe", and "The Who" as "who", because the preceding "the" was removed by the stopwords filter.

To stop or not to stop
Which brings us back to the question of whether we should include stopwords or not. You said:
'the results searching for other bands like "the who" would explode'
What do you mean by that? Explode how? Index size? Performance?
Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.
Indexing stopwords won't have a huge impact on index size. For instance, indexing the word "the" means adding a single term to the index. You already have thousands of terms, so indexing the stopwords as well won't make much difference to size or to performance.

Actually, the bigger problem is that "the" is very common and thus will have a low impact on relevance, so a search for "The The concert Madrid" will prefer "Madrid" over the other terms. This can be mitigated by using a shingle filter, which would result in tokens such as "the the", "the concert" and "concert madrid". While "the" may be common, "the the" isn't, and so will rank higher.

You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.
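To make the shingle idea concrete, here is a small pure-Python illustration. It does not call Elasticsearch; the shingles and document_frequency helpers and the sample band names are invented for this example. It shows the two-word shingles for the query above, and why "the the" is rare even though "the" is not:

```python
from collections import Counter


def unigrams(text):
    """Lowercased single-word tokens."""
    return text.lower().split()


def shingles(text, size=2):
    """Lowercased word n-grams ("shingles") of the given size."""
    words = unigrams(text)
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def document_frequency(docs, tokenizer):
    """Count in how many documents each token appears at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenizer(doc)))
    return df


# Hypothetical sample data, just to compare how common each token is.
BANDS = ["The The", "The Who", "The Cure", "The Clash"]

if __name__ == "__main__":
    print(shingles("The The concert Madrid"))
    print(document_frequency(BANDS, unigrams)["the"])
    print(document_frequency(BANDS, shingles)["the the"])
```

In this sample, "the" appears in every name, while the shingle "the the" appears in only one, which is the kind of rarity that lets a match on it count for more in scoring.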
We can use a multi-field to analyze the text field in two different ways:

Then use a multi_match query to query both versions of the field, giving the shingled version more "boost"/relevance. In this example, text.shingle^2 means that we want to boost that field by 2:
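Both steps, the multi-field mapping and the multi_match query, might look roughly like the sketch below, again written as Python dicts that would be serialized to JSON request bodies. It assumes a recent Elasticsearch version (typeless mappings with a "text" field type); the names "shingle_filter" and "shingle" and the shingle sizes are illustrative choices, not something prescribed above:

```python
import json

# Body for creating the index: "text" is analyzed with the standard analyzer,
# and its sub-field "text.shingle" with a custom shingle analyzer.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "shingle_filter": {  # hypothetical name
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,  # emit only word pairs
                }
            },
            "analyzer": {
                "shingle": {  # hypothetical name
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingle_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                    "shingle": {"type": "text", "analyzer": "shingle"}
                },
            }
        }
    },
}

# Body for the search: query both versions of the field and give the
# shingled one twice the weight via the "^2" suffix.
search_body = {
    "query": {
        "multi_match": {
            "query": "the the concert madrid",
            "fields": ["text", "text.shingle^2"],
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(index_body, indent=2))
    print(json.dumps(search_body, indent=2))
```

Setting output_unigrams to false keeps single words out of the shingled sub-field, so it only contributes to the score when a word pair such as "the the" actually matches.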