Elasticsearch - query primary and secondary attributes

Posted 2019-08-18 08:23

Question:

I'm using Elasticsearch to query data that was originally exported from several relational databases that had a lot of redundancies. I now want to perform queries where I have a primary attribute and one or more secondary attributes that should match. I tried using a bool query with a must term and a should term, but that doesn't seem to work for my case, which may look like this:

Example:

I have a document with the fullname and street name of a user, and I want to search for similar users in different indices. So the best match for my query should be the best match on the fullname field and the best match on the street field. But since the original data has a lot of redundancies and inconsistencies, the field fullname (which I manually created out of the fields name1, name2, name3) may contain the same name multiple times, and it seems that Elasticsearch ranks a double match in a must field higher than a match in a should attribute.

That means, I want to query for John Doe Back Street with the following sample data:

{
    "fullname" : "John Doe John and Jane",
    "street" : "Main Street"
}
{
    "fullname" : "John Doe",
    "street" : "Back Street"
}

Long story short, I want to query for a main attribute (fullname: John Doe) and a secondary attribute (street: Back Street), and I want the second document to be the best match, not the first one, which currently wins only because it contains John multiple times.

Answer 1:

Manipulating relevance in Elasticsearch is not easy. The score calculation is based on three main factors:

  • Term frequency
  • Inverse document frequency
  • Field-length norm

In short:

  • the more often a term occurs in the field, the MORE relevant the document is
  • the more often a term occurs across the entire index, the LESS relevant it is
  • the longer the field is, the LESS relevant a match in it is
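The three factors above can be sketched in a few lines of Python. This is only an illustration based on the classic TF/IDF formulas described in the Elasticsearch guide; newer Elasticsearch versions use BM25 by default, so real scores will differ. The function names and the whitespace tokenization are assumptions made for this example, not part of Elasticsearch.

```python
import math

# Illustrative only: classic Lucene-style TF/IDF factors.
# Recent Elasticsearch versions default to BM25, so actual scores differ.


def term_frequency(freq: int) -> float:
    """More occurrences of the term in the field raise the factor."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs: int, doc_freq: int) -> float:
    """Terms that appear in many documents of the index get a lower factor."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms: int) -> float:
    """Longer fields get a lower factor."""
    return 1.0 / math.sqrt(num_terms)


def factors(field_text: str, term: str, num_docs: int, doc_freq: int) -> dict:
    # Assumption: naive lowercase + whitespace split stands in for the analyzer.
    tokens = field_text.lower().split()
    return {
        "tf": term_frequency(tokens.count(term.lower())),
        "idf": inverse_document_frequency(num_docs, doc_freq),
        "norm": field_length_norm(len(tokens)),
    }


if __name__ == "__main__":
    # The two sample fullname values from the question; "john" occurs in both.
    for fullname in ("John Doe John and Jane", "John Doe"):
        print(fullname, factors(fullname, "john", num_docs=2, doc_freq=2))
```

Printing the factors for both sample documents makes it easier to see how the repeated "John" raises the term frequency while the longer field lowers the norm.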

I recommend reading the following materials:

  • What Is Relevance?
  • Theory Behind Relevance Scoring
  • Controlling Relevance and subpages

If, in your case, a match on fullname is generally more important than a match on street, you can boost the importance of the first one. Below is an example based on my working code:

{
  "query": {
    "multi_match": {
      "query": "john doe",
      "fields": [
        "fullname^10",
        "street"
      ]
    }
  }
}

In this example, a match in fullname is weighted ten times (^10) more heavily than a match in street. You can experiment with the boost or use other ways to control relevance, but as I mentioned at the beginning, this is not easy and everything depends on your particular situation. This is mostly because of the "inverse document frequency" part, which considers terms across the entire index: each document added to the index will probably change the score of the same search query.
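For the case in the question, where fullname and street carry different search terms, one possible variation is a bool query with a separate boost on each clause. The sketch below only builds the request body as a Python dictionary; the helper name build_query and the default boost values are assumptions for illustration, and whether a given boost outweighs a repeated name in fullname has to be checked against your own data.

```python
import json


def build_query(fullname: str, street: str,
                fullname_boost: float = 1.0,
                street_boost: float = 1.0) -> dict:
    """Build a bool query body: fullname must match, street should match.

    Hypothetical helper; the boosts are knobs to tune, not recommended values.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"fullname": {"query": fullname,
                                            "boost": fullname_boost}}}
                ],
                "should": [
                    {"match": {"street": {"query": street,
                                          "boost": street_boost}}}
                ],
            }
        }
    }


if __name__ == "__main__":
    body = build_query("John Doe", "Back Street", street_boost=2.0)
    # Send this JSON as the body of a _search request.
    print(json.dumps(body, indent=2))
```

Running the search with "explain": true added to the body shows how each clause contributes to the final score, which helps when tuning the boosts.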

I know that I did not answer your question directly, but I hope this helped you understand how scoring works.