How to count of fields with the same value in elas

2019-08-09 10:48发布

问题:

I have a list of communities. And I need to create aggregation query which will count all data which have the same title.

[
  {
    "_id": "56161cb3cbdad2e3b437fdc3",
    "_type": "Comunity",
    "name": "public",
    "data": [
      {
        "title": "sonder",
        "creationDate": "2015-08-22T03:43:28 -03:00",
        "quantity": 0
      },
      {
        "title": "vule",
        "creationDate": "2014-05-17T12:35:01 -03:00",
        "quantity": 0
      },
      {
        "title": "omer",
        "creationDate": "2015-01-31T04:54:19 -02:00",
        "quantity": 3
      },
      {
        "title": "sonder",
        "creationDate": "2014-05-22T05:09:36 -03:00",
        "quantity": 3
      }
    ]
  },
  {
    "_id": "56161cb3dae30517fc133cd9",
    "_type": "Comunity",
    "name": "static",
    "data": [
      {
        "title": "vule",
        "creationDate": "2014-07-01T06:32:06 -03:00",
        "quantity": 5
      },
      {
        "title": "vule",
        "creationDate": "2014-01-10T12:40:28 -02:00",
        "quantity": 1
      },
      {
        "title": "vule",
        "creationDate": "2014-01-09T09:33:11 -02:00",
        "quantity": 3
      }
    ]
  },
  {
    "_id": "56161cb32f62b522355ca3c8",
    "_type": "Comunity",
    "name": "public",
    "data": [
      {
        "title": "vule",
        "creationDate": "2014-02-03T09:55:28 -02:00",
        "quantity": 2
      },
      {
        "title": "vule",
        "creationDate": "2015-01-23T09:14:22 -02:00",
        "quantity": 0
      }
    ]
  }
]

So desire result should be

[
  {
    title: vule,
    total: 6
  },
  {
    title: omer,
    total: 1
  },
  {
    title: sonder,
    total: 1
  }
]

I wrote some aggregation queries but it still not work. How can I get desire result?

PS: I tried to create nested aggregation

ejs.Request().size(0).agg(
        ejs.NestedAggregation('comunities')
            .path('data')
            .agg(
                ejs.FilterAggregation('sonder')
                    .filter(
                    ejs.TermsFilter('data.title', 'sonder')
                ).agg(
                ejs.ValueCountAggregation('counts')
                      .field('data.title')
)
            )
    );

回答1:

You need to use terms aggregations.

Now depending on your mapping there could be two ways of doing that:

1. Your data field is stored as an subdocument

You need to run a simple terms aggregation, which in RAW json looks like:

POST /test/test/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "Grouping": {
      "terms": {
        "field": "data.title",
        "size": 0
      }
    }
  }
}

2. Your data field is stored as an nested document

You have to add a nested subaggregation before doing terms aggregation.

POST /test/test/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "Nest": {
      "nested": {
        "path": "data"
      },
      "aggs": {
        "Grouping": {
          "terms": {
            "field": "data.title",
            "size": 0
          }
        }
      }
    }
  }
}

Both will output this:

{
   "took": 125,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "Nest": {
         "doc_count": 9,
         "Grouping": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
               {
                  "key": "vule",
                  "doc_count": 6            -- The Total count you're looking for
               },
               {
                  "key": "sonder",
                  "doc_count": 2
               },
               {
                  "key": "omer",
                  "doc_count": 1
               }
            ]
         }
      }
   }
}

This, unfortunately, is just a raw query, but I imagine that it can be translated into elastic.js quite easily.

On top of that. If you're going to do aggregations, don't forget to set your fields, that you're doing aggregations on, as not_analyzed, because it will start counting individual tokens as in documentation

I, myself, would store these documens as nested ones.

Example:

Mappings:

PUT /test
{
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string"
        },
        "data": {
          "type": "nested",
          "properties": {
            "title": {
              "type": "string",
              "index": "not_analyzed",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzed": "standard"
                }
              }
            },
            "creationDate": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "quantity": {
              "type": "integer"
            }
          }
        }
      }
    }
  }
}

Test data:

PUT /test/test/56161cb3cbdad2e3b437fdc3
{
  "name": "public",
  "data": [
    {
      "title": "sonder",
      "creationDate": "2015-08-22T03:43:28",
      "quantity": 0
    },
    {
      "title": "vule",
      "creationDate": "2014-05-17T12:35:01",
      "quantity": 0
    },
    {
      "title": "omer",
      "creationDate": "2015-01-31T04:54:19",
      "quantity": 3
    },
    {
      "title": "sonder",
      "creationDate": "2014-05-22T05:09:36",
      "quantity": 3
    }
  ]
}

PUT /test/test/56161cb3dae30517fc133cd9
{
  "name": "static",
  "data": [
    {
      "title": "vule",
      "creationDate": "2014-07-01T06:32:06",
      "quantity": 5
    },
    {
      "title": "vule",
      "creationDate": "2014-01-10T12:40:28",
      "quantity": 1
    },
    {
      "title": "vule",
      "creationDate": "2014-01-09T09:33:11",
      "quantity": 3
    }
  ]
}

PUT /test/test/56161cb32f62b522355ca3c8
{
  "name": "public",
  "data": [
    {
      "title": "vule",
      "creationDate": "2014-02-03T09:55:28",
      "quantity": 2
    },
    {
      "title": "vule",
      "creationDate": "2015-01-23T09:14:22",
      "quantity": 0
    }
  ]
}

Actual query:

POST /test/test/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "Nest": {
      "nested": {
        "path": "data"
      },
      "aggs": {
        "Grouping": {
          "terms": {
            "field": "data.title",
            "size": 0
          }
        }
      }
    }
  }
}

P.S. "size":0 means that I'm letting Elasticsearch output all possible terms and not limiting its output as described in documentation.

The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned). If set to 0, the size will be set to Integer.MAX_VALUE.