Elasticsearch C# NEST: getting top words with 5.1

Posted 2019-09-14 08:14

Question:

I have an Elasticsearch document type with these fields:

[Keyword]
public List<string> Tags { get; set; }
[Text]
public string Title { get; set; }

Previously, I got the top Tags across all documents with this code:

var Match = Driver.Search<Metadata>(_ => _
    .Query(Q => Q
        .Term(P => P.Category, (int)Category)
        && Q.Term(P => P.Type, (int)Type))
    .FielddataFields(F => F.Fields(F1 => F1.Tags, F2 => F2.Title))
    .Aggregations(A => A
        .Terms("Tags", T => T
            .Field(F => F.Tags)
            .Size(Limit))));

But with Elasticsearch 5.1, I get a 400 error with this hint:

Fielddata is disabled on text fields by default. Set fielddata=true on [Tags] in order to load fielddata in memory by uninverting the inverted index.

The ES documentation on mapping parameters then says that "It usually doesn’t make sense to do so" and that you should "have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations".

But the only documentation page I can find for this is for 5.0, and the same page for 5.1 does not seem to exist.

5.1 does have a page about the Terms Aggregation that seems to cover what I need, but I can't find anything for C# / NEST that I can use.

So, I'm trying to figure out how to get the top words, across all documents, in C#: from the Tags (where each tag is its own term; for example "New York" is not "New" and "York") and from the Title (where each word is its own term).


I need to edit this post because there seems to be a deeper problem. I wrote some test code that illustrates the issue:

Let's create a simple object:

public class MyObject
{
    [Keyword]
    public string Id { get; set; }
    [Text]
    public string Category { get; set; }
    [Text(Fielddata = true)]
    public string Keywords { get; set; }
}

create the index:

var Uri = new Uri(Constants.ELASTIC_CONNECTIONSTRING);
var Settings = new ConnectionSettings(Uri)
    .DefaultIndex("test")
    .DefaultFieldNameInferrer(_ => _)
    .InferMappingFor<MyObject>(_ => _.IdProperty(P => P.Id));
var D = new ElasticClient(Settings);

fill the index with random stuff:

for (var i = 0; i < 10; i++)
{
    var O = new MyObject
    {
        Id = i.ToString(),
        Category = (i % 2) == 0 ? "a" : "b",
        Keywords = (i % 3).ToString()
    };

    D.Index(O);
}

and do the query:

var m = D.Search<MyObject>(s => s
    .Query(q => q.Term(P => P.Category, "a"))
    .Source(f => f.Includes(si => si.Fields(ff => ff.Keywords)))
    .Aggregations(a => a
        .Terms("Keywords", t => t
            .Field(f => f.Keywords)
            .Size(Limit)
        )
    )
);

It fails the same way as before, with a 400 and:

Fielddata is disabled on text fields by default. Set fielddata=true on [Keywords] in order to load fielddata in memory by uninverting the inverted index.

Fielddata is set to true on [Keywords], yet it keeps complaining about it.

So, let's get crazy and modify the class this way:

public class MyObject
{
    [Text(Fielddata = true)]
    public string Id { get; set; }
    [Text(Fielddata = true)]
    public string Category { get; set; }
    [Text(Fielddata = true)]
    public string Keywords { get; set; }
}

That way everything is a Text and everything has Fielddata = true. Well, same result.

So, either I am misunderstanding something simple, or it's broken or not documented :)

Answer 1:

You rarely want Fielddata. For this particular search, where you want to return just the Tags and Title fields from the search query, take a look at Source Filtering instead:

var Match = client.Search<Metadata>(s => s
    .Query(q => q
        .Term(P => P.Category, (int)Category) && q
        .Term(P => P.Type, (int)Type)
    )
    .Source(f => f
        .Includes(si => si
            .Fields(
                ff => ff.Tags, 
                ff => ff.Title
            )
        )
    )
    .Aggregations(a => a
        .Terms("Tags", t => t
            .Field(f => f.Tags)
            .Size(Limit)
        )
    )
);

Fielddata needs to uninvert the inverted index into an in memory structure for aggregations and sorting. Whilst accessing this data can be very fast, it can also consume a lot of memory for a large data set.

EDIT:

In your edit, I don't see where you create the index and explicitly map your MyObject POCO. Without that step, Elasticsearch automatically creates the index when it receives the first JSON document and infers the mapping for MyObject from it. Keywords is then mapped as a text field with a keyword multi_field, and Fielddata is not enabled on the text field mapping, regardless of the attribute on the POCO.

Here's an example that demonstrates it all working:

void Main()
{
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    var defaultIndex = "test";
    var connectionSettings = new ConnectionSettings(pool)
            .DefaultIndex(defaultIndex)
            .DefaultFieldNameInferrer(s => s)
            .InferMappingFor<MyObject>(m => m
                .IdProperty(p => p.Id)
            );

    var client = new ElasticClient(connectionSettings);

    if (client.IndexExists(defaultIndex).Exists)
        client.DeleteIndex(defaultIndex);

    client.CreateIndex(defaultIndex, c => c
        .Mappings(m => m
            .Map<MyObject>(mm => mm
                .AutoMap()
            )
        )
    );

    var objs = Enumerable.Range(0, 10).Select(i =>
        new MyObject
        {
            Id = i.ToString(),
            Category = (i % 2) == 0 ? "a" : "b",
            Keywords = (i % 3).ToString()
        });

    client.IndexMany(objs);

    client.Refresh(defaultIndex);

    var searchResponse = client.Search<MyObject>(s => s
        .Query(q => q.Term(P => P.Category, "a"))
        .Source(f => f.Includes(si => si.Fields(ff => ff.Keywords)))
        .Aggregations(a => a
            .Terms("Keywords", t => t
                .Field(f => f.Keywords)
                .Size(10)
            )
        )
    );

}

public class MyObject
{
    [Keyword]
    public string Id { get; set; }
    [Text]
    public string Category { get; set; }
    [Text(Fielddata = true)]
    public string Keywords { get; set; }
}

This returns:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 0.9808292,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "myobject",
        "_id" : "8",
        "_score" : 0.9808292,
        "_source" : {
          "Keywords" : "2"
        }
      },
      {
        "_index" : "test",
        "_type" : "myobject",
        "_id" : "0",
        "_score" : 0.2876821,
        "_source" : {
          "Keywords" : "0"
        }
      },
      {
        "_index" : "test",
        "_type" : "myobject",
        "_id" : "2",
        "_score" : 0.13353139,
        "_source" : {
          "Keywords" : "2"
        }
      },
      {
        "_index" : "test",
        "_type" : "myobject",
        "_id" : "4",
        "_score" : 0.13353139,
        "_source" : {
          "Keywords" : "1"
        }
      },
      {
        "_index" : "test",
        "_type" : "myobject",
        "_id" : "6",
        "_score" : 0.13353139,
        "_source" : {
          "Keywords" : "0"
        }
      }
    ]
  },
  "aggregations" : {
    "Keywords" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "0",
          "doc_count" : 2
        },
        {
          "key" : "2",
          "doc_count" : 2
        },
        {
          "key" : "1",
          "doc_count" : 1
        }
      ]
    }
  }
}
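
To consume the buckets in C# rather than reading the raw JSON, something along these lines may help. It is only a sketch: the TopTermsPrinter class and its Print method are hypothetical names, and it assumes the NEST 5.x response helpers (Aggs.Terms, with Key and DocCount on each bucket) and a reachable cluster that produced the response.

```csharp
using System;
using Nest;

public static class TopTermsPrinter
{
    // Hypothetical helper: writes "term: count" for every bucket of the
    // terms aggregation registered under aggregationName.
    public static void Print<T>(ISearchResponse<T> response, string aggregationName)
        where T : class
    {
        if (!response.IsValid)
        {
            // A 400 such as the Fielddata error ends up here.
            Console.WriteLine("Search failed: " + response.DebugInformation);
            return;
        }

        var terms = response.Aggs.Terms(aggregationName);
        if (terms == null)
        {
            Console.WriteLine("No terms aggregation named " + aggregationName);
            return;
        }

        foreach (var bucket in terms.Buckets)
        {
            Console.WriteLine($"{bucket.Key}: {bucket.DocCount}");
        }
    }
}
```

With the response from the example above, this would be called as TopTermsPrinter.Print(searchResponse, "Keywords").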

You might also consider mapping Keywords as a text field with a keyword multi_field, using the text field for unstructured search and the keyword sub-field for sorting, aggregations and structured search. This way, you get the best of both worlds and don't need to enable Fielddata:

client.CreateIndex(defaultIndex, c => c
    .Mappings(m => m
        .Map<MyObject>(mm => mm
            .AutoMap()
            .Properties(p => p
                .Text(t => t
                    .Name(n => n.Keywords)
                    .Fields(f => f
                        .Keyword(k => k
                            .Name("keyword")
                        )
                    )
                )
            )
        )
    )
);

Then, in the search, aggregate on the keyword sub-field:

var searchResponse = client.Search<MyObject>(s => s
    .Query(q => q.Term(P => P.Category, "a"))
    .Source(f => f.Includes(si => si.Fields(ff => ff.Keywords)))
    .Aggregations(a => a
        .Terms("Keywords", t => t
            .Field(f => f.Keywords.Suffix("keyword"))
            .Size(10)
        )
    )
);
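
Applied to the Metadata type from the question, the same idea might look like the sketch below. Several things here are assumptions rather than anything confirmed in the thread: Category and Type are taken to be int properties (based on the casts in the original query), TagIndex, Create and TopTags are hypothetical names, and Size(0) is added only to skip the hits when just the buckets are wanted. Since Tags is already a [Keyword] field, it can be aggregated on directly once the index has been created with an explicit mapping, before the first document is indexed.

```csharp
using System.Collections.Generic;
using Nest;

public class Metadata
{
    // Assumed to be ints, based on the casts in the question's query.
    public int Category { get; set; }
    public int Type { get; set; }

    [Keyword]
    public List<string> Tags { get; set; }

    [Text]
    public string Title { get; set; }
}

public static class TagIndex
{
    // Creates the index with the attribute-based mapping, so that Tags is
    // a keyword field instead of a dynamically inferred text field.
    public static ICreateIndexResponse Create(IElasticClient client, string indexName)
    {
        return client.CreateIndex(indexName, c => c
            .Mappings(m => m
                .Map<Metadata>(mm => mm.AutoMap())
            )
        );
    }

    // Returns the most frequent tags among documents matching both terms.
    public static ISearchResponse<Metadata> TopTags(
        IElasticClient client, string indexName, int category, int type, int limit)
    {
        return client.Search<Metadata>(s => s
            .Index(indexName)
            .Size(0) // assumption: only the aggregation is needed, not the hits
            .Query(q => q.Term(p => p.Category, category)
                     && q.Term(p => p.Type, type))
            .Aggregations(a => a
                .Terms("Tags", t => t
                    .Field(f => f.Tags)
                    .Size(limit)
                )
            )
        );
    }
}
```

Because each tag is indexed as a single keyword term, a tag such as "New York" stays one bucket rather than being split into "New" and "York".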