Logstash + Kibana terms panel without breaking wor

Posted 2019-03-29 21:03

Question:

I have a Java application that writes to a log file in JSON format. The fields in each log entry vary. Logstash reads this log file and sends the events to Elasticsearch, where I view them in Kibana.

I've configured Logstash with the following file:

input {
        file {
                path => ["[log_path]"]
                codec => "json"
        }
}

filter{
        json {
                source => "message"
        }

        date {
                match => [ "data", "dd-MM-yyyy HH:mm:ss.SSS" ]
                timezone => "America/Sao_Paulo"
        }
}

output {
        elasticsearch_http {
                flush_size => 1
                host => "[host]"
                index => "application-%{+YYYY.MM.dd}"
        }
}

I've managed to display everything correctly in Kibana without any mapping. But when I try to create a terms panel that counts the servers that sent those messages, I run into a problem. My JSON has a field called server that holds the server's name (like: a1-name-server1), but the terms panel splits the server name at each "-". I would also like to count how many times an error message appears, but the same problem occurs: the terms panel splits the error message at the spaces.

I'm using Kibana 3 and Logstash 1.4. I've searched the web a lot and couldn't find a solution. I also tried using the .raw field from Logstash, but it didn't work.

How can I manage this?

Thanks for the help.

Answer 1:

Your problem here is that your data is being tokenized. This is what makes full-text search over your data possible. By default, ES splits the content of a string field into separate tokens so that each one can be searched. For example, you may want to search for the word ERROR in your logs and get back messages like "There was an error in your cluster" or "Error processing whatever". If the field is not analyzed with a tokenizer, you won't be able to search like this.

This analyzed behaviour is helpful when you want to search, but it doesn't let you group messages that have the same content, which is your use case. The solution is to update your mapping and set not_analyzed on each field that you don't want split into tokens. This will probably work for your server field, but on a field such as the error message it will probably break word-level search.
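To make the difference concrete, here is a minimal Python sketch that imitates how a terms count behaves on an analyzed field versus a not_analyzed one. The `rough_analyze` function is only a crude stand-in (lowercase, split on non-alphanumeric characters), not the real Elasticsearch analyzer, and the second server name is invented for illustration.

```python
import re
from collections import Counter


def rough_analyze(value):
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", value.lower()) if token]


def terms_count(values, analyzed):
    """Count documents per term, the way a terms panel groups values."""
    counts = Counter()
    for value in values:
        # Analyzed: every distinct token of the value becomes its own term.
        # Not analyzed: the whole value is kept as a single term.
        counts.update(set(rough_analyze(value)) if analyzed else [value])
    return counts


# "a1-name-server1" comes from the question; "a1-name-server2" is made up.
servers = ["a1-name-server1", "a1-name-server1", "a1-name-server2"]

analyzed_counts = terms_count(servers, analyzed=True)
exact_counts = terms_count(servers, analyzed=False)

print(analyzed_counts.most_common())
print(exact_counts.most_common())
```

With the analyzed variant the panel ends up counting fragments such as a1 and name, while the not_analyzed variant counts each full server name once per document.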

What I usually do in this kind of situation is use index templates and multifields. An index template lets me set a mapping for every index whose name matches a wildcard pattern, and a multifield lets me have both the analyzed and the not_analyzed behaviour on the same field.
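As a rough illustration of how a template pattern picks up indices, the sketch below checks a few index names against a pattern with Python's `fnmatchcase`. It assumes the daily naming from the question's Logstash output (application-%{+YYYY.MM.dd}); the dates and the `other-` index are invented, and `fnmatchcase` is only an approximation of Elasticsearch's own `*` wildcard matching.

```python
from datetime import date
from fnmatch import fnmatchcase


def daily_index(day):
    """Build an index name in the application-YYYY.MM.dd style."""
    return "application-" + day.strftime("%Y.%m.%d")


# Hypothetical template pattern covering every daily index.
pattern = "application-*"

# Example index names; the dates are arbitrary.
names = [
    daily_index(date(2014, 5, 1)),
    daily_index(date(2014, 5, 2)),
    "other-2014.05.01",
]

# Only indices whose name matches the pattern would receive the template's mapping.
matched = [name for name in names if fnmatchcase(name, pattern)]
print(matched)
```

A template only affects indices created after it is stored, so with daily indices the new mapping would first show up in the next day's index.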

The following request would do the job for your problem:

curl -XPUT https://example.org/_template/name_of_index_template -d '
{
    "template": "indexname*",
    "mappings": {
        "type": {
            "properties": {
                "field_name": {
                    "type": "multi_field",
                    "fields": {
                        "field_name": {
                            "type": "string",
                            "index": "analyzed"
                        },
                        "untouched": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}'

Then, in your terms panel, you can use field_name.untouched so that the entire content of the field is considered when the distinct values are counted.
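For reference, this is roughly the kind of request body a terms count over the untouched subfield corresponds to, sketched in Python. It assumes Elasticsearch 1.x, where the facets API is available; the facet name `by_term` and the field name `server.untouched` are illustrative choices, not something taken from Kibana itself.

```python
import json


def terms_facet_body(field, size=10):
    """Build a search body that counts documents per distinct value of `field`."""
    return {
        "size": 0,  # no hits needed, only the facet counts
        "query": {"match_all": {}},
        "facets": {
            # "by_term" is an arbitrary facet name chosen for this sketch.
            "by_term": {"terms": {"field": field, "size": size}}
        },
    }


# Assumes the multifield mapping above was applied to a field named "server".
body = terms_facet_body("server.untouched")
print(json.dumps(body, indent=2))
```

Pointing the same body at the plain server field instead would return counts for the individual tokens.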

If you don't want to use index templates (maybe your data is in a single index), setting the mapping with the Put Mapping API will do the job too. If you use multifields, there is no need to reindex for this to start working: from the moment you set the new mapping on the index, newly indexed data is stored in both subfields (field_name and field_name.untouched). If you just change the mapping from analyzed to not_analyzed, you won't see any change until you reindex all your data.
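A minimal sketch of the Put Mapping variant, written in Python with only the standard library. It builds the request but does not send it. The base URL, the index name, the type name `logs`, and the field name `error_message` are all assumptions made for illustration; only `server` comes from the question.

```python
import json
import urllib.request


def multi_field_mapping(type_name, field_names):
    """Build a mapping body that gives each field an analyzed and an untouched variant."""
    properties = {}
    for name in field_names:
        properties[name] = {
            "type": "multi_field",
            "fields": {
                name: {"type": "string", "index": "analyzed"},
                "untouched": {"type": "string", "index": "not_analyzed"},
            },
        }
    return {type_name: {"properties": properties}}


def put_mapping_request(base_url, index, type_name, field_names):
    """Prepare (but do not send) a PUT request for the mapping endpoint."""
    body = json.dumps(multi_field_mapping(type_name, field_names)).encode("utf-8")
    url = "{}/{}/_mapping/{}".format(base_url.rstrip("/"), index, type_name)
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host, index, type and second field name.
request = put_mapping_request(
    "http://localhost:9200", "application-2014.05.01", "logs", ["server", "error_message"]
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would actually send it to the cluster.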



Answer 2:

Since you didn't define a mapping in Elasticsearch, the default settings apply to every field of your type in your index. The default for string fields (like your server field) is to analyze the field, meaning that Elasticsearch tokenizes the field's contents. That is why it's splitting your server names into parts.

You can overcome this issue by defining a mapping. You don't have to define all your fields, only the ones that you don't want Elasticsearch to analyze. In your particular case, sending the following PUT request will do the trick:

PUT http://[host]:9200/[index_name]/_mapping/[type]

{
    "type" : {
        "properties" : {
            "server" : {"type" : "string", "index" : "not_analyzed"}
        }
    }
}

You can't do this on an already existing index because switching from analyzed to not_analyzed is a major change in the mapping.