Log flood after master upgrade from 1.6.13-gke.0 to 1.7.11-gke.1


Question:

We have a GKE cluster with:

  • master nodes with version 1.6.13-gke.0
  • 2 node pools with version 1.6.11-gke.0

We have Stackdriver Monitoring and Logging activated.

On 2018-01-22, the masters were upgraded by Google to version 1.7.11-gke.1.

After this upgrade, we are seeing a lot of errors like these:

I  2018-01-25 11:35:23 +0000 [error]: Exception emitting record: No such file or directory @ sys_fail2 - (/var/log/fluentd-buffers/kubernetes.system.buffer..b5638802e3e04e72f.log, /var/log/fluentd-buffers/kubernetes.system.buffer..q5638802e3e04e72f.log)

I  2018-01-25 11:35:23 +0000 [warn]: emit transaction failed: error_class=Errno::ENOENT error="No such file or directory @ sys_fail2 - (/var/log/fluentd-buffers/kubernetes.system.buffer..b5638802e3e04e72f.log, /var/log/fluentd-buffers/kubernetes.system.buffer..q5638802e3e04e72f.log)" tag="docker"

I    2018-01-25 11:35:23 +0000 [warn]: suppressed same stacktrace

These messages are flooding our logs at roughly 25 GB per day, and are generated by pods managed by a DaemonSet called fluentd-gcp-v2.0.9.
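
For anyone checking where the flood comes from, something like the following shows the pods behind these messages (a sketch only; <fluentd-gcp-pod> is a placeholder for one of the listed pods):

# List the pods managed by the fluentd-gcp-v2.0.9 DaemonSet
kubectl -n kube-system get pods | grep fluentd-gcp

# Sample the log of one of them for the flooding error (replace the placeholder)
kubectl -n kube-system logs <fluentd-gcp-pod> | grep "Exception emitting record" | head -5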

We found that this is a bug that was fixed in 1.8 and backported to 1.7.12.

My questions are:

  1. Should we upgrade the masters to version 1.7.12? Is it safe to do so? OR
  2. Is there any other alternative to test before upgrading?

Thanks in advance.

Answer 1:

First of all, the answer to question 2.

As alternatives, we could have:

  • filtered fluentd to ignore logs from the fluentd-gcp pods, OR
  • deactivated Stackdriver Monitoring and Logging (a sketch of both is below)
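
These are rough sketches only; the ConfigMap name, cluster name and zone are placeholders/assumptions, not values taken from the question:

# Alternative 1 (sketch): edit the fluentd configuration used by the DaemonSet and
# add a grep-style exclude for the flooding messages. The ConfigMap name is an
# assumption; check it with `kubectl -n kube-system get configmaps`.
kubectl -n kube-system edit configmap fluentd-gcp-config-v1.1

# Alternative 2 (sketch): disable Stackdriver Logging and Monitoring for the whole
# cluster. The cluster name and zone are placeholders.
gcloud container clusters update my-cluster --zone europe-west1-b --logging-service none
gcloud container clusters update my-cluster --zone europe-west1-b --monitoring-service none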

To answer question 1:

We upgraded to 1.7.12 in a test environment. The process took 3 minutes. During that time, we could not edit the cluster or access it with kubectl (as expected).
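
For reference, a master upgrade like this can be triggered ahead of Google's schedule with something along these lines (a sketch; the cluster name, zone and exact 1.7.12 patch suffix are assumptions):

# Upgrade only the master (control plane) of the cluster to a 1.7.12 release.
# The -gke.N suffix must match a version listed by `gcloud container get-server-config`.
gcloud container clusters upgrade my-cluster --zone europe-west1-b \
    --master --cluster-version 1.7.12-gke.1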

After the upgrade, we deleted all our pods called fluentd-gcp-* and the flood stopped instantly:

# Delete each fluentd-gcp pod; the DaemonSet recreates them automatically.
for pod in $(kubectl -n kube-system get pods | grep fluentd-gcp | awk '{print $1}'); do
    kubectl -n kube-system delete pod "$pod"
    sleep 20
done
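
A quick way to verify the result afterwards (a sketch; <fluentd-gcp-pod> is a placeholder for one of the recreated pods):

# Confirm the recreated pods are Running again
kubectl -n kube-system get pods | grep fluentd-gcp

# Count occurrences of the flooding error in a new pod's log; it should be 0
kubectl -n kube-system logs <fluentd-gcp-pod> | grep -c "Exception emitting record"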