“Failed create pod sandbox” pod error in AWS Kuber

2019-07-15 06:36发布

问题:

Summary of the issue We have observed on several occasions that our cluster runs in a problem where one or more pods on one or more nodes are not starting (container or containers within the pod are not starting). The pod shows "Failed create pod sandbox" error. Restarting docker or kubelet on the "affected" nodes does no fix the problem. Also, terminating and recreating affected EC2 instances does not solve the issue. If a pod (both ones having failed to start and "healthy" ones) are rescheduled onto other nodes, they fail to start with the same error. So, it seems that a complete cluster is broken as soon as at least one pod fails to start with the "Failed create pod sandbox" error. We have found no way to fix the cluster - only complete recreation of the cluster helps getting our application up and running.

Environment * Kubernetes 1.9.3 * Cluster created using kops (v 1.8) running on AWS (private network topology) * Network: weave-net * Cluster: 1 master, 3 nodes

Incident instance timeline

  • We have executed rolling cluster update with kops to start nodes and masters with a new AMI we have built (based on kops ami k8s-1.8-debian-jessie-amd64-hvm-ebs-2017-11-27). Adjusting kops AMI was never a problem - we apply Linux updates and a few minor changes unrelated to Kubernetes.
  • Noticed in the Kubernetes dashboard that one of the pods (broker pod) has failed to start with the following error: 0/4 nodes are available: 1 NodeUnschedulable, 1 PodToleratesNodeTaints, 3 NoVolumeZoneConflich. Failed create pod sandbox.
  • Another pod created by a daemon set (fluentd pod) scheduled on the same node as the above one had slightly different error: network is not ready:[runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready:cni clnfig uninitialized] *The affected node is healthy according to Kubernetes. Looking with kubectl describe nodes, the affected node has more than enough resources to run pods
  • Looking at the kubelet log we observed the following for the broker pod:

    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.364224    1346 kubelet.go:1835] SyncLoop (ADD, "api"): "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)"
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512797    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "authservice-ca" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-authservice-ca") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512862    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "broker-internal-cert" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-internal-cert") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512934    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "ssl-certs" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-ssl-certs") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512958    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.512980    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "broker-prometheus-config" (UniqueName: "kubernetes.io/configmap/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-prometheus-config") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.513002    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-vrhqr" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-default-token-vrhqr") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: E0316 08:29:54.513117    1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:29:55.013051677 +0000 UTC m=+33.865177055 (durationBeforeRetry 500ms). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613329    1346 reconciler.go:262] operationExecutor.MountVolume started for volume "broker-prometheus-config" (UniqueName: "kubernetes.io/configmap/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-prometheus-config") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613544    1346 reconciler.go:262] operationExecutor.MountVolume started for volume "default-token-vrhqr" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-default-token-vrhqr") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613614    1346 reconciler.go:262] operationExecutor.MountVolume started for volume "authservice-ca" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-authservice-ca") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613660    1346 reconciler.go:262] operationExecutor.MountVolume started for volume "broker-internal-cert" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-internal-cert") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.613698    1346 reconciler.go:262] operationExecutor.MountVolume started for volume "ssl-certs" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-ssl-certs") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.616720    1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "broker-prometheus-config" (UniqueName: "kubernetes.io/configmap/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-prometheus-config") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.621948    1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "ssl-certs" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-ssl-certs") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.624467    1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "authservice-ca" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-authservice-ca") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.626039    1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "broker-internal-cert" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-broker-internal-cert") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:54 ip-172-20-85-48 kubelet[1346]: I0316 08:29:54.626604    1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "default-token-vrhqr" (UniqueName: "kubernetes.io/secret/8ead64a3-28f3-11e8-b520-025c267c6ea8-default-token-vrhqr") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:55 ip-172-20-85-48 kubelet[1346]: I0316 08:29:55.014972    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:55 ip-172-20-85-48 kubelet[1346]: E0316 08:29:55.015086    1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:29:56.015048945 +0000 UTC m=+34.867174315 (durationBeforeRetry 1s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
    Mar 16 08:29:56 ip-172-20-85-48 kubelet[1346]: I0316 08:29:56.017902    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:56 ip-172-20-85-48 kubelet[1346]: E0316 08:29:56.018024    1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:29:58.017982038 +0000 UTC m=+36.870107444 (durationBeforeRetry 2s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
    Mar 16 08:29:58 ip-172-20-85-48 kubelet[1346]: I0316 08:29:58.023714    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:29:58 ip-172-20-85-48 kubelet[1346]: E0316 08:29:58.023871    1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:30:02.023814124 +0000 UTC m=+40.875939520 (durationBeforeRetry 4s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
    Mar 16 08:30:02 ip-172-20-85-48 kubelet[1346]: I0316 08:30:02.033951    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:30:02 ip-172-20-85-48 kubelet[1346]: E0316 08:30:02.034045    1346 nestedpendingoperations.go:263] Operation for "\"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\"" failed. No retries permitted until 2018-03-16 08:30:10.034017896 +0000 UTC m=+48.886143256 (durationBeforeRetry 8s). Error: "Volume has not been added to the list of VolumesInUse in the node's volume status for volume \"pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa\" (UniqueName: \"kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280\") pod \"broker-0\" (UID: \"8ead64a3-28f3-11e8-b520-025c267c6ea8\") "
    Mar 16 08:30:10 ip-172-20-85-48 kubelet[1346]: I0316 08:30:10.055813    1346 reconciler.go:217] operationExecutor.VerifyControllerAttachedVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:30:10 ip-172-20-85-48 kubelet[1346]: I0316 08:30:10.060992    1346 operation_generator.go:1111] Controller attach succeeded for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8") device path: "/dev/xvdcr"
    Mar 16 08:30:10 ip-172-20-85-48 kubelet[1346]: I0316 08:30:10.156111    1346 reconciler.go:262] operationExecutor.MountVolume started for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:30:10 ip-172-20-85-48 kubelet[1346]: I0316 08:30:10.156188    1346 operation_generator.go:446] MountVolume.WaitForAttach entering for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8") DevicePath "/dev/xvdcr"
    Mar 16 08:30:11 ip-172-20-85-48 kubelet[1346]: I0316 08:30:11.156560    1346 operation_generator.go:455] MountVolume.WaitForAttach succeeded for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8") DevicePath "/dev/xvdcr"
    Mar 16 08:30:12 ip-172-20-85-48 kubelet[1346]: I0316 08:30:12.128286    1346 operation_generator.go:484] MountVolume.MountDevice succeeded for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8") device mount path "/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/eu-central-1b/vol-04145a1c9d1a26280"
    Mar 16 08:30:12 ip-172-20-85-48 kubelet[1346]: I0316 08:30:12.370928    1346 operation_generator.go:522] MountVolume.SetUp succeeded for volume "pvc-b673d6da-26e3-11e8-aa99-02cd3728faaa" (UniqueName: "kubernetes.io/aws-ebs/aws://eu-central-1b/vol-04145a1c9d1a26280") pod "broker-0" (UID: "8ead64a3-28f3-11e8-b520-025c267c6ea8")
    Mar 16 08:30:12 ip-172-20-85-48 kubelet[1346]: I0316 08:30:12.672408    1346 kuberuntime_manager.go:385] No sandbox for pod "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" can be found. Need to start a new one
    Mar 16 08:30:15 ip-172-20-85-48 kubelet[1346]: I0316 08:30:15.337311    1346 kubelet.go:1880] SyncLoop (PLEG): "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)", event: &pleg.PodLifecycleEvent{ID:"8ead64a3-28f3-11e8-b520-025c267c6ea8", Type:"ContainerStarted", Data:"b08ea5b45ce3ba467856952ad6cc095f4b796673d7dfbf3b9c4029b6b1a75a1b"}
    Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.672940    1346 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.672956    1346 kuberuntime_manager.go:647] createPodSandbox for pod "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.673020    1346 pod_workers.go:186] Error syncing pod 8ead64a3-28f3-11e8-b520-025c267c6ea8 ("broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)"), skipping: failed to "CreatePodSandbox" for "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" with CreatePodSandboxError: "CreatePodSandbox for pod \"broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
    Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.673020    1346 pod_workers.go:186] Error syncing pod 8ead64a3-28f3-11e8-b520-025c267c6ea8 ("broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)"), skipping: failed to "CreatePodSandbox" for "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" with CreatePodSandboxError: "CreatePodSandbox for pod \"broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
    Mar 16 08:34:12 ip-172-20-85-48 kubelet[1346]: E0316 08:34:12.673020    1346 pod_workers.go:186] Error syncing pod 8ead64a3-28f3-11e8-b520-025c267c6ea8 ("broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)"), skipping: failed to "CreatePodSandbox" for "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" with CreatePodSandboxError: "CreatePodSandbox for pod \"broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)\" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
    Mar 16 08:34:13 ip-172-20-85-48 kubelet[1346]: I0316 08:34:13.301580    1346 kuberuntime_manager.go:416] Sandbox for pod "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)" has no IP address.  Need to start a new one
    Mar 16 08:34:14 ip-172-20-85-48 kubelet[1346]: I0316 08:34:14.005589    1346 kubelet.go:1880] SyncLoop (PLEG): "broker-0(8ead64a3-28f3-11e8-b520-025c267c6ea8)", event: &pleg.PodLifecycleEvent{ID:"8ead64a3-28f3-11e8-b520-025c267c6ea8", Type:"ContainerDied", Data:"b08ea5b45ce3ba467856952ad6cc095f4b796673d7dfbf3b9c4029b6b1a75a1b"}
    
  • Reproduced the problem with another pod (adapter-mqtt-vertx) by forcing it to be rescheduled on the "affected node" AFTER docker daemon and kubelet restart, produces similar result

Does anyone has an any idea what the problem here is and what would be a remedy?

回答1:

[runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready:cni clnfig uninitialized]

There are a lot of bits of information in that post, but pod sandbox woes are, to the very best of my knowledge, always SDN and/or CNI misconfigurations. I would ensure /etc/cni/net.d, /opt/cni/bin, and associated machinery are all cool. Regrettably, I have had very little experience debugging CNI stuff, as usually my problems have been a simple oversight of forgetting (or having the wrong permissions on) one of those directories. I for sure don't have anything constructive to offer for debugging Weave :-(

Hopefully your situation will also be a "simple" problem, or I hope someone else is able to weigh in who has more CNI or Weave debugging skills than I do