Kubernetes 1.7 on Google Cloud: FailedSync Errors

Published 2019-07-13 11:47

Question:

My Kubernetes pods and containers are not starting. They are stuck with the status ContainerCreating.

I ran the command kubectl describe po PODNAME, which lists the pod's events, and I see the following errors:

Type        Reason            Message
Warning     FailedSync        Error syncing pod
Normal      SandboxChanged    Pod sandbox changed, it will be killed and re-created.
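
With the pod name and namespace taken from the full output below, the command is:

    kubectl describe pod ocr-extra-2939512459-3hkv1 -n ocr-da-cluster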

The Count column indicates that these errors are being repeated over and over, roughly once a second. The full output from this command is below, but how do I go about debugging this? I'm not even sure what these errors mean.

Name:           ocr-extra-2939512459-3hkv1
Namespace:      ocr-da-cluster
Node:           gke-da-ocr-api-gce-cluster-extra-pool-65029b63-6qs2/10.240.0.11
Start Time:     Tue, 24 Oct 2017 21:05:01 -0400
Labels:         component=ocr
                pod-template-hash=2939512459
                role=extra
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"ocr-da-cluster","name":"ocr-extra-2939512459","uid":"d58bd050-b8f3-11e7-9f9e-4201...
Status:         Pending
IP:
Created By:     ReplicaSet/ocr-extra-2939512459
Controlled By:  ReplicaSet/ocr-extra-2939512459
Containers:
  ocr-node:
    Container ID:
    Image:              us.gcr.io/ocr-api/ocr-image
    Image ID:
    Ports:              80/TCP, 443/TCP, 5555/TCP, 15672/TCP, 25672/TCP, 4369/TCP, 11211/TCP
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Requests:
      cpu:      31
      memory:   10Gi
    Liveness:   http-get http://:http/ocr/live delay=270s timeout=30s period=60s #success=1 #failure=5
    Readiness:  http-get http://:http/_ah/warmup delay=180s timeout=60s period=120s #success=1 #failure=3
    Environment:
      NAMESPACE:        ocr-da-cluster (v1:metadata.namespace)
    Mounts:
      /var/log/apache2 from apachelog (rw)
      /var/log/celery from cellog (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dhjr5 (ro)
  log-apache2-error:
    Container ID:
    Image:              busybox
    Image ID:
    Port:               <none>
    Args:
      /bin/sh
      -c
      echo Apache2 Error && sleep 90 && tail -n+1 -F /var/log/apache2/error.log
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Requests:
      cpu:              20m
    Environment:        <none>
    Mounts:
      /var/log/apache2 from apachelog (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dhjr5 (ro)
  log-worker-1:
    Container ID:
    Image:              busybox
    Image ID:
    Port:               <none>
    Args:
      /bin/sh
      -c
      echo Celery Worker && sleep 90 && tail -n+1 -F /var/log/celery/worker*.log
    State:              Waiting
      Reason:           ContainerCreating
    Ready:              False
    Restart Count:      0
    Requests:
      cpu:              20m
    Environment:        <none>
    Mounts:
      /var/log/celery from cellog (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-dhjr5 (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         False
  PodScheduled  True
Volumes:
  apachelog:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  cellog:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-dhjr5:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-dhjr5
    Optional:   false
QoS Class:      Burstable
Node-Selectors: beta.kubernetes.io/instance-type=n1-highcpu-32
Tolerations:    node.alpha.kubernetes.io/notReady:NoExecute for 300s
                node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  FirstSeen     LastSeen        Count   From                                                        SubObjectPath       Type            Reason                  Message
  ---------     --------        -----   ----                                                        -------------       --------        ------                  -------
  10m           10m             2       default-scheduler                                                       Warning         FailedScheduling        No nodes are available that match all of the following predicates:: Insufficient cpu (10), Insufficient memory (2), MatchNodeSelector (2).
  10m           10m             1       default-scheduler                                                       Normal          Scheduled               Successfully assigned ocr-extra-2939512459-3hkv1 to gke-da-ocr-api-gce-cluster-extra-pool-65029b63-6qs2
  10m           10m             1       kubelet, gke-da-ocr-api-gce-cluster-extra-pool-65029b63-6qs2                    Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "apachelog"
  10m           10m             1       kubelet, gke-da-ocr-api-gce-cluster-extra-pool-65029b63-6qs2                    Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "cellog"
  10m           10m             1       kubelet, gke-da-ocr-api-gce-cluster-extra-pool-65029b63-6qs2                    Normal          SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "default-token-dhjr5"
  10m           1s              382     kubelet, gke-da-ocr-api-gce-cluster-extra-pool-65029b63-6qs2                    Warning         FailedSync              Error syncing pod
  10m           0s              382     kubelet, gke-da-ocr-api-gce-cluster-extra-pool-65029b63-6qs2                    Normal          SandboxChanged          Pod sandbox changed, it will be killed and re-created.

Answer 1:

Check your resource limits. I faced the same issue, and the cause in my case was that I had used m instead of Mi for the memory limits and memory requests.
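
A minimal sketch of the difference, using illustrative values rather than the ones from the pod above: the m suffix means millicores and is only meaningful for CPU, while memory quantities should use binary suffixes such as Mi or Gi (Kubernetes will happily parse a memory value of "512m" as 0.512 bytes).

    resources:
      requests:
        cpu: "500m"      # 500 millicores = 0.5 CPU core
        memory: "512Mi"  # 512 mebibytes, not "512m"
      limits:
        cpu: "1"
        memory: "1Gi"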



Answer 2:

Are you sure you need 31 CPUs as the initial request for the ocr-node container?
That requires a very large node.
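
To check whether a node can actually satisfy that request, compare it against what the node reports as allocatable (node name taken from the describe output in the question):

    kubectl describe node gke-da-ocr-api-gce-cluster-extra-pool-65029b63-6qs2
    # Look at the Capacity, Allocatable and "Allocated resources" sections.

    kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory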

I'm seeing similar issues with some of my pods. Deleting them and letting them be recreated sometimes helps, but not consistently. I'm sure there are enough resources available.
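
Since the pod is controlled by a ReplicaSet, deleting it simply forces a replacement to be scheduled. A sketch of that, using the pod name and namespace from the question:

    kubectl delete pod ocr-extra-2939512459-3hkv1 -n ocr-da-cluster
    # The ReplicaSet (ocr-extra-2939512459) creates a new pod automatically.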

See Kubernetes pods failing on "Pod sandbox changed, it will be killed and re-created"