Occasionally, I see an issue where a pod will start up without network connectivity. Because of this, the pod goes into a CrashLoopBackOff and is unable to recover. The only way I am able to get the pod running again is by running a kubectl delete pod
and waiting for it to reschedule. Here's an example of a liveness probe failing due to this issue:
Liveness probe failed: Get http://172.20.78.9:9411/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
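For reference, the workaround I use today is just to delete the pod and let the Deployment reschedule it; roughly this (the pod name below is a placeholder):

# find the crash-looping pod
kubectl get pods | grep CrashLoopBackOff

# delete it so the Deployment/ReplicaSet schedules a replacement
kubectl delete pod my-pod-1234567890-abcde

# watch the replacement come up
kubectl get pods -w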
I've also noticed that there are no iptables entries for the pod IP when this happens. When the pod is deleted and rescheduled (and is in a working state) I have the iptables entries.
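For what it's worth, this is how I look for the iptables entries on the node hosting the pod (using the pod IP from the probe message above):

# dump the current rules and search for the pod IP
sudo iptables-save | grep 172.20.78.9
# a working pod shows up here (typically KUBE-* NAT rules from kube-proxy); a broken pod returns nothing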
If I turn off the liveness probe and exec into the container, I can confirm that it has no network connectivity to the cluster, the local network, or the internet.
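Concretely, with the liveness probe disabled so the pod stays up, the checks from inside the pod look like this (pod name and addresses are placeholders):

kubectl exec -it my-pod-1234567890-abcde -- sh

# inside the pod, all of these fail on an affected pod:
ping -c 3 <another-pod-ip>   # cluster connectivity
ping -c 3 <node-eth0-ip>     # local network
ping -c 3 8.8.8.8            # internet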
I would like to hear any suggestions as to what the cause could be, or what else I can look into to troubleshoot this scenario further.
Currently running:
Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.7",
GitCommit:"92b4f971662de9d8770f8dcd2ee01ec226a6f6c0",
GitTreeState:"clean", BuildDate:"2016-12-10T04:49:33Z",
GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.7",
GitCommit:"92b4f971662de9d8770f8dcd2ee01ec226a6f6c0",
GitTreeState:"clean", BuildDate:"2016-12-10T04:43:42Z",
GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
OS:
NAME=CoreOS
ID=coreos
VERSION=1235.0.0
VERSION_ID=1235.0.0
BUILD_ID=2016-11-17-0416
PRETTY_NAME="CoreOS 1235.0.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
In response to freehan (https://stackoverflow.com/users/7577983/freehan)
We are using the default network plugin, which, as you pointed out, is the native Docker one.
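For what it's worth, this is how I confirmed that (checking the kubelet command line on one of the nodes):

# no --network-plugin flag on the kubelet means the native Docker network is in use
ps aux | grep [k]ubelet | tr ' ' '\n' | grep network-plugin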
Regarding the suggestion to use tcpdump to capture the packet's path: do you know an easy way to determine which veth is associated with a given pod?
I plan on running a container that has tcpdump installed and watching the traffic on the veth associated with the problem pod while initiating outbound network traffic from the pod (e.g. ping, dig, curl, or whatever is available in the given pod).
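The way I'm planning to map a pod to its host-side veth is roughly this (it assumes the pod image has a shell and basic utilities, and that /sys is readable; names and numbers are examples):

# inside the pod: eth0's peer interface index points at the host-side veth
kubectl exec my-pod-1234567890-abcde -- cat /sys/class/net/eth0/iflink
# e.g. prints 42

# on the node: find the interface with that index
ip link | grep '^42:'
# e.g. 42: veth1a2b3c4: ...

# then capture on that veth while generating traffic from the pod
sudo tcpdump -ni veth1a2b3c4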
Let me know if you had something else in mind and I will try that.
Looks like your network driver is not working properly. Without more information about your setup, I can only suggest the following: check the kubelet's --network-plugin flag. If no network plugin is specified, then it is using the native Docker network.

I don't have enough points to comment, so this answer is in response to Prashanth B (https://stackoverflow.com/users/5446771/prashanth-b).
Let me describe "without network connectivity" in more detail. When I exec into one of the pods suffering from the originally described symptoms, these are the sorts of network issues I see.
In this example we have a pod that appears to have no network connectivity at all.
First, I ping the routable IP of the physical node (its eth0 interface) from the pod. This works from pods on the same node that are working normally.
Next I try internal and external DNS resolution using ping. I don't expect the pings themselves to succeed, but ping is the only tool available in the container for name resolution, and I can't install anything else because there is no networking.
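For example:

# inside the affected pod: resolve an internal and an external name via ping
ping -c 1 kubernetes.default.svc.cluster.local
ping -c 1 google.com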
From another pod in the same cluster, running on the same physical node as the broken pod, I attempt to connect to a port that is open on the broken pod.
From the physical node I cannot connect to the pod IP on port 80.
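Those two checks look like this (the pod IP and port are placeholders):

# from another pod on the same node (using whatever client the image has)
wget -qO- http://<pod-ip>:80/

# from the physical node itself
curl -m 5 http://<pod-ip>:80/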
I looked through the troubleshooting guide at https://kubernetes.io/docs/user-guide/debugging-services/, but that guide is targeted at diagnosing problems connecting a Kubernetes service to one or more pods. In my scenario the unpredictable behavior happens when a pod is created and is not specific to any service. For example, we see this 1-3 times a week across 3 different clusters spanning dozens of deployments. It is never the same deployment that has the problem, and our only recourse is to delete the pod, after which it gets instantiated correctly.
I have gone through the relevant pieces of the troubleshooting guide and posted them here.
Here we see that the kubelet and kube-proxy are running.
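On our CoreOS nodes that check looks roughly like this (adjust to however kubelet and kube-proxy are run on your nodes, as systemd units or containers):

systemctl status kubelet
ps aux | grep [k]ube-proxy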
I've verified kube-proxy is proxying by hitting other pods on this same node.
A new version of the app was just deployed and I lost the pod I was troubleshooting with. I will prepare some additional commands to run the next time this symptom occurs. I will also generate a high volume of deployment creations, since the number of bad pods we see appears to scale with the volume of new pods being created.
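Something along these lines (the image and counts are just placeholders):

# create a batch of throwaway deployments to drive pod churn
for i in $(seq 1 25); do
  kubectl run churn-test-$i --image=nginx --replicas=2
done

# clean up afterwards
for i in $(seq 1 25); do
  kubectl delete deployment churn-test-$i
done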
I am thinking that we are hitting this bug: https://github.com/coreos/bugs/issues/1785. I've verified that I can reproduce the bug described there on our version of Docker/CoreOS. I will update CoreOS/Docker and verify.