I accidentally drained all the nodes in my Kubernetes cluster (even the master). How can I bring the cluster back? kubectl no longer works:
kubectl get nodes
Result:
The connection to the server 172.16.16.111:6443 was refused - did you specify the right host or port?
Here is the output of systemctl status kubelet on the master node (node1):
● kubelet.service - Kubernetes Kubelet Server
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2020-06-23 21:42:39 UTC; 25min ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Main PID: 15541 (kubelet)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/kubelet.service
└─15541 /usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=172.16.16.111 --hostname-override=node1 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.1 --runtime-cgroups=/systemd/system.slice --cpu-manager-policy=static --kube-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi --system-reserved=cpu=1,memory=2Gi,ephemeral-storage=1Gi --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.330009 15541 kubelet_node_status.go:286] Setting node annotation to enable volume controller attach/detach
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.330201 15541 setters.go:73] Using node IP: "172.16.16.111"
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331475 15541 kubelet_node_status.go:472] Recording NodeHasSufficientMemory event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331494 15541 kubelet_node_status.go:472] Recording NodeHasNoDiskPressure event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331500 15541 kubelet_node_status.go:472] Recording NodeHasSufficientPID event message for node node1
Jun 23 22:08:34 node1 kubelet[15541]: I0623 22:08:34.331661 15541 policy_static.go:244] [cpumanager] static policy: RemoveContainer (container id: 6dd59735cabf973b6d8b2a46a14c0711831daca248e918bfcfe2041420931963)
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.332058 15541 pod_workers.go:191] Error syncing pod 93ff1a9840f77f8b2b924a85815e17fe ("kube-apiserver-node1_kube-system(93ff1a9840f77f8b2b924a85815e17fe)"), skipping: failed to "StartContainer" for "kube-apiserver" with CrashLoopBackOff: "back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-node1_kube-system(93ff1a9840f77f8b2b924a85815e17fe)"
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.427587 15541 kubelet.go:2267] node "node1" not found
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.506152 15541 reflector.go:123] k8s.io/kubernetes/pkg/kubelet/kubelet.go:450: Failed to list *v1.Service: Get https://172.16.16.111:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.16.16.111:6443: connect: connection refused
Jun 23 22:08:34 node1 kubelet[15541]: E0623 22:08:34.527813 15541 kubelet.go:2267] node "node1" not found
I'm using Ubuntu 18.04, and there are 7 compute nodes in my cluster. All of them are drained (accidentally, kind of!). I installed my K8s cluster using Kubespray.
Is there any way to uncordon any of these nodes so that the necessary k8s pods can be scheduled?
Any help would be appreciated.
Update:
I asked a separate question about how to connect to etcd here: Can't connect to the ETCD of Kubernetes
If you have production or 'live' workloads, the safest approach is to provision a new cluster and switch the workloads over gradually.
Kubernetes keeps its state in etcd, so you could potentially connect to etcd and clear the 'drained' state, but you would probably have to look at the source code to see where that happens and where the specific keys and values are stored in etcd.
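For context, draining marks each Node object as unschedulable, and that is the flag `kubectl uncordon` clears once the API server answers again. A minimal sketch, assuming the node names are passed in by hand; `uncordon_commands` is a hypothetical helper that only prints the commands so they can be reviewed before anything runs:

```shell
# Print one "kubectl uncordon" command per node name given as an argument.
# Nothing is executed here; the output is meant to be reviewed first.
uncordon_commands() {
  for node in "$@"; do
    printf 'kubectl uncordon %s\n' "$node"
  done
}
```

When the API server is reachable, the output could be piped to a shell, for example `uncordon_commands node1 node2 | sh`. This does not help while port 6443 still refuses connections.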
The logs you shared show that kube-apiserver cannot start, so it is likely that it tries to connect to etcd at startup and etcd is telling it: "you cannot start on this node because it has been drained".
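The CrashLoopBackOff entry in the kubelet log names the container that keeps failing. A minimal sketch for pulling those names out of the journal, assuming the log line format shown in the question; `crashlooping_containers` is a hypothetical helper name:

```shell
# Read kubelet log lines on stdin and print the unique names of containers
# that the kubelet reports as being in CrashLoopBackOff.
crashlooping_containers() {
  grep 'CrashLoopBackOff' \
    | sed -n 's/.*failed to "StartContainer" for "\([^"]*\)".*/\1/p' \
    | sort -u
}
```

On the master it could be fed with `journalctl -u kubelet --no-pager | crashlooping_containers`. The failing container's own logs are the next place to look, for example with `docker ps -a` and `docker logs` if Docker is the container runtime, which is an assumption about this cluster.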
The typical startup sequence for the masters is something like this:
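One common reading of that sequence, assumed here rather than taken from the answer, is etcd first, then kube-apiserver, then kube-controller-manager and kube-scheduler, since each depends on the one before it. A minimal sketch that reports the first component in that assumed order which is not running; `first_missing_component` is a hypothetical helper, and the caller supplies the list of running component names:

```shell
# Given a space-separated list of running component names, print the first
# control plane component (in the assumed dependency order) that is missing,
# or "none" if all four are present.
first_missing_component() {
  running="$1"
  for c in etcd kube-apiserver kube-controller-manager kube-scheduler; do
    case " $running " in
      *" $c "*) ;;                        # component is running, check the next one
      *) printf '%s\n' "$c"; return 0 ;;  # first gap in the chain
    esac
  done
  printf 'none\n'
}
```

For example, `first_missing_component "kube-scheduler"` prints `etcd`, which suggests checking etcd before spending time on the API server.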
You can also follow a guide on connecting to etcd, for example this one, and see whether you can troubleshoot any further. You could then examine or delete some of the node keys, at your own risk.
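As a rough illustration of what examining a node key could look like: with the default storage prefix, the API server keeps Node objects under `/registry/minions/<node-name>`. The sketch below only prints an etcdctl v3 command instead of running it. The endpoint default and the `CA_FILE`, `CERT_FILE`, and `KEY_FILE` placeholders are assumptions to be replaced with the values from your own etcd setup; `node_key` and `etcd_get_node_command` are hypothetical helper names.

```shell
# Build the etcd key under which a Node object is stored (default prefix).
node_key() {
  printf '/registry/minions/%s\n' "$1"
}

# Print (do not run) an etcdctl v3 command that reads the key for one node.
# ETCD_ENDPOINT, ETCD_CACERT, ETCD_CERT and ETCD_KEY are placeholders that
# must be set to match the actual cluster.
etcd_get_node_command() {
  printf 'ETCDCTL_API=3 etcdctl --endpoints=%s --cacert=%s --cert=%s --key=%s get %s\n' \
    "${ETCD_ENDPOINT:-https://127.0.0.1:2379}" \
    "${ETCD_CACERT:-CA_FILE}" "${ETCD_CERT:-CERT_FILE}" "${ETCD_KEY:-KEY_FILE}" \
    "$(node_key "$1")"
}
```

Calling `etcd_get_node_command node1` prints a command ending in `get /registry/minions/node1`. The stored values are protobuf-encoded, so reading them is considerably easier than editing them by hand, and deleting a key removes the whole Node object rather than just its drained state.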