I am very new to kubernetes and have just got a stock kubernetes v.1.3.5 cluster up on AWS using kube-up. So far, I have been playing around with kubernetes in understanding it's mechanics (nodes, pods, svc and stuff). Based on my initial (or maybe crude) understanding , I had few questions:
1) How does routing to cluster IP work here (i.e in kube-aws) ? I see that the services have IPs in the range 10.0.0.0/16. I did a deployment with rc=3 of stock nginx and then attached a service to it with Node Port exposed. All works great! I can connect to the service from my dev machine. This nginx service has a cluster IP of 10.0.33.71:1321. Now, if I ssh into one of the minions(or nodes or VMS) and do a "telnet 10.0.33.71 1321", it connects as expected. But I am clueless how this works, I couldn't find any routes related to 10.0.0.0/16 in the VPC setup by kubernetes. What exactly happens under the hood here that results in a successful connection for app like telnet? However, If I ssh into the master node and do "telnet 10.0.33.71 1321", it does not connect. Why does it fail to connect from master?
2) There is a cbr0 interface inside each node. Each minion node has cbr0 configured as 10.244.x.0/24 and master has cbr0 as 10.246.0.0/24.
I can ping to any of the 10.244.x.x pods from any of the nodes(including master). But I am not able to ping 10.246.0.1 (cbr0 inside master node) from any of the minion nodes. What could be happening here?
Here's the routes set up by kubernetes in aws. VPC.
Destination Target
172.20.0.0/16 local
0.0.0.0/0 igw-<hex value>
10.244.0.0/24 eni-<hex value> / i-<hex value>
10.244.1.0/24 eni-<hex value> / i-<hex value>
10.244.2.0/24 eni-<hex value> / i-<hex value>
10.244.3.0/24 eni-<hex value> / i-<hex value>
10.244.4.0/24 eni-<hex value> / i-<hex value>
10.246.0.0/24 eni-<hex value> / i-<hex value>
Mark Betz (SRE at Olark) presents Kubernetes networking in three articles:
For a pod, you are looking at:
You find:
- etho0: a "physical network interface"
- docker0/cbr0: a bridge for connecting two ethernet segments no matter their protocol.
veth0
, 1
, 2
: Virtual Network Interface, one per container.
docker0 is the default Gateway of veth0. It is named cbr0 for "custom bridge".
Kubernetes starts containers by sharing the same veth0, which means each container must expose different ports.
- pause: a special container started in "
pause
", to detect SIGTERM sent to a pod, and forward it to the containers.
- node: an host
- cluster: a group of nodes
- router/gateway
The last element is where things start to be more complex:
Kubernetes assigns an overall address space for the bridges on each node, and then assigns the bridges addresses within that space, based on the node the bridge is built on.
Secondly, it adds routing rules to the gateway at 10.100.0.1 telling it how packets destined for each bridge should be routed, i.e. which node’s eth0
the bridge can be reached through.
Such a combination of virtual network interfaces, bridges, and routing rules is usually called an overlay network.
When a pod contacts another pod, it goes through a service.
Why?
Pod networking in a cluster is neat stuff, but by itself it is insufficient to enable the creation of durable systems. That’s because pods in Kubernetes are ephemeral.
You can use a pod IP address as an endpoint but there is no guarantee that the address won’t change the next time the pod is recreated, which might happen for any number of reasons.
That means: you need a reverse-proxy/dynamic load-balancer. And it better be resilient.
A service is a type of kubernetes resource that causes a proxy to be configured to forward requests to a set of pods.
The set of pods that will receive traffic is determined by the selector, which matches labels assigned to the pods when they were created
That service uses its own network. By default, its type is "ClusterIP"; it has its own IP.
Here is the communication path between two pods:
It uses a kube-proxy.
This proxy uses itself a netfilter.
netfilter is a rules-based packet processing engine.
It runs in kernel space and gets a look at every packet at various points in its life cycle.
It matches packets against rules and when it finds a rule that matches it takes the specified action.
Among the many actions it can take is redirecting the packet to another destination.
In this mode, kube-proxy:
- opens a port (10400 in the example above) on the local host interface to listen for requests to the test-service,
- inserts netfilter rules to reroute packets destined for the service IP to its own port, and
- forwards those requests to a pod on port 8080.
That is how a request to 10.3.241.152:80
magically becomes a request to 10.0.2.2:8080
.
Given the capabilities of netfilter all that’s required to make this all work for any service is for kube-proxy to open a port and insert the correct netfilter rules for that service, which it does in response to notifications from the master api server of changes in the cluster.
But:
There’s one more little twist to the tale.
I mentioned above that user space proxying is expensive due to marshaling packets.
In kubernetes 1.2, kube-proxy gained the ability to run in iptables mode.
In this mode, kube-proxy mostly ceases to be a proxy for inter-cluster connections, and instead delegates to netfilter the work of detecting packets bound for service IPs and redirecting them to pods, all of which happens in kernel space.
In this mode kube-proxy’s job is more or less limited to keeping netfilter rules in sync.
The network schema becomes:
However, this is not a good fit for external (public facing) communication, which needs an external fixed IP.
You have dedicated services for that: nodePort and LoadBalancer:
A service of type NodePort is a ClusterIP service with an additional capability: it is reachable at the IP address of the node as well as at the assigned cluster IP on the services network.
The way this is accomplished is pretty straightforward:
When kubernetes creates a NodePort service, kube-proxy allocates a port in the range 30000–32767 and opens this port on the eth0
interface of every node (thus the name “NodePort”).
Connections to this port are forwarded to the service’s cluster IP.
You get:
A Loadalancer is more advancer, and allows to expose services using stand ports.
See the mapping here:
$ kubectl get svc service-test
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
openvpn 10.3.241.52 35.184.97.156 80:32213/TCP 5m
However:
Services of type LoadBalancer have some limitations.
- You cannot configure the lb to terminate https traffic.
- You can’t do virtual hosts or path-based routing, so you can’t use a single load balancer to proxy to multiple services in any practically useful way.
These limitations led to the addition in version 1.2 of a separate kubernetes resource for configuring load balancers, called an Ingress.
The Ingress API supports TLS termination, virtual hosts, and path-based routing. It can easily set up a load balancer to handle multiple backend services.
The implementation follows a basic kubernetes pattern: a resource type and a controller to manage that type.
The resource in this case is an Ingress, which comprises a request for networking resources
For instance:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: test-ingress
annotations:
kubernetes.io/ingress.class: "gce"
spec:
tls:
- secretName: my-ssl-secret
rules:
- host: testhost.com
http:
paths:
- path: /*
backend:
serviceName: service-test
servicePort: 80
The ingress controller is responsible for satisfying this request by driving resources in the environment to the necessary state.
When using an Ingress you create your services as type NodePort and let the ingress controller figure out how to get traffic to the nodes.
There are ingress controller implementations for GCE load balancers, AWS elastic load balancers, and for popular proxies such as NGiNX and HAproxy.