I'm trying to understand in depth how forwarding from a publicly exposed load balancer's layer-2 VIPs to services' cluster-IPs works. I've read a high-level overview of how MetalLB does it, and I've tried to replicate it manually by setting up a keepalived/ucarp VIP and iptables rules. I must be missing something, however, as it doesn't work ;-]
Steps I took:

- Created a cluster with kubeadm consisting of a master + 3 nodes running k8s-1.17.2 + calico-3.12 on libvirt/KVM VMs on a single computer. All VMs are in the 192.168.122.0/24 virtual network.
- Created a simple 2-pod deployment and exposed it as a NodePort service with externalTrafficPolicy set to Cluster:

$ kubectl get svc dump-request
NAME           TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
dump-request   NodePort   10.100.234.120   <none>        80:32292/TCP   65s
I've verified that I can reach it from the host machine on port 32292 of every node's IP.
- Created a VIP with ucarp on all 3 nodes:

ucarp -i ens3 -s 192.168.122.21 -k 21 -a 192.168.122.71 -v 71 -x 71 -p dump -z -n -u /usr/local/sbin/vip-up.sh -d /usr/local/sbin/vip-down.sh

(example from knode1)

I've verified that I can ping the 192.168.122.71 VIP. I could even ssh through it to the VM that was currently holding the VIP.
Now, if kube-proxy was in iptables mode, I could also reach the service on its node port through the VIP at http://192.168.122.71:32292. However, to my surprise, in ipvs mode this always resulted in the connection timing out.
- Added an iptables rule on every node so that packets arriving for 192.168.122.71 are forwarded to the service's cluster-IP 10.100.234.120:

iptables -t nat -A PREROUTING -d 192.168.122.71 -j DNAT --to-destination 10.100.234.120

(Later I also tried narrowing the rule to only the relevant port, but it didn't change the results in any way:

iptables -t nat -A PREROUTING -d 192.168.122.71 -p tcp --dport 80 -j DNAT --to-destination 10.100.234.120:80
)
Results:

- In iptables mode, all requests to http://192.168.122.71:80/ timed out.
- In ipvs mode, it worked partially:
  - If the 192.168.122.71 VIP was held by a node that had a pod on it, about 50% of requests succeeded, and they were always served by the local pod. The app also saw the real remote IP of the host machine (192.168.122.1). The other 50% (presumably sent to the pod on another node) timed out.
  - If the VIP was held by a node without pods, all requests timed out.
I've also checked whether it makes any difference to keep the rule on all nodes at all times versus keeping it only on the node holding the VIP and deleting it when the VIP is released: the results were the same in both cases.
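
The second variant (rule present only on the node currently holding the VIP) could be wired into the ucarp hooks roughly as follows. This is a minimal sketch, not the scripts actually used in the setup: the names vip_up, vip_down and run and the APPLY switch are hypothetical, the /24 prefix is taken from the virtual network described earlier, and it assumes ucarp passes the interface name and the VIP as the first two arguments to its up/down scripts. Commands are only printed unless APPLY=1 is set.

```shell
#!/bin/sh
# Hypothetical helpers that vip-up.sh / vip-down.sh could share.
# Assumption: ucarp calls the hooks with $1 = interface, $2 = virtual IP.

CLUSTER_IP="${CLUSTER_IP:-10.100.234.120}"   # cluster-IP of dump-request
PORT="${PORT:-80}"                           # service port

# Print the command by default; execute it only when APPLY=1.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Called when this node becomes master: take the VIP, then add the DNAT rule.
vip_up() {
    dev="$1"; vip="$2"
    run ip addr add "${vip}/24" dev "$dev"
    run iptables -t nat -A PREROUTING -d "$vip" -p tcp --dport "$PORT" \
        -j DNAT --to-destination "${CLUSTER_IP}:${PORT}"
}

# Called when this node releases the VIP: remove the rule, then the address.
vip_down() {
    dev="$1"; vip="$2"
    run iptables -t nat -D PREROUTING -d "$vip" -p tcp --dport "$PORT" \
        -j DNAT --to-destination "${CLUSTER_IP}:${PORT}"
    run ip addr del "${vip}/24" dev "$dev"
}
```

For a dry run, source the file and call vip_up ens3 192.168.122.71; the two commands are printed instead of executed.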
Does anyone know why it doesn't work and how to fix it? I'd appreciate any help with this :)
You also need to add a MASQUERADE rule so that the source address is rewritten accordingly. For example:

iptables -t nat -A POSTROUTING -j MASQUERADE

Tested with ipvs.
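
The rule above masquerades everything leaving the node. A narrower variant might limit it to connections that originally targeted the VIP and were DNATed, using the conntrack match. This is only a sketch and has not been tested against this setup: the function names and the APPLY switch are hypothetical, and it assumes the conntrack match sees these connections the same way in ipvs mode. Note that with any MASQUERADE rule the pods see the node's address as the source rather than the real client IP.

```shell
#!/bin/sh
# Sketch: a MASQUERADE rule restricted to connections DNATed from the VIP.
# Prints the command by default; installs the rule only when APPLY=1.

# Rule specification without the -A/-C action, so it can be reused for both.
masq_rule_spec() {
    echo "POSTROUTING -m conntrack --ctstate DNAT --ctorigdst $1 -j MASQUERADE"
}

ensure_masquerade() {
    spec="$(masq_rule_spec "$1")"
    if [ "${APPLY:-0}" != "1" ]; then
        echo "iptables -t nat -A $spec"
        return 0
    fi
    # -C checks whether the rule already exists, so re-runs do not duplicate it.
    # $spec is intentionally unquoted so that it splits into separate arguments.
    iptables -t nat -C $spec 2>/dev/null || iptables -t nat -A $spec
}
```

Calling ensure_masquerade 192.168.122.71 prints the command for review; with APPLY=1 it would be run on the node holding the VIP.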