I have a working cluster on Azure AKS whose services all respond behind a Helm-installed NGINX Ingress. (This ended up being Azure specific.)
My question is: Why does my connection to the services / pods in this cluster periodically get severed (apparently by some sort of idle timeout), and why does that connection severing appear to also coincide with my Az AKS Browse UI connection getting cut?
This is an effort to get a final answer on what exactly triggers the time-out that causes the local 'Browse' proxy UI to disconnect from my Cluster (more background on why I am asking to follow).
When working with Azure AKS from the Az CLI you can launch the local Browse UI from the terminal using:
az aks browse --resource-group <resource-group> --name <cluster-name>
This works fine and pops open a browser window with the dashboard UI (yay).
In your terminal you will see something along the lines of:
Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
Forwarding from 127.0.0.1:8001 -> 9090
Forwarding from [::1]:8001 -> 9090
Handling connection for 8001
Handling connection for 8001
Handling connection for 8001
If you leave the connection to your cluster idle for a few minutes (i.e. you don't interact with the UI), you should see the following line printed, indicating that the connection has timed out:
E0605 13:39:51.940659 5704 portforward.go:178] lost connection to pod
One thing I still don't understand is whether OTHER activity inside the cluster can prolong this timeout. Regardless, once you see the above you are essentially in the same place I am, which means we can talk about the fact that all of my other connections OUT from pods in that cluster also appear to have been closed by whatever timeout process is responsible for cutting ties with the AKS Browse UI.
So what's the issue?
The reason this is a problem for me is that I have a Service running a Ghost Blog pod which connects to a remote MySQL database using an npm package called 'Knex'. As it happens, the newer versions of Knex have a bug (which has yet to be addressed) whereby if the connection between the Knex client and a remote DB server is cut and needs to be restored, it doesn't reconnect and just hangs indefinitely.
NGINX Error 504 Gateway Time-out
In my situation that resulted in the NGINX Ingress giving me an Error 504 Gateway Time-out. Ghost stopped responding after the idle timeout cut the Knex connection, because Knex never restored the broken connection to the database server.
Fine. I rolled back Knex and everything works great.
But why the heck are my pod connections being severed from my Database to begin with?
Hence this question to hopefully save some future person days of attempting to troubleshoot phantom issues that relate back to Kubernetes (maybe Azure specific, maybe not) cutting connections after a service / pod has been idle for some time.
Short Answer:
That idle timeout is both standard AND required, although you MAY be able to modify it from its 4 minute default (see here: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout). That being said, there is no way to ELIMINATE it entirely for any traffic that is heading externally OUT from the Load Balancer IP; the longest duration currently supported is 30 minutes.
There is no native Azure way to get around an idle connection being cut.
So, as per the original question, the best way (I feel) to handle this is to leave the timeout at 4 minutes (since it has to exist anyway) and then set up your infrastructure to close idle connections gracefully before they hit the Load Balancer timeout.
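To make "close idle connections before the Load Balancer does" concrete, here is a minimal sketch in Python. It is not what Ghost or Knex do internally; `IdleAwareConnection`, the `connect` factory and the 30 second safety margin are all hypothetical names and values chosen for illustration. The only figure taken from above is the 4 minute timeout.

```python
import time
from typing import Any, Callable, Optional

LB_IDLE_TIMEOUT_S = 4 * 60  # the Load Balancer timeout discussed above
SAFETY_MARGIN_S = 30        # assumption: recycle 30 s before the cut-off


class IdleAwareConnection:
    """Hands out a connection, replacing it if it has sat idle too long."""

    def __init__(
        self,
        connect: Callable[[], Any],  # hypothetical factory that opens a DB connection
        max_idle_s: float = LB_IDLE_TIMEOUT_S - SAFETY_MARGIN_S,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._connect = connect
        self._max_idle_s = max_idle_s
        self._clock = clock
        self._conn: Optional[Any] = None
        self._last_used = 0.0

    def get(self) -> Any:
        """Return a connection that has been idle for less than max_idle_s."""
        now = self._clock()
        if self._conn is not None and now - self._last_used >= self._max_idle_s:
            # Close on our own terms instead of waiting for a silent drop.
            close = getattr(self._conn, "close", None)
            if callable(close):
                close()
            self._conn = None
        if self._conn is None:
            self._conn = self._connect()
        self._last_used = now
        return self._conn
```

Callers would go through `get()` every time they need the connection, so a connection that has been quiet for 3.5 minutes is swapped out before the next query rather than failing in the middle of it.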
Our Solutions
For our Ghost Blog (which hits a MySQL database) I was able to roll back Knex as mentioned above, which made the Ghost process able to handle a DB disconnect / reconnect scenario.
What about Rails?
Yep. Same problem.
We also run a separate Rails based app on AKS which connects to a remote Postgres DB (not on Azure). For that one we ended up implementing PgBouncer (https://github.com/pgbouncer/pgbouncer) as an additional container on our cluster via the awesome directions found here: https://github.com/edoburu/docker-pgbouncer/tree/master/examples/kubernetes/singleuser
Generally, anyone attempting to access a remote database FROM AKS is probably going to need to implement an intermediary connection pooling solution. The pooling service (PgBouncer for us) sits in the middle and keeps track of how long a connection has been idle so that your worker processes don't need to care about that.
If you start to approach the Load Balancer timeout the connection pooling service will throw out the old connection and make a new fresh one (resetting the timer). That way when your client sends data down the pipe it lands on your Database server as anticipated.
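The recycling behaviour described here can be modelled with a toy pool in Python. This is only an illustration of the idea, not PgBouncer's implementation (in PgBouncer itself, settings such as `server_idle_timeout` govern when idle server connections are closed). `RecyclingPool`, the `connect` factory and the 180 second limit are hypothetical; the limit just needs to sit below the 4 minute Load Balancer timeout.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Tuple


class RecyclingPool:
    """Toy pool: connections idle for max_idle_s or longer are replaced on checkout."""

    def __init__(
        self,
        connect: Callable[[], Any],   # hypothetical factory that opens a DB connection
        max_idle_s: float = 180.0,    # assumption: safely below the 240 s LB timeout
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._connect = connect
        self._max_idle_s = max_idle_s
        self._clock = clock
        # Each entry is (time the connection was returned, connection).
        self._idle: Deque[Tuple[float, Any]] = deque()

    def acquire(self) -> Any:
        """Hand out a recently used connection, or open a fresh one."""
        now = self._clock()
        while self._idle:
            returned_at, conn = self._idle.pop()  # most recently returned first
            if now - returned_at < self._max_idle_s:
                return conn
            conn.close()  # too close to the cut-off: discard it
        return self._connect()

    def release(self, conn: Any) -> None:
        """Return a connection to the pool and start its idle timer."""
        self._idle.append((self._clock(), conn))
```

A worker calls `acquire()`, runs its query, then calls `release()`. The worker never has to think about the idle timer, because any connection that has been sitting too long is thrown away at checkout.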
In closing
This was an INSANELY frustrating bug / case to track down. We burned at least 2 dev-ops days figuring the first solution out but even KNOWING that it was probably the same issue we burned another 2 days this time around.
Even extending the timer beyond the 4 minute default wouldn't really help, since that would just make the problem more intermittent and harder to troubleshoot. I guess I just hope that anyone who has trouble connecting from Azure AKS / Kubernetes to a remote DB is better at googling than I am and can save themselves some pain.
Thanks to MSFT Support (Kris you are the best) for the hint on the LB timer and to the dude who put together PgBouncer in a container so I didn't have to reinvent the wheel.