We're running a Jersey (1.x) based service in Tomcat on AWS in an array of ~20 instances. Periodically an instance "goes bad": over the course of about four hours its heap and CPU usage increase until the heap is exhausted and the CPU is pinned. At that point it is automatically removed from the load balancer and eventually killed.
Examining heap dumps from these instances, ~95% of the memory has been used up by an instance of java.lang.ref.Finalizer, which is holding onto all sorts of stuff, but most or all of it is related to HTTPS connections (sun.net.www.protocol.https.HttpsURLConnectionImpl, sun.security.ssl.SSLSocketImpl, and various crypto objects). These are connections we're making to an external web service using Jersey's client library. A heap dump from a "healthy" instance doesn't indicate any sort of issue.
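For context, the client calls are roughly of this shape. This is a simplified sketch, not our exact code; the endpoint URL and class names are placeholders, and the explicit ClientResponse.close() in the finally block is how Jersey 1.x is supposed to release the underlying connection rather than leaving it to finalization.

import com.sun.jersey.api.client.Client;
import com.sun.jersey.api.client.ClientResponse;
import com.sun.jersey.api.client.WebResource;

public class ExternalServiceCall {

    // Placeholder for the external HTTPS endpoint we call.
    private static final String EXTERNAL_URL = "https://external.example.com/api/resource";

    private static final Client CLIENT = Client.create();

    public static String fetch() {
        WebResource resource = CLIENT.resource(EXTERNAL_URL);
        ClientResponse response = resource.get(ClientResponse.class);
        try {
            // Reading the entity consumes the response body.
            return response.getEntity(String.class);
        } finally {
            // Without this close(), the underlying HttpsURLConnection/SSLSocket
            // may only be cleaned up when the Finalizer thread gets to it.
            response.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetch());
    }
}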
Under relatively low load, instances run for days or weeks without issue. As load increases, so does the frequency of instance failure (several per day by the time average CPU reaches ~40%).
Our JVM args are:
-XX:+UseG1GC -XX:MaxPermSize=256m -Xmx1024m -Xms1024m
I'm in the process of adding JMX logging for garbage-collection metrics, but I'm not entirely clear on what I should be looking for. At this point I'm primarily looking for ideas about what could kick off this sort of failure, or for additional targets to investigate.
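In case it's useful, this is roughly what I'm wiring up: a minimal sketch using the standard java.lang.management MXBeans (the polling interval and output format here are arbitrary). My working assumption is that the object-pending-finalization count is the interesting number to watch alongside the G1 collection counts and times.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class GcMetricsLogger {

    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Cumulative collection count and total pause time (ms) per collector.
                System.out.printf("%s: count=%d time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            // Number of objects waiting on the finalizer queue; a steadily growing
            // value here would suggest the Finalizer thread is falling behind.
            System.out.printf("pendingFinalization=%d heapUsed=%d%n",
                    memory.getObjectPendingFinalizationCount(),
                    memory.getHeapMemoryUsage().getUsed());

            Thread.sleep(60_000L);
        }
    }
}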