Tracking down a memory leak / garbage-collection i

2019-01-09 23:57发布

This is a problem I have been trying to track down for a couple months now. I have a java app running that processes xml feeds and stores the result in a database. There have been intermittent resource problems that are very difficult to track down.

Background: On the production box (where the problem is most noticeable), i do not have particularly good access to the box, and have been unable to get Jprofiler running. That box is a 64bit quad-core, 8gb machine running centos 5.2, tomcat6, and java 1.6.0.11. It starts with these java-opts

JAVA_OPTS="-server -Xmx5g -Xms4g -Xss256k -XX:MaxPermSize=256m -XX:+PrintGCDetails -
XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution -XX:+UseParNewGC"

The technology stack is the following:

  • Centos 64-bit 5.2
  • Java 6u11
  • Tomcat 6
  • Spring/WebMVC 2.5
  • Hibernate 3
  • Quartz 1.6.1
  • DBCP 1.2.1
  • Mysql 5.0.45
  • Ehcache 1.5.0
  • (and of course a host of other dependencies, notably the jakarta-commons libraries)

The closest I can get to reproducing the problem is a 32-bit machine with lower memory requirements. That I do have control over. I have probed it to death with JProfiler and fixed many performance problems (synchronization issues, precompiling/caching xpath queries, reducing the threadpool, and removing unnecessary hibernate pre-fetching, and overzealous "cache-warming" during processing).

In each case, the profiler showed these as taking up huge amounts of resources for one reason or another, and that these were no longer primary resource hogs once the changes went in.

The Problem: The JVM seems to completely ignore the memory usage settings, fills all memory and becomes unresponsive. This is an issue for the customer facing end, who expects a regular poll (5 minute basis and 1-minute retry), as well for our operations teams, who are constantly notified that a box has become unresponsive and have to restart it. There is nothing else significant running on this box.

The problem appears to be garbage collection. We are using the ConcurrentMarkSweep (as noted above) collector because the original STW collector was causing JDBC timeouts and became increasingly slow. The logs show that as the memory usage increases, that is begins to throw cms failures, and kicks back to the original stop-the-world collector, which then seems to not properly collect.

However, running with jprofiler, the "Run GC" button seems to clean up the memory nicely rather than showing an increasing footprint, but since I can not connect jprofiler directly to the production box, and resolving proven hotspots doesnt seem to be working I am left with the voodoo of tuning Garbage Collection blind.

What I have tried:

  • Profiling and fixing hotspots.
  • Using STW, Parallel and CMS garbage collectors.
  • Running with min/max heap sizes at 1/2,2/4,4/5,6/6 increments.
  • Running with permgen space in 256M increments up to 1Gb.
  • Many combinations of the above.
  • I have also consulted the JVM [tuning reference](http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html) , but can't really find anything explaining this behavior or any examples of _which_ tuning parameters to use in a situation like this.
  • I have also (unsuccessfully) tried jprofiler in offline mode, connecting with jconsole, visualvm, but I can't seem to find anything that will interperet my gc log data.

Unfortunately, the problem also pops up sporadically, it seems to be unpredictable, it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up.

Can anyone give any advice as to:
a) Why a JVM is using 8 physical gigs and 2 gb of swap space when it is configured to max out at less than 6.
b) A reference to GC tuning that actually explains or gives reasonable examples of when and what kind of setting to use the advanced collections with.
c) A reference to the most common java memory leaks (i understand unclaimed references, but I mean at the library/framework level, or something more inherenet in data structures, like hashmaps).

Thanks for any and all insight you can provide.

EDIT
Emil H:
1) Yes, my development cluster is a mirror of production data, down to the media server. The primary difference is the 32/64bit and the amount of RAM available, which I can't replicate very easily, but the code and queries and settings are identical.

2) There is some legacy code that relies on JaxB, but in reordering the jobs to try to avoid scheduling conflicts, I have that execution generally eliminated since it runs once a day. The primary parser uses XPath queries which call down to the java.xml.xpath package. This was the source of a few hotspots, for one the queries were not being pre-compiled, and two the references to them were in hardcoded strings. I created a threadsafe cache (hashmap) and factored the references to the xpath queries to be final static Strings, which lowered resource consumption significantly. The querying still is a large part of the processing, but it should be because that is the main responsibility of the application.

3) An additional note, the other primary consumer is image operations from JAI (reprocessing images from a feed). I am unfamiliar with java's graphic libraries, but from what I have found they are not particularly leaky.

(thanks for the answers so far, folks!)

UPDATE:
I was able to connect to the production instance with VisualVM, but it had disabled the GC visualization / run-GC option (though i could view it locally). The interesting thing: The heap allocation of the VM is obeying the JAVA_OPTS, and the actual allocated heap is sitting comfortably at 1-1.5 gigs, and doesnt seem to be leaking, but the box level monitoring still shows a leak pattern, but it is not reflected in the VM monitoring. There is nothing else running on this box, so I am stumped.

7条回答
你好瞎i
2楼-- · 2019-01-10 00:57

I had the same problem, with couple of differences..

My technology is the following:

grails 2.2.4

tomcat7

quartz-plugin 1.0

I use two datasources on my application. That is a particularity determinant to bug causes..

Another thing to consider is that quartz-plugin, inject hibernate session in quartz threads, just like @liam says, and quartz threads still alive, untill I finish application.

My problem was a bug on grails ORM combined with the way the plugin handle session and my two datasources.

Quartz plugin had a listener to init and destroy hibernate sessions

public class SessionBinderJobListener extends JobListenerSupport {

    public static final String NAME = "sessionBinderListener";

    private PersistenceContextInterceptor persistenceInterceptor;

    public String getName() {
        return NAME;
    }

    public PersistenceContextInterceptor getPersistenceInterceptor() {
        return persistenceInterceptor;
    }

    public void setPersistenceInterceptor(PersistenceContextInterceptor persistenceInterceptor) {
        this.persistenceInterceptor = persistenceInterceptor;
    }

    public void jobToBeExecuted(JobExecutionContext context) {
        if (persistenceInterceptor != null) {
            persistenceInterceptor.init();
        }
    }

    public void jobWasExecuted(JobExecutionContext context, JobExecutionException exception) {
        if (persistenceInterceptor != null) {
            persistenceInterceptor.flush();
            persistenceInterceptor.destroy();
        }
    }
}

In my case, persistenceInterceptor instances AggregatePersistenceContextInterceptor, and it had a List of HibernatePersistenceContextInterceptor. One for each datasource.

Every opertion do with AggregatePersistenceContextInterceptor its passed to HibernatePersistence, without any modification or treatments.

When we calls init() on HibernatePersistenceContextInterceptor he increment the static variable below

private static ThreadLocal<Integer> nestingCount = new ThreadLocal<Integer>();

I don't know the pourpose of that static count. I just know he it's incremented two times, one per datasource, because of the AggregatePersistence implementation.

Until here I just explain the cenario.

The problem comes now...

When my quartz job finish, the plugin calls the listener to flush and destroy hibernate sessions, like you can see in source code of SessionBinderJobListener.

The flush occurs perfectly, but the destroy not, because HibernatePersistence, do one validation before close hibernate session... It examines nestingCount to see if the value is grather than 1. If the answer is yes, he not close the session.

Simplifying what was did by Hibernate:

if(--nestingCount.getValue() > 0)
    do nothing;
else
    close the session;

That's the base of my memory leak.. Quartz threads still alive with all objects used in session, because grails ORM not close session, because of a bug caused because I have two datasources.

To solve that, I customize the listener, to call clear before destroy, and call destroy two times, (one for each datasource). Ensuring my session was clear and destroyed, and if the destroy fails, he was clear at least.

查看更多
登录 后发表回答