Method for finding memory leak in large Java heap

Posted 2019-01-30 01:34

I have to find a memory leak in a Java application. I have some experience with this but would like advice on a methodology or strategy. Any references and advice are welcome.

About our situation:

  1. Heap dumps are larger than 1 GB.
  2. We have heap dumps from 5 occasions.
  3. We don't have any test case to provoke this. It only happens in the (massive) system test environment after at least a week's usage.
  4. The system is built on an internally developed legacy framework with so many design flaws that they are impossible to count.
  5. Nobody understands the framework in depth. It has been transferred to one guy in India who barely keeps up with answering e-mails.
  6. We have taken snapshot heap dumps over time and concluded that no single component is increasing over time. Everything grows slowly.
  7. The above points us in the direction that it is the framework's homegrown ORM system that increases its usage without limits. (This system maps objects to files?! So it is not really an ORM.)

Question: What is the methodology that helped you succeed in hunting down leaks in an enterprise-scale application?

7 answers
We Are One
#2 · 2019-01-30 02:08

This answer expands upon @Will-Hartung's. I applied the same process to diagnose one of my memory leaks and thought that sharing the details would save other people time.

The idea is to have Postgres 'plot' time vs. memory usage of each class, draw a line that summarizes the growth, and identify the objects that are growing the fastest:

    ^
    |
s   |  Legend:
i   |  *  - data point
z   |  -- - trend
e   |
(   |
b   |                 *
y   |                     --
t   |                  --
e   |             * --    *
s   |           --
)   |       *--      *
    |     --    *
    |  -- *
   --------------------------------------->
                      time

Convert your heap dumps (you need several) into a format that Postgres can consume conveniently. Start from the class histogram format (as printed by, for example, jmap -histo):

 num     #instances         #bytes  class name 
----------------------------------------------
   1:       4632416      392305928  [C
   2:       6509258      208296256  java.util.HashMap$Node
   3:       4615599      110774376  java.lang.String
   5:         16856       68812488  [B
   6:        278914       67329632  [Ljava.util.HashMap$Node;
   7:       1297968       62302464  
...

Turn it into a CSV file that includes the date and time of each heap dump:

2016.09.20 17:33:40,[C,4632416,392305928
2016.09.20 17:33:40,java.util.HashMap$Node,6509258,208296256
2016.09.20 17:33:40,java.lang.String,4615599,110774376
2016.09.20 17:33:40,[B,16856,68812488
...

Using this script:

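# Example invocation: convert.heap.hist.to.csv.pl -f heap.2016.09.20.17.33.40.txt -dt "2016.09.20 17:33:40"  >> heap.csv

use strict;
use warnings;
use Getopt::Long;

# Print an error message with a usage hint and exit
sub usage {
    my ($msg) = @_;
    die "$msg\nUsage: $0 -f <histogram file> -dt <datetime>\n";
}

my $file;
my $dt;
GetOptions(
    "f=s"  => \$file,
    "dt=s" => \$dt
) or usage("Error in command line arguments");
usage("Both -f and -dt are required") unless defined $file && defined $dt;

open my $fh, '<', $file or die "Cannot open $file: $!";

while (my $line = <$fh>) {
    $line =~ s/\R//g;    # remove newlines
    # Match histogram rows such as:
    #    1:       4442084      369475664  [C
    # \S+ also accepts class names containing ';' (e.g. [Ljava.util.HashMap$Node;)
    my ($instances, $size, $class) = ($line =~ /^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)/);
    if (defined $instances) {
        print "$dt,$class,$instances,$size\n";
    }
}

close($fh);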

Create a table to put the data in (bytes is a bigint so that very large heaps do not overflow a 32-bit integer):

CREATE TABLE heap_histogram (
    histwhen timestamp without time zone NOT NULL,
    class character varying NOT NULL,
    instances integer NOT NULL,
    bytes bigint NOT NULL
);

Copy the data into your new table:

\COPY heap_histogram FROM 'heap.csv'  WITH DELIMITER ',' CSV ;

Run the slope query against size (number of bytes):

SELECT class, REGR_SLOPE(bytes,extract(epoch from histwhen)) as slope
    FROM public.heap_histogram
    GROUP BY class
    HAVING REGR_SLOPE(bytes,extract(epoch from histwhen)) > 0
    ORDER BY slope DESC
    ;

Interpret the results:

         class             |        slope         
---------------------------+----------------------
 java.util.ArrayList       |     71.7993806279174
 java.util.HashMap         |     49.0324576155785
 java.lang.String          |     31.7770770326123
 joe.schmoe.BusinessObject |     23.2036817108056
 java.lang.ThreadLocal     |     20.9013528767851

The slope is bytes added per second (since the unit of epoch is in seconds). If you use instances instead of size, then that's the number of instances added per second.
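
To make the slope concrete, here is a minimal Python sketch of the same least-squares calculation, applied per class to rows shaped like the CSV above. The function names (slope, slopes_by_class) and the sample numbers are made up for illustration; they are not from the original data. Timestamps are treated as UTC, which is an assumption that mirrors a timestamp column without a time zone.

```python
from collections import defaultdict
from datetime import datetime, timezone


def slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        return None  # a single sample has no trend
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # all samples share one timestamp
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def slopes_by_class(rows):
    """rows: iterable of (datetime string, class name, instances, bytes).

    Returns (class, bytes per second) pairs with a positive slope,
    fastest-growing first.
    """
    series = defaultdict(list)
    for when, cls, _instances, size in rows:
        parsed = datetime.strptime(when, "%Y.%m.%d %H:%M:%S")
        epoch = parsed.replace(tzinfo=timezone.utc).timestamp()
        series[cls].append((epoch, float(size)))
    growing = []
    for cls, points in series.items():
        s = slope(points)
        if s is not None and s > 0:
            growing.append((cls, s))
    return sorted(growing, key=lambda item: item[1], reverse=True)


# Hypothetical sample: two classes measured one hour apart.
sample = [
    ("2016.09.20 17:00:00", "joe.schmoe.BusinessObject", 1000, 100000),
    ("2016.09.20 18:00:00", "joe.schmoe.BusinessObject", 2000, 136000),
    ("2016.09.20 17:00:00", "java.lang.Integer", 500, 8000),
    ("2016.09.20 18:00:00", "java.lang.Integer", 500, 8000),
]
for cls, s in slopes_by_class(sample):
    print(f"{cls}: {s:.2f} bytes/second")
```

In this sample only the first class grows (36000 bytes over 3600 seconds), so it is the only one reported; the class whose size stays flat has a slope of zero and is filtered out, just as the HAVING clause does.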

One of my lines of code that created joe.schmoe.BusinessObject was responsible for the memory leak. It created the object and appended it to an array without checking whether it already existed. The other objects were also created along with the BusinessObject near the leaking code.

孤傲高冷的网名
#3 · 2019-01-30 02:11

I've used jhat. It is a bit harsh, but it depends on the kind of framework you have.

做个烂人
#4 · 2019-01-30 02:17

Can you accelerate time? That is, can you write a dummy test client that forces the system to do a week's worth of calls and requests in a few minutes or hours? Such a client is your best friend; if you don't have one, write one.
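
One possible shape for such a client, sketched in Python under stated assumptions: the recorded traffic is available as (timestamp, payload) pairs, and send is a placeholder for whatever call drives your system. Both replay and send are hypothetical names, not part of any existing tool.

```python
import time


def replay(events, send, speedup, sleep=time.sleep):
    """Replay (timestamp_seconds, payload) events, compressing gaps by `speedup`.

    `send` is a placeholder for the call that exercises the system under test.
    Returns the number of events sent.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    previous = None
    sent = 0
    for timestamp, payload in sorted(events, key=lambda event: event[0]):
        if previous is not None:
            # Wait for the recorded gap, shortened by the speedup factor.
            sleep((timestamp - previous) / speedup)
        send(payload)
        previous = timestamp
        sent += 1
    return sent


# Example: three requests recorded an hour apart, replayed 3600x faster.
# A fake sleep records the waits instead of actually pausing.
waits = []
sent_payloads = []
recorded = [(0, "login"), (3600, "save"), (7200, "logout")]
count = replay(recorded, sent_payloads.append, speedup=3600, sleep=waits.append)
print(count, waits, sent_payloads)
```

With a speedup of 3600, a week of recorded traffic (604800 seconds) replays in about 168 seconds, provided the system can keep up with the compressed rate.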

We used NetBeans a while ago to analyse heap dumps. It can be a bit slow, but it was effective. Eclipse just crashed, and so did the 32-bit Windows tools.

If you have access to a 64-bit system or a Linux system with 3 GB or more of memory, you will find it easier to analyse the heap dumps.

Do you have access to change logs and incident reports? Large scale enterprises will normally have change management and incident management teams and this may be useful in tracking down when problems started happening.

When did it start going wrong? Talk to people and try and get some history. You may get someone saying, "Yeah, it was after they fixed XYZ in patch 6.43 that we got weird stuff happening".

ら.Afraid
#5 · 2019-01-30 02:23

I've had success with IBM Heap Analyzer. It offers several views of the heap, including largest drop-off in object size, most frequently occurring objects, and objects sorted by size.

爷的心禁止访问
#6 · 2019-01-30 02:24

Take a look at Eclipse Memory Analyzer. It's a great tool (and self-contained; it does not require Eclipse itself to be installed) which 1) can open very large heaps very fast and 2) has some pretty good automatic detection tools. The latter isn't perfect, but EMA provides a lot of really nice ways to navigate through and query the objects in the dump to find any possible leaks.

I've used it in the past to help hunt down suspicious leaks.

放荡不羁爱自由
#7 · 2019-01-30 02:27

If it's happening after a week's usage, and your application is as byzantine as you describe, perhaps you're better off restarting it every week?

I know it's not fixing the problem, but it may be a time-effective solution. Are there time windows when you can have outages? Can you load balance and fail over one instance whilst keeping the second up? Perhaps you can trigger a restart when memory consumption breaches a certain limit (perhaps monitoring via JMX or similar).
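
As a rough sketch of the threshold idea, the Python below reads old-generation utilisation from the output of jstat -gcutil rather than from JMX, which is a substitute technique chosen to keep the example short. The 90% limit, the function names, and the sample numbers are all assumptions for illustration; the actual restart is left to whatever mechanism you already use.

```python
import subprocess


def read_gcutil(pid):
    """Run `jstat -gcutil <pid>` and return its text output (needs a JDK on PATH)."""
    result = subprocess.run(
        ["jstat", "-gcutil", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def old_gen_percent(gcutil_output):
    """Return the old-generation utilisation (column "O") as a float."""
    rows = [line.split() for line in gcutil_output.splitlines() if line.strip()]
    header, values = rows[0], rows[-1]
    return float(values[header.index("O")])


def should_restart(gcutil_output, limit_percent=90.0):
    """True when old-generation utilisation exceeds the (assumed) limit."""
    return old_gen_percent(gcutil_output) > limit_percent


# Illustrative output in the layout printed by Java 8's jstat; the numbers are made up.
sample = """\
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  96.88  17.49  93.40  95.32  90.29     12    0.215     2    0.108    0.323
"""
print(old_gen_percent(sample), should_restart(sample))
```

A single high reading can simply mean that a collection is due, so in practice you would probably want several consecutive readings above the limit before acting on it.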
