Java Garbage Collector - Not running normally at r

2020-06-01 05:21发布

问题:

I have a program that is constantly running. Normally, it seems to garbage collect, and remain under about 8MB of memory usage. However, every weekend, it refuses to garbage collect unless I make an explicit call to it. However, if it nears the maximum heap size, it will still garbage collect. However the only reason this issue was noticed, is because it actually crashed from running out of memory on one weekend i.e. it must have reached the maximum heap size, and not run the garbage collector.

The following image (click to see) is a graph of the program's memory usage over a day. On the sides of the graph, you can see the normal behaviour of the program's memory usage, but the first large peak is what seems to start over the weekend. This particular graph is a strange example, because after I made an explicit call to the garbage collector, it ran successfully, but then it went and climbed back to the maximum heap size and successfully garbage collected on it's own twice.

What is going on here?

EDIT:

Ok, from the comments, it seems I haven't provided enough information. The program simply receives a stream of UDP packets, which are placed in a queue (set to have a maximum size of 1000 objects), which are then processed to have their data stored in a database. On average, it gets about 80 packets per second, but can peak to 150. It's running under Windows Server 2008.

The thing is, this activity is fairly consistent, and if anything, at the time that the memory usage starts it's steady climb, the activity should be lower, not higher. Mind you, the graph I posted above is the only one I have that extends back that far, since I only changed the Java Visual VM wrapper to keep graph data back far enough to see it this week, so I have no idea if it's exactly the same time every week, because I can't watch it over the weekend, as it's on a private network, and I'm not at work on the weekend.

Here is a graph of the next day:

This is pretty much what the memory usage looks like every other day of the week. The program is never restarted, and we only tell it to garbage collect on a Monday morning because of this issue. One week we tried restarting it on a Friday afternoon, and it still started climbing sometime over the weekend, so the time that we restart it doesn't seem to have anything to do with the memory usage next week.

The fact that it successfully garbage collects all those objects when we tell it to implies to me that the objects are collectable, it's just not doing it until it reaches the maximum heap size, or we explicitly call the garbage collector. A heap dump doesn't tell us anything, because when we try to perform one, it suddenly runs the garbage collector, and then outputs a heap dump, which of course looks perfectly normal at this point.

So I suppose I have two questions: Why is it suddenly not garbage collecting the way it does the rest of the week, and why is it that on one occassion, the garbage collection that occurs when it reaches the maximum heap size was unable to collect all those objects (i.e. why would there be references to so many objects that one time, when every other time there must not be)?

UPDATE:

This morning has been an interesting one. As I mentioned in the comments, the program is running on a client's system. Our contact in the client organisation reports that at 1am, this program failed, and he had to restart it manually when he got into work this morning, and that once again, the server time was incorrect. This is an issue we've had with them in the past, but until now, the issue never seemed to be related.

Looking through the logs that our program produces, we can deduce the following information:

  1. At 01:00, the server has somehow resynced it's time, setting it to 00:28.
  2. At 00:45 (according to the new, incorrect server time), one of the message processing threads in the program has thrown an out of memory error.
  3. However, the other message processing thread (there are two types of messages we receive, they are processed slightly differently, but they are both constantly coming in), continues to run, and as usual, the memory usage continues to climb with no garbage collection (as seen from the graphs we have been recording, once again).
  4. At 00:56, the logs stop, until about 7am when the program was restarted by our client. However, the memory usage graph, for this time, was still steadily increasing.

Unfortunately, because of the change in server time, this makes the times on our memory usage graph unreliable. However, it seems to be that it tried to garbage collect, failed, increased the heap space to the maximum available size, and killed that thread all at once. Now that the maximum heap space has increased, it's happy to use all of it without performing a major garbage collection.

So now I ask this: if the server time changes suddenly like it did, can that cause a problem with the garbage collection process?

回答1:

However the only reason this issue was noticed, is because it actually crashed from running out of memory on one weekend i.e. it must have reached the maximum heap size, and not run the garbage collector.

I think your diagnosis is incorrect. Unless there is something seriously broken about your JVM, then the application will only throw an OOME after it has just run a full garbage collect, and discovered that it still doesn't have enough free heap to proceed*.

I suspect that what is going on here is one or more of the following:

  • Your application has a slow memory leak. Each time you restart the application, the leaked memory gets reclaimed. So, if you restart the application regularly during the week, this could explain why it only crashes on the weekend.

  • Your application is doing computations that require varying amounts of memory to complete. On that weekend, someone sent it a request that required more memory that was available.

Running the GC by hand is not actually going to solve the problem in either case. What you need to do is to investigate the possibility of memory leaks, and also look at the application memory size to see if it is large enough for the tasks that are being performed.

If you can capture heap stats over a long period, a memory leak will show up as a downwards trend over time in the amount of memory available after full garbage collections. (That is the height of the longest "teeth" of the sawtooth pattern.) A workload-related memory shortage will probably show up as an occasional sharp downwards trend in the same measure over a relatively short period of time, followed by a recovery. You may see both, then you could have both things happening.

* Actually, the criteria for deciding when to give up with an OOME are a bit more complicated than this. They depend on certain JVM tuning options, and can include the percentage of time spent running the GC.

FOLLOWUP

@Ogre - I'd need a lot more information about your application to be able to answer that question (about memory leaks) with any specificity.

With your new evidence, there are two further possibilities:

  • Your application may be getting stuck in a loop that leaks memory as a result of the clock time-warping.

  • The clock time-warping may cause the GC to think that it is taking too large a percentage of run time and trigger an OOME as a result. This behaviour depends on your JVM settings.

Either way, you should lean hard on your client to get them to stop adjusting the system clock like that. (A 32 minute timewarp is way too much!!). Get them to install a system service to keep the clock in sync with network time hour by hour (or more frequent). Critically, get them to use a service with an option to adjusts the clock in small increments.

(Re the 2nd bullet: there is a GC monitoring mechanism in the JVM that measures the percentage of overall time that the JVM is spending running the GC, relative to doing useful work. This is designed to prevent the JVM from grinding to a halt when your application is really running out of memory.

This mechanism would be implemented by sampling the wall-clock time at various points. But if the wall-clock time is timewarped at a critical point, it is easy to see how the JVM may think that a particular GC run took much longer than it actually did ... and trigger the OOME.)



回答2:

If possible, I would setup the process to dump the heap if it runs out of memory - so you can analyze it if (when) it happens again. Not an answer, but a potential route to a solution.

Here are the JVM options, taken from Oracle's Java HotSpot VM Options page. (This assumes you have an Oracle JVM):

-XX:HeapDumpPath=./java_pid.hprof

Path to directory or filename for heap dump. Manageable. (Introduced in 1.4.2 update 12, 5.0 update 7.)

-XX:-HeapDumpOnOutOfMemoryError

Dump heap to file when java.lang.OutOfMemoryError is thrown. Manageable. (Introduced in 1.4.2 update 12, 5.0 update 7.)



回答3:

Ok guys, thanks for all your help. The correct answer however, turned out to have nothing to do with the program itself.

It seems that at the time that the memory usage started it's steady climb, the server was synchronising it's time from somewhere internal, though our client's IT contact has no idea where. Obviously, wherever it was coming from, was not a good clock, since the time was half an hour behind. We turned off this synchronisation, and now that I have checked it again this morning, the problem did not occur. So if the time on your system suddenly changes, apparently this causes problems for the garbage collector. At least that's what this implies to me.

As for why this wasn't occuring on any other parts of our system on this server (which are also written in Java), we probably simply haven't noticed, as they don't deal with as large a number of objects, and so they would never have hit their out of memory state.

I find this strange, since I would have thought that the invoking of the garbage collector would be completely related to memory usage, and not on the system time at all. Clearly, my understanding of how the garbage collector works is woefully inadequate.