A use case for a manual GC invocation?

2019-01-20 19:50发布

问题:

I've read why is it bad practice to call System.gc(), and many others, e.g. this one describing a really disastrous misuse of System.gc(). However, there are cases when the GC takes too long and avoiding long pauses, e.g., by avoiding garbage is not exactly trivial and makes the code harder to maintain.

IMHO calling GC manually is fine in the following common scenario:

  • There are multiple interchangeable webserves with a failover in front of them.
  • Every server uses a few gigabytes of heap and the STW pauses take much longer than an average request.
  • The failover has no idea when GC is going to happen.
  • The failover can exempt a server when told to.

The algorithm seems to be trivial: Periodically select a server, let no more requests be send to it, let it finished its running requests, let it do its GC, and re-activate the server.

I wonder if I am missing something?1,2

What are the alternatives?

  1. Long running requests could be a problem, but let's assume there are none. Or simply limit waiting to some period comparable with what GC takes. Making a slow request even slower doesn't sound too bad.

  2. An option like -XX:+DisableExplicitGC could make the algorithm useless, but just don't use it (my use case includes dedicated servers I'm in charge of).

回答1:

For low latency trading systems I use the GC in an atypical manner.

You want to avoid any collection, even minor ones during the trading day. A way to do this is to create less than 300 KB of garbage a second. This is around 1 GB per hour, or up to 24 GB per day. When you use a 24 GB Eden space it means there is no minor/major GCs. However to ensure a GC occurs at a time which is planned and acceptable, a System.gc() is called at say 5 AM each morning and you have a clean Eden space for the next day.

There are times, when you create more garbage than expected e.g. failing to reconnect to a data source, and you might get a small number of minor collections. However this only happens when something is going wrong.

For more details http://vanillajava.blogspot.co.uk/2011/06/how-to-avoid-garbage-collection.html

by avoiding garbage is not exactly trivial and makes the code harder to maintain.

Avoiding garbage entirely is near impossible. However 300 KB/s is not so hard for a JVM. (You can have more than one JVM these days on one machine with 24 GB Eden spaces)

Note if you can keep below 50 KB/s of garbage you can run all week with out a GC.

Periodically select a server, let no more requests be send to it, let it finished its running requests, let it do its GC, and re-activate the server.

You can treat a GC as a failure to meet your SLA condition. In this case you can remove a server when you determine this is about to happen from your cluster, Full GC it and return it to the cluster.



回答2:

However, there are cases when the GC takes too long and avoiding long pauses

You have to distinguish between pauses caused by young-only, mixed/concurrent phase and full GCs.

In most cases it's the full GCs that you want to avoid while the other ones are acceptable, which can often be achieved with some GC-tuning and optimizing the code to avoid large allocation bursts.

In principle G1 should be able to run forever on young/mixed cycles and a full GC could be considered a soft failure. CMS can at least do so for many days with careful tuning, but may eventually succumb to fragmentation and require a full GC for compacting.

In cases where even the young GC pauses are not acceptable or garbage piles up too fast for the concurrent phase to handle with acceptable pause times then the strategy you outline may be a viable workaround.

Note that there also other use-cases for manually triggering GCs, such as GC-managed native resources, e.g. direct byte buffers, although those are fairly troublesome in general.

Also note that not all System.gc() calls are created equal, there is the ExplicitGCInvokesConcurrent option too.