JVM crashes when calling JNI function during gc

2019-07-08 23:14发布

We have a Java application that has a JNI layer that is multi-threaded (pthread) and will call back to the Java level upon messages received from the underlying network.

We notice that every time it crashes, it is caused by a gc. We can even simulate such a crash by manually trigger a gc by calling jmap -histo <pid> while the JNI layer is receiving messages from the network.

Given the information that we have read about the behaviours in JVM during GC in this post, https://stackoverflow.com/a/39401467/4523221, we still couldn't figure out why such crash is related to gc since JNI function calls are blocked during gc.

If anyone can shed light on this, it will be great. Thanks in advance.

The following is a stack trace that we have collected after a crash in our application.

Program terminated with signal 6, Aborted.
#0  0x0000003cdce325e5 in raise () from /lib64/libc.so.6
#1  0x0000003cdce33dc5 in abort () from /lib64/libc.so.6
#2  0x00007fdafe2516b5 in os::abort(bool) () from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#3  0x00007fdafe3efbf3 in VMError::report_and_die() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#4  0x00007fdafde2f3e2 in report_vm_error(char const*, int, char const*, char const*) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#5  0x00007fdafe24c1ff in os::PlatformEvent::park() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#6  0x00007fdafe20c538 in Monitor::ILock(Thread*) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#7  0x00007fdafe20c73f in Monitor::lock_without_safepoint_check() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#8  0x00007fdafe2e7a1f in SafepointSynchronize::block(JavaThread*) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#9  0x00007fdafe39bcdd in JavaThread::check_safepoint_and_suspend_for_native_trans(JavaThread*) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#10 0x00007fdafe0123d8 in jni_NewByteArray ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#11 0x00007fdaa447b7d1 in JNIEnv_::NewByteArray (this=0x7fdaf800c9f8, len=7)
    at /usr/java/jdk1.8.0_65/include/jni.h:1643
---omitted---
#19 0x0000003cdd20b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#20 0x00007fdafe24c133 in os::PlatformEvent::park() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#21 0x00007fdafe20ce27 in Monitor::IWait(Thread*, long) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#22 0x00007fdafe20d5f0 in Monitor::wait(bool, long, bool) ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
---Type <return> to continue, or q <return> to quit---
#23 0x00007fdafe39ed51 in Threads::destroy_vm() ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#24 0x00007fdafdfff931 in jni_DestroyJavaVM ()
   from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#25 0x00007fdafe91a63d in JavaMain () from /usr/java/jdk1.8.0_65/bin/../lib/amd64/jli/libjli.so
#26 0x0000003cdd207aa1 in start_thread () from /lib64/libpthread.so.0
#27 0x0000003cdcee8aad in clone () from /lib64/libc.so.6

The way we obtained JNIEnv* e.g.

JNIEnv *env = 0;
jint result = jvm->GetEnv((void **) &env, JNI_VERSION_1_8);
if (result != JNI_OK) {
    result = jvm->AttachCurrentThread((void **) &env, NULL);

1条回答
2楼-- · 2019-07-08 23:22

After spending days investigating this JNI issue, we have finally found out the reason and I would like to share our experience here so that hopefully it will help others.

First of all, the reason why we needed to use JNI in the first place was because we needed to make use of a 3rd party network library that was a Linux native lib, and unfortunately that was the cause of our problem.

The library provided us a callback handle that we implemented to receive incoming network messages from it, and this callback, we later found out, was simply a signal handler. So, it means that this signal handler would get called whenever a signal popped up, even during gc.

Since C threads keep running during safepoints in JVM, it would have been fine if those C threads weren't attached to the JVM, otherwise disasters would certainly strike.

Here is kind of what we thought had happened. (everything below happened in the JNI layer)

  1. App starts. We init and cached JNI resources, e.g. Jmv*, Method ID, etc.
  2. We registered a C function to the library to receive messages. The C function is a function that would call JNI APIs to allocate memory to accommodate the received messages and pass them onto Java. After that, we just started to wait for incoming messages.
  3. When a message finally arrived, the C function mentioned above was called to handle the message, but wait... what was this thread that's handling the callback. That would have been the main thread or hmm... any available threads.
  4. As taught in any JNI textbook, we attached the thread to JVM first if not yet done so before calling any JNI APIs. Great!
  5. Now, during a GC, all Java threads were blocked, but the C layer was still running. At this critical moment, if a message arrived, some thread (any threads) was called up to handle the message. But what threads were still available during gc? All application threads were blocked and the only ones that were still running at this moment (our guess) were unfortunately the gc threads.

The gdb stacktrace that we were seeing was basically what happened when a gc thread that was actually in a middle of doing some work on the heap and then got a call from our application to do some application work and then a few JNI API calls... BOOM

Solution:

  • Have a C thread that handles the library callback
  • Never attach that thread to JVM
  • Have other threads attach to the JVM to do the Native-Java transition.

p.s. maybe some of the details weren't exactly accurate, so any JVM expert advice is welcomed. I will try to correct them as advised.

Thanks

Update.1 (@apangin): We have another gdb stacktrace here. Just wondering if the GangWorker at #18 was a parallel GC thread.

#0  0x00000035b90325e5 in raise () from /lib64/libc.so.6
#1  0x00000035b9033dc5 in abort () from /lib64/libc.so.6
#2  0x00007febd60813b5 in os::abort(bool) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#3  0x00007febd6223673 in VMError::report_and_die() () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#4  0x00007febd60868bf in JVM_handle_linux_signal () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#5  0x00007febd607ce13 in signalHandler(int, siginfo*, void*) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#6  <signal handler called>
#7  0x00007feb9fcf551c in JNIEnv_::NewByteArray (this=0x7febd001d9f8, len=8) at /usr/java/jdk1.8.0_131/include/jni.h:1643
*<omitted app specific calls>*
#13 <signal handler called>
#14 0x00000035b980b68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#15 0x00007febd607b7e3 in os::PlatformEvent::park() () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#16 0x00007febd603c037 in Monitor::IWait(Thread*, long) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#17 0x00007febd603c956 in Monitor::wait(bool, long, bool) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#18 0x00007febd6244d6b in GangWorker::loop() () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#19 0x00007febd6082568 in java_start(Thread*) () from /usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
#20 0x00000035b9807aa1 in start_thread () from /lib64/libpthread.so.0
#21 0x00000035b90e8aad in clone () from /lib64/libc.so.6
查看更多
登录 后发表回答