Process execution tracing tools

Posted 2020-07-25 03:46

Question:

I am investigating a very peculiar problem on our lab servers. Whenever we run a Java program on a machine with a 64-bit SUSE SLES11 installation that is accessed through Citrix, it just hangs. The machine has the latest updates, but that doesn't help. If any one of these circumstances changes, it works: a 32-bit OS, SLES10.2, or access via Cygwin/Exceed. Other X applications such as xclock work fine.

This might look like a ServerFault question so far, but what I'm actually looking for is suggestions for software I can use to trace what this program is actually doing. It hangs on a "FUTEX_WAIT" (found by using strace):

futex(0x7f4e3eaab9e0, FUTEX_WAIT, 19686, NULL

The trace stops right after the NULL and stays there indefinitely. I have found a previous bug report that looks a little similar to this problem, but the circumstances are very different.

UPDATE: Apparently, futex_wait problems can be a sign of strange race conditions in the kernel/libc locking up processes. I will have to try a newer kernel/libc and see whether either makes any difference.

UPDATE 2: Kernel/libc changes made no difference. I did manage to start jvisualvm with a predictable external JMX port, let it hang, and connect to it from another machine, at which point I found this in the thread trace for main:

Name: main
State: RUNNABLE
Total blocked: 0  Total waited: 0

Stack trace: 
sun.awt.X11GraphicsDevice.getDoubleBufferVisuals(Native Method)
sun.awt.X11GraphicsDevice.makeDefaultConfiguration(X11GraphicsDevice.java:208)
sun.awt.X11GraphicsDevice.getDefaultConfiguration(X11GraphicsDevice.java:182)
   - locked java.lang.Object@1c190c99
sun.awt.X11.XToolkit.<clinit>(XToolkit.java:92)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:169)
java.awt.Toolkit$2.run(Toolkit.java:834)
java.security.AccessController.doPrivileged(Native Method)
java.awt.Toolkit.getDefaultToolkit(Toolkit.java:826)
   - locked java.lang.Class@308a1f38
org.openide.util.ImageUtilities.ensureLoaded(ImageUtilities.java:519)
org.openide.util.ImageUtilities.access$200(ImageUtilities.java:80)
org.openide.util.ImageUtilities$ToolTipImage.createNew(ImageUtilities.java:699)
org.openide.util.ImageUtilities.getIcon(ImageUtilities.java:487)
   - locked java.util.HashMap@3c07ae6d
org.openide.util.ImageUtilities.getIcon(ImageUtilities.java:361)
   - locked java.util.HashMap@1c4c94e5
org.openide.util.ImageUtilities.loadImage(ImageUtilities.java:139)
org.netbeans.core.startup.Splash.loadContent(Splash.java:262)
org.netbeans.core.startup.Splash$SplashComponent.<init>(Splash.java:344)
org.netbeans.core.startup.Splash.<init>(Splash.java:170)
org.netbeans.core.startup.Splash.getInstance(Splash.java:102)
org.netbeans.core.startup.Main.start(Main.java:301)
org.netbeans.core.startup.TopThreadGroup.run(TopThreadGroup.java:110)
java.lang.Thread.run(Thread.java:619)

I tried the deadlock detection button in jvisualvm, but it found no deadlocks.

I am currently talking to Citrix Europe about this problem and delivering traces to them. I will update this question if it gets solved.
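The same kind of check can be run from code through the standard `java.lang.management` API. The following is only a minimal sketch; the class name `DeadlockCheck` is made up for illustration and is not what jvisualvm itself runs. Note that this API reports Java-level lock cycles only. A thread that is stuck inside native code, like `main` in the trace above, is reported as RUNNABLE and will not show up here, which would be consistent with no deadlock being found.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper class, named for illustration only.
public class DeadlockCheck {

    /** Returns a text report of deadlocked threads, or "" if the JVM finds none. */
    public static String report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Returns null when no threads are deadlocked on monitors or ownable synchronizers.
        long[] ids = threads.findDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            sb.append(info.getThreadName())
              .append(" (").append(info.getThreadState()).append(")")
              .append(" waiting on ").append(info.getLockName())
              .append(" held by ").append(info.getLockOwnerName())
              .append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String report = report();
        System.out.println(report.isEmpty() ? "No Java-level deadlock detected." : report);
    }
}
```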

UPDATE 3: This problem has been traced to Citrix and has been submitted with service request number 60235154. At the moment it seems the problem is either somewhere in Java or in the Citrix implementation of X11.

Answer 1:

ltrace traces shared-library function calls. That can give you a higher-level view of things. But it can also spew tons more output than strace, since many library functions (e.g. strcmp) don't result in system calls.

But futex is used for locking, so if you get stuck at futex, you have probably deadlocked. Or you're just looking at one thread which is waiting for other threads. With -f, ltrace and strace follow clone/fork to trace all threads and all child processes.

In gdb, thread apply all <command> is sometimes useful for multithreaded processes, e.g. thread apply all bt



Answer 2:

Do you have source code for the Java program? If so, you can remotely debug it using Eclipse or another IDE. If you don't have source code, your options are more limited, but you can try connecting to the process via JConsole to gain some insight into what's happening. Java profiling tools are another option, but harder to set up.
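JConsole talks to the target JVM over JMX, and the same connection can be made from a small client when a GUI is inconvenient. The sketch below assumes the target JVM was started with remote JMX enabled (for example with `-Dcom.sun.management.jmxremote.port=9010`, plus authentication and SSL settings appropriate for your network). The class name `RemoteThreadDump`, the default host, and the port 9010 are placeholders, not values from the question.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.net.MalformedURLException;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical client, named for illustration only.
public class RemoteThreadDump {

    /** Builds the standard RMI-based JMX service URL for host:port. */
    public static JMXServiceURL serviceUrl(String host, int port) throws MalformedURLException {
        return new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }

    public static void main(String[] args) throws Exception {
        // Placeholder defaults; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9010;

        JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port));
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Proxy for the remote JVM's thread MXBean.
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.println(info.getThreadName() + " " + info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        } finally {
            connector.close();
        }
    }
}
```

Run it as `java RemoteThreadDump <host> <port>` from another machine while the target process is hung.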



Answer 3:

Maybe jvisualvm, which comes with the JDK from Sun, has what you need. You can record the state of the virtual machine as your program is running and also tell it to save any stack dumps to a file that you can open and look at later. Look for jvisualvm in the bin directory of your JDK. You can find more documentation here: http://java.sun.com/javase/6/docs/technotes/tools/share/jvisualvm.html

Good luck!



Answer 4:

Use gdb to attach to the process. gdb isn't exactly intuitive, but there are a lot of how-tos and similar guides on the net.

http://dirac.org/linux/gdb/06-Debugging_A_Running_Process.php



Answer 5:

See this solution I have found.

In this case the hangs were caused by slow generation of random bytes from /dev/random.

The Java application waits a very long time to get random bytes.

This is not really a solution but rather a workaround, since /dev/random will become the same as /dev/urandom.
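To check whether slow random-number seeding is involved on a given machine, one option is to time a seed request with a timeout. This is only a rough probe under assumptions: whether `SecureRandom.generateSeed` actually reads from /dev/random depends on the JDK version and its `securerandom.source` setting, and the class name `EntropyProbe` and the 5-second limit are arbitrary choices for illustration.

```java
import java.security.SecureRandom;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical probe, named for illustration only.
public class EntropyProbe {

    /** Returns the milliseconds taken to generate a seed, or -1 if it exceeded timeoutMs. */
    public static long timeSeed(final int numBytes, long timeoutMs) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor(new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "entropy-probe");
                t.setDaemon(true); // a blocked read must not keep the JVM alive
                return t;
            }
        });
        Callable<byte[]> task = new Callable<byte[]>() {
            public byte[] call() {
                // May block if the configured seed source is starved of entropy.
                return new SecureRandom().generateSeed(numBytes);
            }
        };
        long start = System.nanoTime();
        Future<byte[]> seed = pool.submit(task);
        try {
            seed.get(timeoutMs, TimeUnit.MILLISECONDS);
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (TimeoutException e) {
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        long ms = timeSeed(16, 5000);
        System.out.println(ms < 0
                ? "Seeding did not finish within 5 s; entropy starvation is plausible."
                : "Seeding took " + ms + " ms.");
    }
}
```

If the probe times out, a JVM-level alternative to changing the device files is to point the JVM at the non-blocking source with `-Djava.security.egd=file:/dev/./urandom`, though whether that is acceptable depends on your security requirements.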



Tags: java linux trace