Debugging JBoss 100% CPU usage

Posted 2020-05-28 23:48

Originally posted on Server Fault, where it was suggested this question might be better asked here.

We are using JBoss to run two of our WARs. One is our web app, the other is our web service. The web app accesses a database on another machine and makes requests to the web service. The web service makes JMS requests to other machines, aggregates the data, and returns it.

At our biggest client, about once a month the JBoss Java process takes 100% of all CPUs. The machine running JBoss has 8 CPUs. Our web app is still accessible during this time; however, pages take about 3 minutes to load. Restarting JBoss restores everything to normal.

The database machine and all the other machines are fine, only the machine running JBoss is affected. Memory usage is normal. Network utilization is normal. There are no suspect error messages in the JBoss logs.

I have set up a test environment as close as possible to the client's production environment and I've done load testing with as much as 2x the number of concurrent users. I have not gotten my test environment to replicate the problem.

Where do we go from here? How can we narrow down the problem?

Currently the only plan we have is to wait until the problem occurs in production on its own, then do some debugging to determine the cause. So far people have just restarted JBoss when the problem occurred to minimize downtime. Next time it happens they will get a developer to take a look. The question is: next time it happens, what can be done to determine the cause?

We could set up a separate JBoss instance on the same box and install the web app separately from the web service. That way, when the problem next occurs, we will know which WAR has the problem (assuming it is our code). This doesn't narrow it down much, though.
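If we went that route, one way to do it (a rough sketch only, assuming a JBoss AS/EAP 5 layout; the configuration names "webapp" and "webservice" and the ports-01 binding set are assumptions to adapt) would be to clone the default server configuration per WAR and start each instance on its own port set:

    # Sketch only: clone the default configuration once per WAR
    cd $JBOSS_HOME/server
    cp -r default webapp
    cp -r default webservice
    cp /path/to/webapp.war webapp/deploy/
    cp /path/to/webservice.war webservice/deploy/

    # Start the two instances; the second is shifted to the ports-01 binding set
    # (Service Binding Manager) so both can run on the same box
    $JBOSS_HOME/bin/run.sh -c webapp &
    $JBOSS_HOME/bin/run.sh -c webservice -Djboss.service.binding.set=ports-01 &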

Should I enable JMX remote? That way, the next time the problem occurs I can connect with VisualVM and see which threads are taking the CPU and what the hell they are doing. However, is there a significant downside to enabling JMX remote in a production environment?
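For what it's worth, remote JMX is normally enabled with a handful of standard JVM system properties added to JAVA_OPTS (e.g. in run.conf). A minimal sketch; the port number and the password-file path are placeholders, and in production you would want authentication and/or SSL (or at least a firewall rule) around it:

    # Sketch: enable the remote JMX agent so VisualVM/JConsole can attach
    JAVA_OPTS="$JAVA_OPTS \
      -Dcom.sun.management.jmxremote \
      -Dcom.sun.management.jmxremote.port=9010 \
      -Dcom.sun.management.jmxremote.ssl=false \
      -Dcom.sun.management.jmxremote.authenticate=true \
      -Dcom.sun.management.jmxremote.password.file=/opt/jboss/conf/jmxremote.password"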

Is there another way to see what threads are eating the CPU and to get a stacktrace to see what they are doing?

Any other ideas?

Thanks!

4 Answers
Bombasti · 2020-05-29 00:33

If you are using JBoss 5.1.0 EAP, there is a known bug in JBoss for which they also have a fix: https://issues.jboss.org/browse/JBPAPP-5193

你好瞎i · 2020-05-29 00:37

There's a quick and dirty way of identifying which threads are using up the CPU time on JBoss. Go to the JMX Console with a browser (usually at http://localhost:8080/jmx-console, but it may be different for you), look for a bean called ServerInfo; it has an operation called listThreadCpuUtilization which dumps the actual CPU time used by each active thread in a nice tabular format. If there's one misbehaving, it usually stands out like a sore thumb.

There's also the listThreadDump operation which dumps the stack for every thread to the browser.

Not as good as a profiler, but a much easier way to get the basic information. For production servers, where it's often bad news to connect a profiler, it's very handy.
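If hitting the browser during an incident is awkward, the same MBean can, if I remember the object name correctly (verify it in your own jmx-console), be invoked from the command line with the twiddle utility that ships in JBoss's bin directory, which also makes it easy to capture the output to a file:

    # Sketch: query the ServerInfo MBean from the shell instead of the browser
    cd $JBOSS_HOME/bin
    ./twiddle.sh invoke "jboss.system:type=ServerInfo" listThreadCpuUtilization > /tmp/thread-cpu.txt
    ./twiddle.sh invoke "jboss.system:type=ServerInfo" listThreadDump > /tmp/thread-dump.html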

爷的心禁止访问 · 2020-05-29 00:45

I think you should definitely keep trying to set up a test environment with load testing that reproduces the issue. Profiling there would help pinpoint the problem.

A quick first step would be, next time it happens, to send the JBoss process a kill -3 to get a thread dump to analyze (it doesn't actually kill the JVM). The second thing I would check is that you are running with the -server flag and that your GC settings are sane. You could also run dstat to see what the process is doing during the lockup. But again, it is probably safer to set up a load-testing environment (via EC2 or similar) to reproduce this.
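To make the kill -3 part concrete, a small sketch (with the stock run.sh the dump lands in the console output rather than server.log; the process pattern and GC flags below are illustrative, not recommendations):

    # Find the JBoss JVM and take a few thread dumps a few seconds apart,
    # so the threads that stay busy across dumps stand out
    PID=$(pgrep -f 'org.jboss.Main')   # adjust the pattern to your setup
    kill -3 "$PID"; sleep 5
    kill -3 "$PID"; sleep 5
    kill -3 "$PID"

    # Illustrative flags to sanity-check in JAVA_OPTS:
    # -server -Xms2g -Xmx2g -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/jboss-gc.log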

太酷不给撩 · 2020-05-29 00:50

This typically happens with runaway code or unsafe concurrent access to HashMaps. A simple thread dump (kill -3, as @disown says, or Ctrl-Break in a Windows console) will reveal this problem.
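As a quick check for that particular failure mode, the thread dump can be grepped for HashMap frames; a cluster of RUNNABLE threads sitting in HashMap.get/put (or in the resize code) is the classic signature. A rough sketch, assuming the dump was saved to threads.dump:

    # How many threads are currently inside HashMap code?
    grep -c 'at java.util.HashMap' threads.dump
    # Which threads are they? (thread header lines start with a double quote)
    grep -B 10 'at java.util.HashMap' threads.dump | grep '^"'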

Since you're unable to reproduce it using tests I think it smells like a concurrency issue; it's usually hard to make test scripts behave sufficiently random to catch issues of this type.

I normally try to make it standard operating procedure to do thread-dumps of any JVM that is restarted due to operational anomalies, and it's really a requirement to catch those once-a-month things.
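In that spirit, here is a sketch of what such a procedure could look like as a script for the operations team, so the evidence is captured before the restart (the log path and process pattern are assumptions to adapt):

    #!/bin/sh
    # Collect diagnostics from a misbehaving JBoss before restarting it
    OUT=/var/tmp/jboss-incident-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$OUT"
    PID=$(pgrep -f 'org.jboss.Main')

    top -b -H -n 1 -p "$PID" > "$OUT/top-threads.txt"  # per-thread CPU usage
    kill -3 "$PID"                                     # thread dump -> console log
    sleep 10
    kill -3 "$PID"                                     # second dump for comparison
    cp /var/log/jboss/console.log "$OUT/" 2>/dev/null  # wherever run.sh output is redirected

    # Busy TIDs from top-threads.txt can be matched to the dump:
    # printf '%x\n' <TID> gives the hex value in each thread's "nid=0x..." field
    echo "Diagnostics saved in $OUT; safe to restart JBoss now."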
