I've had a question open on here about improving the performance of my 30 GB workflow application in XPages. There were lots of suggestions, but most involved recycling, improving code, etc., and what actually fixed the speed issues is not often talked about: the Advanced tab in the application properties (see my last post).
Now I have an application that runs really well; it is fast and people are happy, BUT the server still periodically crashes. Or I should say, HTTP becomes unresponsive and in extreme cases runs at 100% CPU, so Domino is also sluggish but still running.
I've been monitoring the HTTP threads with
tell http show thread state
In most cases I see 80 HTTP threads that are idle, or doing something but quickly released. Since the last update to the application, where we have been far more focused on recycling the Notes objects in SSJS, I thought we would see the end of the hung HTTP threads, but they are still there.
I'm almost positive that it's not an infinite loop that's causing this problem because 2 cases that I have confirmed with the end users are completely different and there are no loops as far as I can tell.
A user is editing a document and presses a workflow button to approve it and send it on to the next user. They are using Chrome. The spinning circle on the Chrome tab starts; the server is then supposed to run the workflow agent, send emails, and then close the page in the browser. I noticed that there were 2 or 3 hung HTTP threads that hadn't been released after an hour, so I contacted the user. She told me the page hadn't refreshed but the circle was still spinning in Chrome, suggesting the server was doing something. I checked the logs and the workflow agent HAS run, emails have been sent and the document is updated. She refreshed the page and can now see that it's been updated, but for whatever reason Chrome sat there waiting patiently and never received the message that the LS agent had run. I use notesAgent.runOnServer and return the resulting integer to confirm whether the agent has run or not. If it returns 1 (I think) then the page is supposed to close, otherwise it should display a message. The page never refreshed so it didn't display anything, but the agent did complete.
A user in the evening ended up with about 15 hung HTTP threads. In the logs I could see she was trying to reload the page multiple times. Then there was a search for the document she wanted, and then more attempts to open it. When I checked, she said she searched for the document, the search page showed the results (in a repeat control), and every time she clicked the document to open it nothing happened. So she didn't even get into the document, yet the threads were hanging after each attempt. I got the URL from the Notes log and tried it; the document opens fine. I ran the same search; the document opens fine. I sent her a link to the document directly and it opens fine for her. Weird!!
Is there ANY way to diagnose this sort of behaviour? Right now I have to keep Domino Administrator open and run the tell http show.... command all day, keeping an eye on it to make sure threads aren't hanging. It usually gets to lunch time and the server needs a reboot, which is rubbish.
Please help my sanity :-)
There are two actions I strongly recommend:
And use Frantisek's insights
Look for locked threads. As I said before, get more information; in your case javadumps will help. Issue the server command
tell http dump java core
(a button in the browser won't help when HTTP is frozen :-)). This generates a javacore file; use IBM Thread and Monitor Dump Analyzer for Java to see the state of every thread. What you will (probably) find is that threads are waiting for some Notes API call. Post sample stack traces of the hung threads, please.
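If the analyzer is not to hand, a javacore can also be skimmed with a small script. The following is only a rough sketch in Python: it assumes the usual IBM J9 javacore layout, where each thread starts with a 3XMTHREADINFO line carrying the quoted thread name and a state: field, and each Java stack frame is a 4XESTACKTRACE line. The function names parse_threads and state_counts are made up for this example, and the line layout should be checked against a real dump from your own server.

```python
import re
from collections import Counter

# Assumed javacore layout (IBM J9): one 3XMTHREADINFO line per thread with
# the quoted thread name and a "state:XX" field, followed by zero or more
# 4XESTACKTRACE lines, one per Java stack frame.
THREAD_RE = re.compile(r'^3XMTHREADINFO\s+"(?P<name>[^"]*)".*\bstate:(?P<state>\w+)')
FRAME_RE = re.compile(r'^4XESTACKTRACE\s+at\s+(?P<frame>.+?)\s*$')


def parse_threads(javacore_text):
    """Return one dict per thread: name, state, and top Java stack frame."""
    threads = []
    current = None
    for line in javacore_text.splitlines():
        thread_match = THREAD_RE.match(line)
        if thread_match:
            current = {
                "name": thread_match.group("name"),
                "state": thread_match.group("state"),
                "top_frame": None,  # filled in by the first frame line, if any
            }
            threads.append(current)
            continue
        frame_match = FRAME_RE.match(line)
        if frame_match and current is not None and current["top_frame"] is None:
            current["top_frame"] = frame_match.group("frame")
    return threads


def state_counts(threads):
    """Count how many threads are in each state (for example R or CW)."""
    return Counter(thread["state"] for thread in threads)
```

Reading the dump file into a string, passing it to parse_threads, and printing the name and top frame of every thread that is not in state R gives a quick list of what the waiting threads are blocked on. It does not replace the analyzer, but it makes comparing several dumps easier.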
I used some of the Java dumps to look for memory leaks, and while I found a ton of information in there, not really knowing what I was looking for meant I reached no conclusions.
Purely by chance I needed to work with a document in the system that randomly causes these thread locks, and I manually ran the workflow agent that gets called when these documents are processed. It took me a minute to realise what was going on, but the agent appeared to be getting stuck in a loop when trying to generate a MIME email based on the content of the document.
When a scheduled agent gets stuck in an infinite loop in Domino it's easy to spot just by looking at Agent Manager in the Admin Client. You will see it constantly consuming CPU and never finishing.
When an XPage calls an agent that gets stuck, however, there is no clue (at least in my case) that Agent Manager is running, and the HTTP server task doesn't show that it's doing anything out of the ordinary. This is why I initially thought there were no infinite loops, but I was completely wrong!
I added some code to count the number of iterations reached in the MIME email generator routine and added a break if the count reached some arbitrarily high value, indicating it was stuck in a loop. Et voilà! No more hung HTTP threads!
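For anyone wanting to do the same, the guard looks roughly like this. It is a language-neutral sketch written in Python, not the actual LotusScript from the agent; walk_parts, get_next and MAX_PARTS are hypothetical names, and the limit of 1000 is just an arbitrary ceiling.

```python
MAX_PARTS = 1000  # arbitrary ceiling; pick something far above any real document


def walk_parts(first, get_next, limit=MAX_PARTS):
    """Walk a chain of items, giving up once `limit` items have been visited.

    `get_next` is a hypothetical callback returning the next item, or None
    at the end of the chain. Returns (items_visited, completed); `completed`
    is False when the guard fired, which points to a loop that would
    otherwise never end.
    """
    parts = []
    current = first
    while current is not None:
        if len(parts) >= limit:
            # Guard tripped: stop instead of holding the calling thread forever.
            return parts, False
        parts.append(current)
        current = get_next(current)
    return parts, True
```

When the second return value is False, log the document ID and bail out of the routine, so the agent returns and the documents that trigger the loop can be found later.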
This was a great excuse to go through the entire system, fix some of the old (poor) code in there, and tidy everything up. Ultimately, though, it was the agent, not the XPage, that was causing the issue. Thanks to everyone for the suggestions.