I'm facing extremely slow performance of methods:
HttpRequestDecoder.unfoldAndFireMessageReceived()
and
Future$PromiseCompletingRunnable.run()
This two methods use around half of the time of every transaction in the server. It happens under low troughtput and during high usage hours.
For example, at 1am with only me doing requests to the aplication I get graphs in new relic like these:
In this transaction only those two methods consume 1 full second, even slower that access the database via hibernate! Once again, only one user in the application.
If the transaction is heavier it takes even more time:
In this case, this two methods consume 2.5 seconds average when my own code consumes 1.5 seconds giving a total of 4 seconds. I thougth at that moment that maybe this was only a misleading of new relic metric. Maybe newrelic was showing this methods names but it was really code written by me. So I decided to get a custom metric like this:
playController(){
//Start timer
//do the job
//stop the timer() and send metric to new relic
//return;
}
And the result was that my code was taking 1.5 seconds. So it is really play request handler who is consuming this time.
This behaivour is killing my app when having a high load. This two methods can consume up to 20 seconds when the throghput is around 500 request per minute (not really a high throughput!) but my code keeps steady at max 3 seconds.
I really don't think this is a thread problem, because it even happends when there's a single user, but it become really problematic when having many concurrent requests. I tried changing the number of threads for "synchronous apps" like documentation mention but I didn't get any performance change, it even got worse.
I am really concerned about this problem because there's a similar case in the mailing lists of play with more than two years and no answer!:
http://grokbase.com/t/gg/play-framework/159bzf7r9p/help-to-understand-newrelic-report-for-slow-transactions-2-1-4
There's even a question that is similar in StackOverflow but for play 2.1 without answer and no aparent activity:
Slow transactions in NewRelic having Play Framework as backend
Any ideas what can be causing this behavior?
So after a month I finally can say this issue is solved. And the answer is that there's no issue at all. New relic default instrumentation does not report correctly the time consumed by Play Framework 2 transactions and I could even say by any async framework that runs over Netty.
To get to that conclusion I had to include some custom metrics to the most problematic transactions just to discover that my custom instrumentation uses way less time than the reported by new relic.
After that I tested in the client using firebug and the times reported match my custom metrics.
Just a week ago I discovered this post in newrelic forums:
https://docs.newrelic.com/docs/agents/java-agent/frameworks/disable-scala-netty-akka-play-2-instrumentation
And after disabling all the instrumentation for netty, akka and play with this lines in newrelic configuration file I finally started to get realistic times by the default instrumentation:
common: &default_settings
class_transformer:
# Disable all Akka instrumentations
com.newrelic.instrumentation.akka-2.0:
enabled: false
com.newrelic.instrumentation.akka-2.1:
enabled: false
com.newrelic.instrumentation.akka-2.2:
enabled: false
# Disable all Netty instrumentations
com.newrelic.instrumentation.netty-3.4:
enabled: false
com.newrelic.instrumentation.netty-3.8:
enabled: false
com.newrelic.instrumentation.netty-4.0.0:
enabled: false
com.newrelic.instrumentation.netty-4.0.8:
enabled: false
# Disable all Play 2 instrumentations
com.newrelic.instrumentation.play-2.1:
enabled: false
com.newrelic.instrumentation.play-2.2:
enabled: false
com.newrelic.instrumentation.play-2.3:
enabled: false
# New in Release 3.22, the Play 2.4 instrumentation does not respect
# the older play2_instrumentation configuration setting
com.newrelic.instrumentation.play-2.4:
enabled: false
# Disable all Scala-language instrumentations
com.newrelic.instrumentation.scala-2.9.3:
enabled: false
In newrelic documentations says:
You may choose to disable some or all of this instrumentation if you
find that the metrics reported aren't valuable to you, or if the
instrumentation incurs more overhead than you would like. If you
selectively disable some of the instrumentation, some segments of
activity will not be reported and your total time will be understated.
But IMHO it should say:
You may choose to disable ALL of this instrumentation if you want to
get realistic metrics.
Why this behavior? I can only guess that Play and Netty reuses some threads in a pool by many transactions and newrelic agent is not able to separate correctly which time is consumed by the database and by netty, duplicating and sometimes triplicating the real times consumed by the application.
This problem mislead my team (and the sponsors of the project) drastically. I don't fully blame Newrelic, this tool is helpful but this left me as a lesson to don't trust only in one tool.