Profiling Netty Performance

Published 2019-08-16 15:54

I am writing a Netty application. The application runs on a 64-bit, eight-core Linux machine.

The Netty application is a simple router that accepts requests (incoming pipeline), reads some metadata from the request, and forwards the data to a remote service (outgoing pipeline).

The remote service will return one or more responses on the outgoing pipeline. The Netty application then routes the responses back to the originating client (incoming pipeline).

There will be thousands of clients. There will be dozens of remote services.

I have done some small-scale testing (ten clients, ten remote services), and I am not seeing the sub-10 ms performance I expect at the 99.9th percentile. I am measuring latency from both the client side and the server side.

I am using a fully asynchronous protocol similar to SPDY. I capture the time (using System.nanoTime()) when the first byte is handled in our FrameDecoder. I stop the timer just before we call channel.write(). I am measuring sub-millisecond times (99.9th percentile) from the incoming pipeline to the outgoing pipeline and vice versa.
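The bookkeeping described above can be sketched in plain Java. This is a minimal illustration, not the author's actual code: the class and method names (LatencyProbe, startTimer, stopTimer, percentileMillis) are hypothetical, and the FrameDecoder / channel.write() call sites where these methods would be invoked are assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of the latency measurement described above: start the
// clock on the first byte in the decoder, stop it just before write(),
// and report a high percentile. All names here are illustrative.
public class LatencyProbe {
    private final List<Long> samplesNanos = new ArrayList<>();

    public long startTimer() {
        return System.nanoTime(); // taken when the first byte hits the FrameDecoder
    }

    public void stopTimer(long startNanos) {
        samplesNanos.add(System.nanoTime() - startNanos); // just before channel.write()
    }

    public double percentileMillis(double pct) {
        List<Long> sorted = new ArrayList<>(samplesNanos);
        Collections.sort(sorted);
        int idx = (int) Math.min(sorted.size() - 1,
                Math.ceil(pct / 100.0 * sorted.size()) - 1);
        return sorted.get(Math.max(idx, 0)) / 1_000_000.0;
    }

    public static void main(String[] args) {
        LatencyProbe probe = new LatencyProbe();
        for (int i = 0; i < 1000; i++) {
            long t0 = probe.startTimer();
            // ... request handling would happen here ...
            probe.stopTimer(t0);
        }
        System.out.println("p99.9 = " + probe.percentileMillis(99.9) + " ms");
    }
}
```

Note that, as Answer 3 discusses below, a wall-clock measurement like this also includes any time the handling thread spends preempted.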

I also measured the time from the first byte in the FrameDecoder to when the ChannelFutureListener callback was invoked for the message.write() mentioned above. That time was in the high tens of milliseconds (99.9th percentile), but I have trouble convincing myself that this is useful data.

My initial thought was that we had some slow clients. I watched channel.isWritable() and logged whenever it returned false. This method did not return false under normal conditions.

Some facts:

  • We are using the NIO factory. We have not customized the worker size.
  • We have disabled Nagle's algorithm (tcpNoDelay=true).
  • We have enabled keep-alive (keepAlive=true).
  • CPU is idle 90+% of the time.
  • The network is idle.
  • GC (CMS) is invoked every 100 seconds or so, for a very short amount of time.
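The tcpNoDelay/keepAlive settings in the list above correspond to the standard TCP_NODELAY and SO_KEEPALIVE socket options; Netty's bootstrap options ultimately set the same flags on each channel's socket. A minimal plain-java.net sketch (no Netty dependency, for illustration only):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Demonstrates the two socket options from the facts list using plain
// java.net. Netty's "tcpNoDelay"/"keepAlive" bootstrap options set the
// same TCP_NODELAY and SO_KEEPALIVE flags per channel.
public class SocketOptionsDemo {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0); // ephemeral port
             Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            client.setTcpNoDelay(true);  // disable Nagle's algorithm
            client.setKeepAlive(true);   // enable TCP keep-alive
            System.out.println("tcpNoDelay=" + client.getTcpNoDelay()
                    + " keepAlive=" + client.getKeepAlive());
            // prints "tcpNoDelay=true keepAlive=true"
        }
    }
}
```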

Is there a debugging technique I can follow to determine why my Netty application is not running as fast as I believe it should?

It feels like channel.write() puts the message on a queue, and we (application developers using Netty) have no transparency into this queue. I don't know whether the queue is a Netty queue, an OS queue, a network-card queue, or something else. Anyway, I reviewed examples from existing applications, and I did not see any anti-patterns I was following.

Thanks for any help/insight.

Answer 1:

Netty creates Runtime.getRuntime().availableProcessors() * 2 workers by default — 16 in your case. That means you can handle up to 16 channels simultaneously; other channels will wait until you release the ChannelUpstreamHandler.handleUpstream / SimpleChannelHandler.messageReceived handlers. So do not do heavy jobs in these (IO) threads, or you can stall the other channels.
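The pattern this answer recommends can be sketched with plain java.util.concurrent: the IO thread hands slow work to a separate pool instead of blocking one of the availableProcessors() * 2 workers. (In Netty 3 the same idea is packaged as an ExecutionHandler placed in the pipeline; the class and method names below are illustrative, not Netty APIs.)

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of keeping IO threads free: hand heavy work to a separate
// business pool so none of the availableProcessors() * 2 IO workers
// is blocked, which would stall the other channels it serves.
public class OffloadDemo {
    static final int IO_WORKERS = Runtime.getRuntime().availableProcessors() * 2;
    static final ExecutorService businessPool = Executors.newFixedThreadPool(IO_WORKERS);

    // Would be called from an IO thread: must return quickly.
    static void messageReceived(String message) {
        businessPool.submit(() -> handleSlowly(message)); // heavy work off the IO thread
    }

    static void handleSlowly(String message) {
        // ... database calls, blocking IO, CPU-heavy parsing, etc. ...
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("default IO worker count = " + IO_WORKERS);
        messageReceived("hello");
        businessPool.shutdown();
        businessPool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```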



Answer 2:

You haven't specified your Netty version, but it sounds like Netty 3. Netty 4 is now stable, and I would advise you to update to it as soon as possible. You have specified that you want ultra-low latency, as well as thousands of clients and dozens of services. This doesn't really mix well. NIO is reasonably latent by nature, as opposed to OIO. The pitfall here, however, is that OIO probably will not be able to reach the number of clients you are hoping for. Nonetheless, I would try an OIO event loop/factory and see how it goes.

I myself have a TCP server that takes around 30 ms on localhost to send, receive, and process a few TCP packets (measured from the time the client opens a socket until the server closes it). If you really do need such low latency, I suggest you switch away from TCP, because of the SYN/ACK spam required to open each connection — that is going to use a large part of your 10 ms.
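The handshake cost this answer refers to can be observed directly by timing connect() against a local ServerSocket. On loopback this is microseconds; over a real network each new connection costs at least one round trip for the SYN/SYN-ACK/ACK exchange, which is one argument for keeping connections open (or leaving TCP). A small sketch:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Times the TCP three-way handshake (SYN / SYN-ACK / ACK) that the
// answer above blames for a large part of a 10 ms budget. On loopback
// this is tiny; over a WAN it costs at least one full round trip.
public class ConnectCostDemo {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(0)) { // ephemeral port
            long t0 = System.nanoTime();
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()), 1000);
                long elapsedMicros = (System.nanoTime() - t0) / 1000;
                System.out.println("connect took " + elapsedMicros + " µs on loopback");
            }
        }
    }
}
```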



Answer 3:

Measuring time in a multi-threaded environment is very difficult if you are using simple things like System.nanoTime(). Imagine the following on a 1 core system:

  1. Thread A is woken up and begins processing the incoming request.
  2. Thread B is woken up and begins processing the incoming request. But since we are working on a 1 core machine, this ultimately requires that Thread A is put on pause.
  3. Thread B is done and appears to have performed very fast.
  4. Thread A resumes and finishes, but appears to have taken twice as long as Thread B — because you actually measured the time it took to run Thread A plus Thread B.
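The effect in steps 1–4 above can be made visible by comparing System.nanoTime() (wall-clock time, which includes time spent preempted by other threads) with the JVM's per-thread CPU time (only the time the thread actually ran). A sketch, assuming the JVM/OS supports per-thread CPU time measurement (getCurrentThreadCpuTime returns -1 where it is unsupported):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Illustrates the measurement pitfall described above: nanoTime()
// measures wall-clock time (including time spent paused while another
// thread runs), while per-thread CPU time counts only time actually
// spent executing this thread. Under contention, wall >> cpu.
public class WallVsCpuTime {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = mx.getCurrentThreadCpuTime();

        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) sum += i; // stand-in workload

        long wall = System.nanoTime() - wallStart;
        long cpu = mx.getCurrentThreadCpuTime() - cpuStart;
        System.out.println("wall=" + wall + " ns, cpu=" + cpu + " ns (sum=" + sum + ")");
    }
}
```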

There are two approaches on how to measure correctly in this case:

  1. You can enforce that only one thread is used at all times.
    This allows you to measure the exact performance of the operation, if the OS does not interfere. Because in the above example Thread B can be outside of your program as well. A common approach in this case is to median out the interference, which will give you an estimation of the speed of your code.
    You can however assume, that on an otherwise idle multi-core system, there will be another core to process background tasks, so your measurement will usually not be interrupted. Setting this thread to high priority helps as well.

  2. You use a more sophisticated tool that plugs into the JVM to actually measure the atomic executions and time it took for those, which will effectively remove outside interference almost completely. One tool would be VisualVM, which is already integrated in NetBeans and available as a plugin for Eclipse.

As a general advice: it is not a good idea to use more threads than cores, unless you know that those threads will be blocked by some operation frequently. This is not the case when using non-blocking NIO for IO-operations as there is no blocking.

Therefore, in your special case, you would actually reduce the performance for clients, as explained above, because communication would be put on hold up to 50% of the time under high load. In worst case, that could cause a client to even run into a timeout, as there is no guarantee when a thread is actually resumed (unless you explicitly request fair scheduling).



Source: Profiling Netty Performance
Tags: java linux netty