As a part of my research I'm writing an high-load TCP/IP echo server in Java. I want to serve about 3-4k of clients and see the maximum possible messages per second that I can squeeze out of it. Message size is quite small - up to 100 bytes. This work doesn't have any practical purpose - only a research.
According to numerous presentations that I've seen (HornetQ benchmarks, LMAX Disruptor talks, etc), real-world high-load systems tend to serve millions of transactions per second (I believe Disruptor mentioned about 6 mils and and Hornet - 8.5). For example, this post states that it possible to achieve up to 40M MPS. So I took it as a rough estimate of what should modern hardware be capable of.
I wrote simplest single-threaded NIO server and launched a load test. I was little surprised that I can get only about 100k MPS on localhost and 25k with actual networking. Numbers look quite small. I was testing on Win7 x64, core i7. Looking at CPU load - only one core is busy (which is expected on a single-threaded app), while the rest sit idle. However even if I load all 8 cores (including virtual) I will have no more than 800k MPS - not even close to 40 millions :)
My question is: what is a typical pattern for serving massive amounts of messages to clients? Should I distribute networking load over several different sockets inside a single JVM and use some sort of load balancer like HAProxy to distribute load to multiple cores? Or I should look towards using multiple Selectors in my NIO code? Or maybe even distribute the load between multiple JVMs and use Chronicle to build an inter-process communication between them? Will testing on a proper serverside OS like CentOS make a big difference (maybe it is Windows that slows things down)?
Below is the sample code of my server. It always answers with "ok" to any incoming data. I know that in real world I'd need to track the size of the message and be prepared that one message might be split between multiple reads however I'd like to keep things super-simple for now.
public class EchoServer {
private static final int BUFFER_SIZE = 1024;
private final static int DEFAULT_PORT = 9090;
// The buffer into which we'll read data when it's available
private ByteBuffer readBuffer = ByteBuffer.allocate(BUFFER_SIZE);
private InetAddress hostAddress = null;
private int port;
private Selector selector;
private long loopTime;
private long numMessages = 0;
public EchoServer() throws IOException {
this(DEFAULT_PORT);
}
public EchoServer(int port) throws IOException {
this.port = port;
selector = initSelector();
loop();
}
private void loop() {
while (true) {
try{
selector.select();
Iterator<SelectionKey> selectedKeys = selector.selectedKeys().iterator();
while (selectedKeys.hasNext()) {
SelectionKey key = selectedKeys.next();
selectedKeys.remove();
if (!key.isValid()) {
continue;
}
// Check what event is available and deal with it
if (key.isAcceptable()) {
accept(key);
} else if (key.isReadable()) {
read(key);
} else if (key.isWritable()) {
write(key);
}
}
} catch (Exception e) {
e.printStackTrace();
System.exit(1);
}
}
}
private void accept(SelectionKey key) throws IOException {
ServerSocketChannel serverSocketChannel = (ServerSocketChannel) key.channel();
SocketChannel socketChannel = serverSocketChannel.accept();
socketChannel.configureBlocking(false);
socketChannel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
socketChannel.setOption(StandardSocketOptions.TCP_NODELAY, true);
socketChannel.register(selector, SelectionKey.OP_READ);
System.out.println("Client is connected");
}
private void read(SelectionKey key) throws IOException {
SocketChannel socketChannel = (SocketChannel) key.channel();
// Clear out our read buffer so it's ready for new data
readBuffer.clear();
// Attempt to read off the channel
int numRead;
try {
numRead = socketChannel.read(readBuffer);
} catch (IOException e) {
key.cancel();
socketChannel.close();
System.out.println("Forceful shutdown");
return;
}
if (numRead == -1) {
System.out.println("Graceful shutdown");
key.channel().close();
key.cancel();
return;
}
socketChannel.register(selector, SelectionKey.OP_WRITE);
numMessages++;
if (numMessages%100000 == 0) {
long elapsed = System.currentTimeMillis() - loopTime;
loopTime = System.currentTimeMillis();
System.out.println(elapsed);
}
}
private void write(SelectionKey key) throws IOException {
SocketChannel socketChannel = (SocketChannel) key.channel();
ByteBuffer dummyResponse = ByteBuffer.wrap("ok".getBytes("UTF-8"));
socketChannel.write(dummyResponse);
if (dummyResponse.remaining() > 0) {
System.err.print("Filled UP");
}
key.interestOps(SelectionKey.OP_READ);
}
private Selector initSelector() throws IOException {
Selector socketSelector = SelectorProvider.provider().openSelector();
ServerSocketChannel serverChannel = ServerSocketChannel.open();
serverChannel.configureBlocking(false);
InetSocketAddress isa = new InetSocketAddress(hostAddress, port);
serverChannel.socket().bind(isa);
serverChannel.register(socketSelector, SelectionKey.OP_ACCEPT);
return socketSelector;
}
public static void main(String[] args) throws IOException {
System.out.println("Starting echo server");
new EchoServer();
}
}
You will acheive tops a few hundred thousand requests per second with regular hardware. At least that is my experience trying to build similar solutions, and the Tech Empower Web Frameworks Benchmark seems to agree as well.
The best approach, generally, depends on whether you have io-bound or cpu-bound loads.
For io-bound loads (high latency), you need to do async io with many threads. For best performance you should try to void handoffs between threads as much as possible. So having a dedicated selector thread and another threadpool for processing is slower than having a threadpool where every thread does either selection or processing, so that that a request gets handled by a single thread in the best case (if io is immediately available). This type of setup is more complicated to code but fast and I don't believe that any async web framework exploits this fully.
For cpu-bound loads one thread per request is usually the fastest, since you avoid context switches.
There are many possible patterns: An easy way to utilize all cores without going through multiple jvms is:
That's the gist of it. There are many more possibilities here and the answer really depends on the type of application you are writing. A few examples are:
Stateful applications which require moderate amounts of processing e.g. a typical business application: Here every client has some state that determines how each request is handled. Assuming we go multi-threaded since the processing is non-trivial, we could affinitize clients to certain threads. This is a variant of the actor architecture:
i) When a client first connects hash it to a worker. You might want to do this with some client id, so that if it disconnects and reconnects it is still assigned to the same worker/actor.
ii) When the reader thread reads a complete request put it on the ring-buffer for the right worker/actor. Since the same worker always processes a particular client all the state should be thread local making all the processing logic simple and single-threaded.
iii) The worker thread can write requests out. Always attempt to just do a write(). If all your data could not be written out only then do you register for OP_WRITE. The worker thread only needs to make select calls if there is actually something outstanding. Most writes should just succeed making this unnecessary. The trick here is balancing between select calls and polling the ring buffer for more requests. You could also employ a single writer thread whose only responsibility is to write requests out. Each worker thread can put it's responses on a ring buffer connecting it to this single writer thread. The single writer thread round-robin polls each incoming ring-buffer and writes out the data to clients. Again the caveat about trying write before select applies as does the trick about balancing between multiple ring buffers and select calls.
As you point out there are many other options:
Should I distribute networking load over several different sockets inside a single JVM and use some sort of load balancer like HAProxy to distribute load to multiple cores?
You can do this, but IMHO this is not the best use for a load balancer. This does buy you independent JVMs that can fail on their own but will probably be slower than writing a single JVM app that is multi-threaded. The application itself might be easier to write though since it will be single threaded.
You can do this too. Look at Ngnix architecture for some hints on how to do this.
Or maybe even distribute the load between multiple JVMs and use Chronicle to build an inter-process communication between them?
This is also an option. Chronicle gives you an advantage that memory mapped files are more resilient to a process quitting in the middle. You still get plenty of performance since all communication is done through shared memory.I don't know about this. Unlikely. If Java uses the native Windows APIs to the fullest, it shouldn't matter as much. I am highly doubtful of the 40 million transactions/sec figure (without a user space networking stack + UDP) but the architectures I listed should do pretty well.
These architectures tend to do well since they are single-writer architectures that use bounded array based data structures for inter thread communication. Determine if multi-threaded is even the answer. In many cases it is not needed and can lead to slowdown.
Another area to look into is memory allocation schemes. Specifically the strategy to allocate and reuse buffers could lead to significant benefits. The right buffer reuse strategy is dependent on application. Look at schemes like buddy-memory allocation, arena allocation etc to see if they can benefit you. The JVM GC does plenty fine for most work loads though so always measure before you go down this route.
Protocol design has a big effect on performance too. I tend to prefer length prefixed protocols because they let you allocate buffers of right sizes avoiding lists of buffers and/or buffer merging. Length prefixed protocols also make it easy to decide when to handover a request - just check
num bytes == expected
. The actual parsing can be done by the workers thread. Serialization and deserialization extends beyond length-prefixed protocols. Patterns like flyweight patterns over buffers instead of allocations helps here. Look at SBE for some of these principles.As you can imagine an entire treatise could be written here. This should set you in the right direction. Warning: Always measure and make sure you need more performance than the simplest option. It's easy to get sucked into a never ending black-hole of performance improvements.
Your logic around writing is faulty. You should attempt the write immediately you have data to write. If the
write()
returns zero it is then time to register for OP_WRITE, retry the write when the channel becomes writable, and deregister forOP_WRITE
when the write has succeeded. You're adding a massive amount of latency here. You're adding even more latency by deregistering forOP_READ
while you're doing all that.