I have an ECHO server application based on a TcpListener. It accepts clients, reads the data, and returns the same data. I have developed it using the async/await approach, using the XXXAsync methods provided by the framework.
I have set up performance counters to measure how many messages and bytes go in and out, and how many sockets are connected.
I have created a test application that starts 1400 asynchronous TcpClient instances, each sending a 1 KB message every 100-500 ms. Clients wait a random 10-1000 ms before starting, so they do not all try to connect at the same time. It works well: I can see in PerfMonitor the 1400 clients connected, sending messages at a good rate. I run the client app from another computer. The server's CPU and memory usage are very low; it is an Intel Core i7 with 8 GB of RAM. The client machine seems busier; it is an i5 with 4 GB of RAM, but still not even at 25%.
The problem appears when I start another client application. Connections start to fail in the clients. I do not see a huge increase in messages per second (roughly 20% more), but I see that the number of connected clients hovers around 1900-2100, rather than the expected 2800. Performance decreases a little, and the graph shows bigger swings between max and min messages per second than before.
Still, CPU usage is not even 40% and memory usage is still low. I have tried to increase the number of thread pool threads in both client and server:
ThreadPool.SetMaxThreads(5000, 5000);
ThreadPool.SetMinThreads(2000, 2000);
In the server, the connections are accepted in a loop:
while (true)
{
    var client = await _server.AcceptTcpClientAsync();
    HandleClientAsync(client);
}
The HandleClientAsync function returns a Task, but as you can see the loop does not await the handling; it just continues to accept another client. That handling function is something like this:
public async Task HandleClientAsync(TcpClient client)
{
    while (client.Connected && !_cancellation.IsCancellationRequested)
    {
        var msg = await ReadMessageAsync(client);
        await WriteMessageAsync(client, msg);
    }
}
Those two functions only read from and write to the stream asynchronously.
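For context, a minimal sketch of what such helpers could look like, assuming a simple 4-byte length prefix for framing (the helper names match the snippet above, but the framing and the ReadExactlyAsync helper are assumptions for illustration):

private static async Task<byte[]> ReadMessageAsync(TcpClient client)
{
    var stream = client.GetStream();
    // Read the assumed 4-byte length prefix, then the payload.
    var header = new byte[4];
    await ReadExactlyAsync(stream, header, 4);
    var payload = new byte[BitConverter.ToInt32(header, 0)];
    await ReadExactlyAsync(stream, payload, payload.Length);
    return payload;
}

private static async Task WriteMessageAsync(TcpClient client, byte[] msg)
{
    var stream = client.GetStream();
    await stream.WriteAsync(BitConverter.GetBytes(msg.Length), 0, 4);
    await stream.WriteAsync(msg, 0, msg.Length);
}

private static async Task ReadExactlyAsync(Stream stream, byte[] buffer, int count)
{
    // NetworkStream.ReadAsync may return fewer bytes than requested, so loop.
    int offset = 0;
    while (offset < count)
    {
        int read = await stream.ReadAsync(buffer, offset, count - offset);
        if (read == 0) throw new EndOfStreamException();
        offset += read;
    }
}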
I have seen that I can start the TcpListener indicating a backlog amount (see the snippet below), but what is the default value?
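For reference, the backlog is passed to TcpListener.Start, which has an overload taking the maximum length of the pending connections queue; the port and value here are just examples:

var listener = new TcpListener(IPAddress.Any, 8080);
// Backlog: maximum length of the queue of connections waiting to be accepted.
listener.Start(500);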
What could be the reason that the app does not scale up until it maxes out the CPU?
What approach and tools could I use to find out what the actual problem is?
UPDATE
I have tried the Task.Yield and Task.Run approaches, and they didn't help.
It also happens with server and client running locally on the same computer. Increasing the number of clients or messages per second actually reduces the service throughput: 600 clients sending a message every 100 ms generate more throughput than 1000 clients sending a message every 100 ms.
I see two exceptions on the client when connecting more than ~2000 clients. With around 1500 I see the exceptions at the beginning, but the clients finally connect. With more than 1500 I see lots of connections/disconnections:
"An existing connection was forcibly closed by the remote host" (System.Net.Sockets.SocketException) A System.Net.Sockets.SocketException was caught: "An existing connection was forcibly closed by the remote host"
"Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host." (System.IO.IOException) A System.IO.IOException was thrown: "Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host."
UPDATE 2
I have set up a very simple project with server and client using async/await and it scales as expected.
The project where I have the scalability problem is this WebSocket server, and even though it uses the same approach, apparently something is causing contention. There is a console application hosting the component, and a console application to generate load (it requires at least Windows 8).
Please note that I am not asking for a direct fix to the problem, but for the techniques or approaches to find out what is causing that contention.
I have managed to scale up to 6,000 concurrent connections without problems, processing around 24,000 messages per second, connecting from machine to machine (no localhost test) and using only around 80 physical threads.
There are some lessons I learnt:
Increasing the thread pool size made things worse
Do not do it unless you know what you are doing.
Call Task.Run or yield with Task.Yield
To ensure you release the calling thread instead of keeping it busy with the rest of the method. A sketch follows below.
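For illustration, a minimal sketch of what this could look like in the handler from the question (assuming the same HandleClientAsync and _cancellation):

public async Task HandleClientAsync(TcpClient client)
{
    // Return control to the accept loop immediately; the rest of the
    // method continues on a thread pool thread.
    await Task.Yield();

    while (client.Connected && !_cancellation.IsCancellationRequested)
    {
        var msg = await ReadMessageAsync(client);
        await WriteMessageAsync(client, msg);
    }
}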
ConfigureAwait(false)
Use it from your executable application if you are confident you are not in a single-threaded synchronization context; this allows any thread to pick up the continuation rather than waiting specifically for the one that started the operation to become free. See the example below.
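For example, applied to the awaits in the handler loop:

var msg = await ReadMessageAsync(client).ConfigureAwait(false);
await WriteMessageAsync(client, msg).ConfigureAwait(false);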
Byte[]
The memory profiler showed that the app was spending too much memory and time creating Byte[] instances, so I designed several strategies to reuse the available ones, or to work "in place" rather than create new ones and copy. The GC performance counters (specifically "% time in GC", which was around 55%) raised the alarm that something was not right. Also, I was using BitArray instances to check bits in bytes, which caused some memory overhead as well, so I replaced them with bitwise operations and it improved. Later on I discovered that WCF uses a Byte[] pool to cope with this problem. A pooling sketch is below.
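A minimal sketch of a byte-array pool illustrating the reuse idea (not the actual implementation from the project):

using System.Collections.Concurrent;

public sealed class ByteArrayPool
{
    private readonly ConcurrentBag<byte[]> _pool = new ConcurrentBag<byte[]>();
    private readonly int _bufferSize;

    public ByteArrayPool(int bufferSize)
    {
        _bufferSize = bufferSize;
    }

    public byte[] Take()
    {
        // Reuse a buffer if one is available; otherwise allocate a new one.
        byte[] buffer;
        return _pool.TryTake(out buffer) ? buffer : new byte[_bufferSize];
    }

    public void Return(byte[] buffer)
    {
        // Only keep buffers of the expected size.
        if (buffer.Length == _bufferSize)
            _pool.Add(buffer);
    }
}

Later .NET versions provide System.Buffers.ArrayPool&lt;byte&gt; for exactly this purpose.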
Asynchronous does not mean fast
Asynchrony allows you to scale nicely, but it has a cost. Just because an asynchronous operation is available does not mean you should use it. Use asynchronous programming when you presume there will be some time spent waiting before getting the actual response. If you are sure the data is already there or the response will be quick, proceed synchronously.
Supporting sync and async is tedious
You have to implement the methods twice; there is no bulletproof way of reusing async code from sync code. An example of the duplication is sketched below.
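For instance, a hypothetical pair of read helpers (the names are made up) showing the duplicated logic:

public int ReadHeader(Stream stream, byte[] buffer)
{
    // Synchronous path: blocks the calling thread.
    return stream.Read(buffer, 0, buffer.Length);
}

public Task<int> ReadHeaderAsync(Stream stream, byte[] buffer)
{
    // Asynchronous path: the same logic, written a second time.
    return stream.ReadAsync(buffer, 0, buffer.Length);
}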
Well, for one, you're running everything on one thread, so changing the ThreadPool isn't going to make any difference.

EDIT: As Noseration pointed out, this is not actually true. While IOCP and the asynchronous socket itself don't actually require additional threads for I/O requests, the default implementation in .NET does. The completion event is processed on a ThreadPool thread, and it is your responsibility to either supply your own TaskScheduler, or queue the event and process it manually on a consumer thread. I'm going to leave the rest of the answer, because it's still relevant (and the thread switching isn't a performance issue here, as described later in the answer). Also note that the default TaskScheduler in a UI application usually does use a synchronization context, so in e.g. WinForms the completion event would be processed on the UI thread. In any case, throwing more threads than CPU cores at the problem isn't going to help.

However, this isn't necessarily a bad thing. I/O-bound operations don't benefit from being run on a separate thread; in fact, it's very inefficient to do so. That's exactly what async and IOCP are for, so keep using them.

If you're starting to get significant CPU usage, that's where you want to make things parallel, as opposed to simply asynchronous. Still, receiving the messages on one thread using await should be just fine. Handling multi-threading is always tricky, and there are lots of approaches for different situations. In practice, you usually don't want more threads than you have processor cores available: if they're competing for I/O, use async; if they're competing for CPU, that's only going to get worse with more threads than the CPU can process in parallel.

Note that since you're running on one thread, one of your processor cores might very well be running at 100% while the rest do nothing. You can verify this easily in Task Manager.

Also, note that the number of TCP connections you can have open at one time is very much limited. Each connection has to have its own port on both the client and the server. On client Windows, the default ephemeral port range is somewhere in the order of 1000-4000 ports. That's not a lot for a server (nor for your load-testing clients).
If you open and close connections as well, this gets even worse, because TCP ports are kept reserved for some time after being disconnected (up to four minutes, the TIME_WAIT state). This is because opening a new TCP connection on the same port might mean that data for the old connection arrives on the new connection, which would be very, very bad. A small diagnostic sketch follows below.
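To check whether you're running into this, a minimal sketch that counts local TCP connections per state (a large TIME_WAIT count hints at port exhaustion):

using System;
using System.Linq;
using System.Net.NetworkInformation;

class TcpStateDump
{
    static void Main()
    {
        var connections = IPGlobalProperties.GetIPGlobalProperties()
                                            .GetActiveTcpConnections();
        // Group connections by TCP state and print the counts.
        foreach (var group in connections.GroupBy(c => c.State))
        {
            Console.WriteLine("{0}: {1}", group.Key, group.Count());
        }
    }
}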
Please add more information. What do ReadMessageAsync and WriteMessageAsync do? Is it possible that the performance impact is caused by GC? Have you tried profiling the CPU and memory? Are you sure you're not actually exhausting the network bandwidth with all those TCP messages? Have you checked whether you're experiencing TCP port exhaustion, or high packet loss?

UPDATE: I've written a test server and client, and they can exhaust the available TCP ports in under a second, including all initializations, when using asynchronous sockets. I'm running this on localhost, so each client connection actually takes two ports (one for the server, one for the client), so it's somewhat faster than when the client is on a different machine. In any case, it's obvious that the issue in my case is TCP port exhaustion.