Sporadic TCP connection failures (WSAEHOSTUNREACH)

Posted 2019-07-04 22:33

Question:

On a local gigabit network, I have an application using a single TCP server and many clients. Each client pings the server every 30 seconds by opening a TCP connection, sending a status message, and closing the connection.

The server is set up using SocketAsyncEventArgs, very similarly to the example shown HERE (omitted for brevity).

The clients initiate the connection using a TcpClient.

Relevant section of client code:

using (TcpClient client = new TcpClient())
{
    // Start the connect asynchronously so the wait can be bounded by a timeout.
    IAsyncResult ar = client.BeginConnect(address, port, null, null);
    if (!ar.AsyncWaitHandle.WaitOne(timeout))
    {
        throw new ApplicationException("Timed out waiting for connection to " + address);
    }
    client.EndConnect(ar); // exception thrown 5%-10% of the time

    // ...send message and receive response...
}

Everything works fine, except that on some machines EndConnect throws an exception 5%-10% of the time.

The exception is a WSAEHOSTUNREACH (10065):

System.Net.Sockets.SocketException (0x80004005): A socket operation was attempted to an unreachable host 192.168.XXX.XXX:XXXX
at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult)
at System.Net.Sockets.TcpClient.EndConnect(IAsyncResult asyncResult)

  • The issue is definitely not congestion: it happens even when only one client is up and running, and at hours when network traffic is minimal.
  • I can see that EndConnect is called very shortly after BeginConnect, so no time is spent inside ar.AsyncWaitHandle.WaitOne.

My question is: how can I debug this type of error? The server is definitely up when this happens.
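One way to start narrowing this down is to record, for every attempt, the time of day, how long the connect took, and which SocketError came back, so that failures can be lined up against other events on the machine. The following is only a minimal sketch: ConnectProbe and TryConnect are hypothetical names (not part of the original client), and it assumes the address is available as a string.

```csharp
using System;
using System.Diagnostics;
using System.Net.Sockets;

// Hypothetical diagnostic helper; the names are illustrative only.
public static class ConnectProbe
{
    // Attempts one connection and returns a one-line description of the outcome.
    public static string TryConnect(string address, int port, TimeSpan timeout)
    {
        Stopwatch sw = Stopwatch.StartNew();
        using (TcpClient client = new TcpClient())
        {
            try
            {
                IAsyncResult ar = client.BeginConnect(address, port, null, null);
                if (!ar.AsyncWaitHandle.WaitOne(timeout))
                {
                    return Describe("timeout", sw);
                }
                client.EndConnect(ar);
                return Describe("connected", sw);
            }
            catch (SocketException ex)
            {
                // SocketErrorCode is HostUnreachable (10065) for WSAEHOSTUNREACH.
                string code = ex.SocketErrorCode + " (" + (int)ex.SocketErrorCode + ")";
                return Describe("failed: " + code, sw);
            }
        }
    }

    // Prefixes the outcome with an ISO 8601 timestamp and appends the elapsed time.
    private static string Describe(string outcome, Stopwatch sw)
    {
        return DateTime.Now.ToString("o") + " " + outcome
            + " after " + sw.ElapsedMilliseconds + " ms";
    }
}
```

Writing the returned line to a log on each 30-second ping would show whether the failures cluster around particular times, and whether they really do return almost immediately.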

Answer 1:

The problem seems to have been related to Windows sleep mode. When the machine was asleep, it would occasionally generate these exceptions.

Disabling sleep mode using SetThreadExecutionState as outlined here seems to have taken care of the issue.
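For reference, a call to SetThreadExecutionState through P/Invoke can look roughly like the sketch below. The flag values are the documented Win32 constants ES_SYSTEM_REQUIRED and ES_CONTINUOUS; the ExecutionState enum and the SleepBlocker class are hypothetical names chosen for illustration, not necessarily what the linked answer uses.

```csharp
using System;
using System.Runtime.InteropServices;

// Subset of the Win32 EXECUTION_STATE flags (names here are illustrative).
[Flags]
public enum ExecutionState : uint
{
    SystemRequired = 0x00000001, // ES_SYSTEM_REQUIRED
    Continuous = 0x80000000      // ES_CONTINUOUS
}

// Hypothetical wrapper; Windows only, since it calls into kernel32.dll.
public static class SleepBlocker
{
    [DllImport("kernel32.dll")]
    private static extern ExecutionState SetThreadExecutionState(ExecutionState esFlags);

    // Continuous makes the request persist instead of only resetting the idle timer once.
    public static readonly ExecutionState KeepAwakeFlags =
        ExecutionState.Continuous | ExecutionState.SystemRequired;

    // Asks Windows not to sleep on idle; returns false if the call failed.
    public static bool PreventSleep()
    {
        // The function returns the previous state, or 0 on failure.
        return SetThreadExecutionState(KeepAwakeFlags) != 0;
    }

    // Clears the request so the normal idle timeout applies again.
    public static bool AllowSleep()
    {
        return SetThreadExecutionState(ExecutionState.Continuous) != 0;
    }
}
```

The client would call SleepBlocker.PreventSleep() once at startup and SleepBlocker.AllowSleep() on shutdown. The continuous state is tied to the calling thread, so the call should be made from a thread that lives as long as the application, and as far as I know it only blocks idle-triggered sleep, not a sleep the user requests explicitly.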

Still, I am not sure why I was getting SocketExceptions in this case. I could understand the timer not firing at all, but I am not sure why the connection would fail.