We have a Netty (4.0.15) based WebSocket server running on Ubuntu v10, and during resiliency testing we do the following:
- kill -9 server
- send some data from client
- expect writeAndFlush failure on client
For some reason we sometimes see:
- writeAndFlush success, and only afterwards
- java.io.IOException: Connection reset by peer
So is it possible that writeAndFlush sometimes completes successfully even though the server is gone, while at other times it fails?
Maybe this happens because of how the OS schedules socket clean-up for killed processes?
Client test code:
channel.writeAndFlush(new TextWebSocketFrame("blah blah")).addListeners(
<snip>
public void operationComplete(ChannelFuture future) {
assert !future.isSuccess(); // <-- sometimes this is not triggered
}
</snip>
Thanks for any ideas,
It's a simple race condition, and something you have to accept can happen. You can only determine that the remote host has disappeared by noticing that you are no longer receiving data from it. Generally this is achieved by setting a timer and assuming that, if no data has been received before it expires (possibly in response to a keep-alive message), the remote host is dead.
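As a rough illustration of the timer idea, here is a minimal sketch using plain java.net sockets rather than Netty, so that it runs without extra dependencies. The class name HeartbeatProbe, the one-byte ping and the 300 ms timeout are all made up for the example. In a Netty pipeline the same idea is usually built around an IdleStateHandler, but that is not shown here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class HeartbeatProbe {

    // Sends a one-byte ping and waits up to timeoutMillis for any reply byte.
    // Returns false on timeout, end of stream, or any other I/O error,
    // which the caller should treat as "peer is dead".
    static boolean peerAnswers(Socket socket, int timeoutMillis) {
        try {
            socket.setSoTimeout(timeoutMillis);
            socket.getOutputStream().write('P');
            socket.getOutputStream().flush();
            return socket.getInputStream().read() != -1;
        } catch (IOException e) { // includes SocketTimeoutException
            return false;
        }
    }

    // Hypothetical demo: a loopback peer that accepts the connection but never replies.
    static boolean silentPeerAnswers() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            return peerAnswers(client, 300);
        } catch (IOException e) {
            throw new IllegalStateException("loopback setup failed", e);
        }
    }

    // Hypothetical demo: a loopback peer that has a reply byte waiting for the client.
    static boolean repliedPeerAnswers() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.getOutputStream().write('A');
            accepted.getOutputStream().flush();
            return peerAnswers(client, 300);
        } catch (IOException e) {
            throw new IllegalStateException("loopback setup failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("silent peer answers:  " + silentPeerAnswers());
        System.out.println("replied peer answers: " + repliedPeerAnswers());
    }
}
```

The point is that liveness is decided by what comes back within the timeout, not by whether the write call returned normally.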
Essentially, TCP assumes that the remote host is dead if it retransmits some data a certain number of times without receiving an acknowledgement, or if it does not receive a response to a keep-alive probe (keep-alive is usually off by default). However, as long as there is room in your host's send buffer, you can continue to call writeAndFlush successfully, because the data is simply queued in the network buffers. writeAndFlush is considered to have succeeded once Netty has written the data to the kernel send buffer. There is no way of determining whether the data reached the remote host without an application-level acknowledgement. Thus you may be calling writeAndFlush while TCP is still in the process of determining that the remote host has died, in which case writeAndFlush succeeds but the data is never delivered. Alternatively, you may call writeAndFlush just as TCP determines that the remote host is dead, in which case an error is raised.
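To see the "write succeeds because it only reaches the local send buffer" effect in isolation, here is a small sketch using plain java.net sockets on loopback instead of Netty. The class and method names are invented for the example, and the immediate close of the accepted socket only stands in for a peer that has gone away. On typical TCP stacks I would expect the write to return normally, although nobody will ever read the data.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class WriteAfterPeerClose {

    // Returns true if a write issued after the peer has closed its end
    // still completes without an exception.
    static boolean writeSucceedsAfterPeerClose() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            // Accept and immediately close: stands in for a peer that has gone away.
            server.accept().close();
            // Give the peer's close notification time to reach the client side.
            Thread.sleep(200);
            OutputStream out = client.getOutputStream();
            // The bytes are handed to the local kernel send buffer; no one reads them.
            out.write("blah blah".getBytes(StandardCharsets.UTF_8));
            out.flush();
            return true;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("write succeeded: " + writeSucceedsAfterPeerClose());
    }
}
```

A later read or write on the same socket is where an error such as "Connection reset by peer" would normally surface, which matches the order of events described in the question.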
There's a lot more information on TCP retransmission and keep alive here and here