We have a (long-running) Windows service that among other things periodically communicates with an FTP server embedded on a third-party device using FtpWebRequest. This works great most of the time, but sometimes our service stops communicating with the device, but as soon as you restart our service everything starts working again.
I've spent some time debugging this with an MCVE (included below) and discovered via Wireshark that once communication starts failing there is no network traffic going to the external FTP server (no packets at all show up going to this IP in Wireshark). If I try to connect to the same FTP from another application on the same machine like Windows explorer everything works fine.
Looking at the packets just before everything stops working I see packets with the reset (RST) flag set coming from the device, so I suspect this may be the issue. Once some part of the network stack on the computer our service in running on receives the reset packet it does what's described in the TCP resets section of this article and blocks all further communication from our process to the device.
As far as I can tell there's nothing wrong with the way we're communicating with the device, and most of the time the exact same code works just fine. The easiest way to reproduce the issue (see MCVE below) seems to be to make a lot of separate connections to the FTP at the same time, so I suspect the issue may occur when there are a lot of connections being made to the FTP (not all by us) at the same time.
The thing is that if we do restart our process everything works fine, and we do need to re-establish communication with the device. Is there a way to re-establish communication (after a suitable amount of time has passed) without having to restart the entire process?
Unfortunately the FTP server is running embedded on a fairly old third-party device that's not likely to be updated to address this issue, and even if it were we'd still want/need to communicate with all the ones already out in the field without requiring our customers to update them if possible.
Options we are aware of:
Using a command line FTP client such as the one built into Windows.
- One downside to this is that we need to list all the files in a directory and then download only some of them, so we'd have to write logic to parse the response to this.
- We'd also have to download the files to a temp file instead of to a stream like we do now.
Creating another application that handles the FTP communication part that we tear down after each request completes.
- The main downside here is that inter-process communication is a bit of a pain.
MCVE
This runs in LINQPad and reproduces the issue fairly reliably. Typically the first several tasks succeed and then the issue occurs, and after that all tasks start timing out. In Wireshark I can see that no communication between my computer and the device is happening.
If I run the script again then all tasks fail until I restart LINQPad or do "Cancel All Threads and Reset" which restarts the process LINQPad uses to run the query. If I do either of those things then we're back to the first several tasks succeeding.
async Task Main() {
var tasks = new List<Task>();
var numberOfBatches = 3;
var numberOfTasksPerBatch = 10;
foreach (var batchNumber in Enumerable.Range(1, numberOfBatches)) {
$"Starting tasks in batch {batchNumber}".Dump();
tasks.AddRange(Enumerable.Range(1, numberOfTasksPerBatch).Select(taskNumber => Connect(batchNumber, taskNumber)));
await Task.Delay(TimeSpan.FromSeconds(5));
}
await Task.WhenAll(tasks);
}
async Task Connect(int batchNumber, int taskNumber) {
try {
var client = new FtpClient();
var result = await client.GetFileAsync(new Uri("ftp://192.168.0.191/logging/20140620.csv"), TimeSpan.FromSeconds(10));
result.Count.Dump($"Task {taskNumber} in batch {batchNumber} succeeded");
} catch (Exception e) {
e.Dump($"Task {taskNumber} in batch {batchNumber} failed");
}
}
public class FtpClient {
public virtual async Task<ImmutableList<Byte>> GetFileAsync(Uri fileUri, TimeSpan timeout) {
if (fileUri == null) {
throw new ArgumentNullException(nameof(fileUri));
}
FtpWebRequest ftpWebRequest = (FtpWebRequest)WebRequest.Create(fileUri);
ftpWebRequest.Method = WebRequestMethods.Ftp.DownloadFile;
ftpWebRequest.UseBinary = true;
ftpWebRequest.KeepAlive = false;
using (var source = new CancellationTokenSource(timeout)) {
try {
using (var response = (FtpWebResponse)await ftpWebRequest.GetResponseAsync()
.WithWaitCancellation(source.Token)) {
using (Stream ftpStream = response.GetResponseStream()) {
if (ftpStream == null) {
throw new InvalidOperationException("No response stream");
}
using (var dataStream = new MemoryStream()) {
await ftpStream.CopyToAsync(dataStream, 4096, source.Token)
.WithWaitCancellation(source.Token);
return dataStream.ToArray().ToImmutableList();
}
}
}
} catch (OperationCanceledException) {
throw new WebException(
String.Format("Operation timed out after {0} seconds.", timeout.TotalSeconds),
WebExceptionStatus.Timeout);
} finally {
ftpWebRequest.Abort();
}
}
}
}
public static class TaskCancellationExtensions {
/// http://stackoverflow.com/a/14524565/1512
public static async Task<T> WithWaitCancellation<T>(
this Task<T> task,
CancellationToken cancellationToken) {
// The task completion source.
var tcs = new TaskCompletionSource<Boolean>();
// Register with the cancellation token.
using (cancellationToken.Register(
s => ((TaskCompletionSource<Boolean>)s).TrySetResult(true),
tcs)) {
// If the task waited on is the cancellation token...
if (task != await Task.WhenAny(task, tcs.Task)) {
throw new OperationCanceledException(cancellationToken);
}
}
// Wait for one or the other to complete.
return await task;
}
/// http://stackoverflow.com/a/14524565/1512
public static async Task WithWaitCancellation(
this Task task,
CancellationToken cancellationToken) {
// The task completion source.
var tcs = new TaskCompletionSource<Boolean>();
// Register with the cancellation token.
using (cancellationToken.Register(
s => ((TaskCompletionSource<Boolean>)s).TrySetResult(true),
tcs)) {
// If the task waited on is the cancellation token...
if (task != await Task.WhenAny(task, tcs.Task)) {
throw new OperationCanceledException(cancellationToken);
}
}
// Wait for one or the other to complete.
await task;
}
}
This reminds me of old(?) IE behaviour of no reload of pages even when the network came back after N unsuccessful tries.
You should try setting the
FtpWebRequest
's cache policy toBypassCache
.after setting
KeepAlive
.