WCF timeout exception detailed investigation

2020-01-26 16:24发布

问题:

We have an application that has a WCF service (*.svc) running on IIS7 and various clients querying the service. The server is running Win 2008 Server. The clients are running either Windows 2008 Server or Windows 2003 server. I am getting the following exception, which I have seen can in fact be related to a large number of potential WCF issues.

System.TimeoutException: The request channel timed out while waiting for a reply after 00:00:59.9320000. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeout. ---> System.TimeoutException: The HTTP request to 'http://www.domain.com/WebServices/myservice.svc/gzip' has exceeded the allotted timeout of 00:01:00. The time allotted to this operation may have been a portion of a longer timeout. 

I have increased the timeout to 30min and the error still occurred. This tells me that something else is at play, because the quantity of data could never take 30min to upload or download.

The error comes and goes. At the moment, it is more frequent. It does not seem to matter if I have 3 clients running simultaneously or 100, it still occurs once in a while. Most of the time, there are no timeouts but I still get a few per hour. The error comes from any of the methods that are invoked. One of these methods does not have parameters and returns a bit of data. Another takes in lots of data as a parameter but executes asynchronously. The errors always originate from the client and never reference any code on the server in the stack trace. It always ends with:

 at System.Net.HttpWebRequest.GetResponse()
  at System.ServiceModel.Channels.HttpChannelFactory.HttpRequestChannel.HttpChannelRequest.WaitForReply(TimeSpan timeout)

On the server: I've tried (and currently have) the following binding settings:

maxBufferSize="2147483647" maxReceivedMessageSize="2147483647" maxBufferPoolSize="2147483647"

It does not seem to have an impact.

I've tried (and currently have) the following throttling settings:

<serviceThrottling maxConcurrentCalls="1500"   maxConcurrentInstances="1500"    maxConcurrentSessions="1500"/>

It does not seem to have an impact.

I currently have the following settings for the WCF service.

[ServiceBehavior(InstanceContextMode = InstanceContextMode.Single, ConcurrencyMode = ConcurrencyMode.Single)]

I ran with ConcurrencyMode.Multiple for a while, and the error still occurred.

I've tried restarting IIS, restarting my underlying SQL Server, restarting the machine. All of these don't seem to have an impact.

I've tried disabling the Windows firewall. It does not seem to have an impact.

On the client, I have these settings:

maxReceivedMessageSize="2147483647"

<system.net>
    <connectionManagement>
    <add address="*" maxconnection="16"/>
</connectionManagement> 
</system.net>

My client closes its connections:

var client = new MyClient();

try
{
    return client.GetConfigurationOptions();
}
finally
{
    client.Close();
}

I have changed the registry settings to allow more outgoing connections:

MaxConnectionsPerServer=24, MaxConnectionsPer1_0Server=32.

I have now just recently tried SvcTraceViewer.exe. I managed to catch one exception on the client end. I see that its duration is 1 minute. Looking at the server side trace, I can see that the server is not aware of this exception. The maximum duration I can see is 10 seconds.

I have looked at active database connections using exec sp_who on the server. I only have a few (2-3). I have looked at TCP connections from one client using TCPview. It usually is around 2-3 and I have seen up to 5 or 6.

Simply put, I am stumped. I have tried everything I could find, and must be missing something very simple that a WCF expert would be able to see. It is my gut feeling that something is blocking my clients at the low-level (TCP), before the server actually receives the message and/or that something is queuing the messages at the server level and never letting them process.

If you have any performance counters I should look at, please let me know. (please indicate what values are bad, as some of these counters are hard to decypher). Also, how could I log the WCF message size? Finally, are there any tools our there that would allow me to test how many connections I can establish between my client and server (independently from my application)

Thanks for your time!

Extra information added June 20th:

My WCF application does something similar to the following.

while (true)
{
   Step1GetConfigurationSettingsFromServerViaWCF(); // can change between calls
   Step2GetWorkUnitFromServerViaWCF();
   DoWorkLocally(); // takes 5-15minutes. 
   Step3SendBackResultsToServerViaWCF();
}

Using WireShark, I did see that when the error occurs, I have a five TCP retransmissions followed by a TCP reset later on. My guess is the RST is coming from WCF killing the connection. The exception report I get is from Step3 timing out.

I discovered this by looking at the tcp stream "tcp.stream eq 192". I then expanded my filter to "tcp.stream eq 192 and http and http.request.method eq POST" and saw 6 POSTs during this stream. This seemed odd, so I checked with another stream such as tcp.stream eq 100. I had three POSTs, which seems a bit more normal because I am doing three calls. However, I do close my connection after every WCF call, so I would have expected one call per stream (but I don't know much about TCP).

Investigating a bit more, I dumped the http packet load to disk to look at what these six calls where.

1) Step3
2) Step1
3) Step2
4) Step3 - corrupted
5) Step1
6) Step2

My guess is two concurrent clients are using the same connection, that is why I saw duplicates. However, I still have a few more issues that I can't comprehend:

a) Why is the packet corrupted? Random network fluke - maybe? The load is gzipped using this sample code: http://msdn.microsoft.com/en-us/library/ms751458.aspx - Could the code be buggy once in a while when used concurrently? I should test without the gzip library.

b) Why would I see step 1 & step 2 running AFTER the corrupted operation timed out? It seems to me as if these operations should not have occurred. Maybe I am not looking at the right stream because my understanding of TCP is flawed. I have other streams that occur at the same time. I should investigate other streams - a quick glance at streams 190-194 show that the Step3 POST have proper payload data (not corrupted). Pushing me to look at the gzip library again.

回答1:

If you are using .Net client then you may not have set

//This says how many outgoing connection you can make to a single endpoint. Default Value is 2
System.Net.ServicePointManager.DefaultConnectionLimit = 200;

here is the original question and answer WCF Service Throttling

Update:

This config goes in .Net client application may be on start up or whenever but before starting your tests.

Moreover you can have it in app.config file as well like following

<system.net>
    <connectionManagement>
      <add maxconnection = "200" address ="*" />
    </connectionManagement>
  </system.net>


回答2:

If you havn't tried it already - encapsulate your Server-side WCF Operations in try/finally blocks, and add logging to ensure they are actually returning.

If those show that the Operations are completing, then my next step would be to go to a lower level, and look at the actual transport layer.

Wireshark or another similar packet capturing tool can be quite helpful at this point. I'm assuming this is running over HTTP on standard port 80.

Run Wireshark on the client. In the Options when you start the capture, set the capture filter to tcp http and host service.example.com - this will reduce the amount of irrelevant traffic.

If you can, modify your client to notify you the exact start time of the call, and the time when the timeout occurred. Or just monitor it closely.

When you get an error, then you can trawl through the Wireshark logs to find the start of the call. Right click on the first packet that has your client calling out on it (Should be something like GET /service.svc or POST /service.svc) and select Follow TCP Stream.

Wireshark will decode the entire HTTP Conversation, so you can ensure that WCF is actually sending back responses.



回答3:

from: http://www.codeproject.com/KB/WCF/WCF_Operation_Timeout_.aspx

To avoid this timeout error, we need to configure the OperationTimeout property for Proxy in the WCF client code. This configuration is something new unlike other configurations such as Send Timeout, Receive Timeout etc., which I discussed early in the article. To set this operation timeout property configuration, we have to cast our proxy to IContextChannel in WCF client application before calling the operation contract methods.



回答4:

I'm having a very similar problem. In the past, this has been related to serialization problems. If you are still having this problem, can you verify that you can correctly serialize the objects you are returning. Specifically, if you are using Linq-To-Sql objects that have relationships, there are known serialization problems if you put a back reference on a child object to the parent object and mark that back reference as a DataMember.

You can verify serialization by writing a console app that serializes and deserializes your objects using the DataContractSerializer on the server side and whatever serialization methods your client uses. For example, in our current application, we have both WPF and Compact Framework clients. I wrote a console app to verify that I can serialize using a DataContractSerializer and deserialize using an XmlDesserializer. You might try that.

Also, if you are returning Linq-To-Sql objects that have child collections, you might try to ensure that you have eagerly loaded them on the server side. Sometimes, because of lazy loading, the objects being returned are not populated and may cause the behavior you are seeing where the request is sent to the service method multiple times.

If you have solved this problem, I'd love to hear how because I'm stuck with it too. I have verified that my issue is not serialization so I'm at a loss.

UPDATE: I'm not sure if it will help you any but the Service Trace Viewer Tool just solved my problem after 5 days of very similar experience to yours. By setting up tracing and then looking at the raw XML, I found the exceptions that were causing my serialization problems. It was related to Linq-to-SQL objects that occasionally had more child objects than could be successfully serialized. Adding the following to your web.config file should enable tracing:

<sharedListeners>
    <add name="sharedListener"
         type="System.Diagnostics.XmlWriterTraceListener"
         initializeData="c:\Temp\servicetrace.svclog" />
  </sharedListeners>
  <sources>
    <source name="System.ServiceModel" switchValue="Verbose, ActivityTracing" >
      <listeners>
        <add name="sharedListener" />
      </listeners>
    </source>
    <source name="System.ServiceModel.MessageLogging" switchValue="Verbose">
      <listeners>
        <add name="sharedListener" />
      </listeners>
    </source>
  </sources>

The resulting file can be opened with the Service Trace Viewer Tool or just in IE to examine the results.



回答5:

Are you closing the connection to the WCF service in between requests? If you don't, you'll see this exact timeout (eventually).



回答6:

I've just solved the problem.I found that the nodes in the App.config file have configed wrong.

<client>
<endpoint name="WCF_QtrwiseSalesService" binding="wsHttpBinding" bindingConfiguration="ws" address="http://cntgbs1131:9005/MyService/TGE.ISupplierClientManager" contract="*">
</endpoint>
</client>

<bindings>
    <wsHttpBinding>
        <binding name="ws" maxBufferPoolSize="2147483647" maxReceivedMessageSize="2147483647" messageEncoding="Text">
            <readerQuotas maxDepth="2147483647" maxStringContentLength="2147483647" maxArrayLength="2147483647" maxBytesPerRead="2147483647" maxNameTableCharCount="2147483647"/>
            <**security mode="None">**
                <transport clientCredentialType="None"></transport>
            </security>
        </binding>
    </wsHttpBinding>
</bindings>

Confirm your config in the node <security>,the attribute "mode" value is "None". If your value is "Transport",the error occurs.



回答7:

Did you try using clientVia to see the message sent, using SOAP toolkit or something like that? This could help to see if the error is coming from the client itself or from somewhere else.



回答8:

Did you check the WCF traces? WCF has a tendency to swallow exceptions and only return the last exception, which is the timeout that you're getting, since the end point didn't return anything meaningful.



回答9:

You will also receive this error if you are passing an object back to the client that contains a property of type enum that is not set by default and that enum does not have a value that maps to 0. i.e enum MyEnum{ a=1, b=2};



回答10:

Looks like this exception message is quite generic and can be received due to a variety of reasons. We ran into this while deploying the client on Windows 8.1 machines. Our WCF client runs inside of a windows service and continuously polls the WCF service. The windows service runs under a non-admin user. The issue was fixed by setting the clientCredentialType to "Windows" in the WCF configuration to allow the authentication to pass-through, as in the following:

      <security mode="None">
        <transport clientCredentialType="Windows" proxyCredentialType="None"
          realm="" />
        <message clientCredentialType="UserName" algorithmSuite="Default" />
      </security>


回答11:

I'm not a WCF expert but I'm wondering if you aren't running into a DDOS protection on IIS. I know from experience that if you run a bunch of simultaneous connections from a single client to a server at some point the server stops responding to the calls as it suspects a DDOS attack. It will also hold the connections open until they time-out in order to slow the client down in his attacks.

Multiple connection coming from different machines/IP's should not be a problem however.

There's more info in this MSDN post:

http://msdn.microsoft.com/en-us/library/bb463275.aspx

Check out the MaxConcurrentSession sproperty.