Service Fabric Actor or Service Becomes Inaccessible

Posted 2019-03-15 10:12

Question:

After upgrading from Service Fabric SDK 2.0.135 to 2.3.301, we have started encountering situations where a Service Fabric actor or service is inaccessible in spite of showing as healthy in Service Fabric Explorer. Once in this state, any call to the actor or service via the ActorProxy or ServiceProxy hangs for 5 minutes before finally throwing a TimeoutException. The actor or service never recovers on its own, even if left for an hour. The only remedies are to restart the node(s) on which the actor or service resides, redeploy the actor or service (exact same EXE), reset the entire cluster, or reboot all of the cluster machines.

It usually gets into this state after deploying or re-deploying a SF application.

In the last year of working with Service Fabric (since SDK v1.3), we have never had this problem. It only started after moving to 2.3.301.

It seems to happen randomly and inconsistently. Which of the 13 SF applications in our solution are affected is also random.

Does anyone have any ideas on how we might be able to resolve this? It seems like a bug in the latest version of Service Fabric but perhaps we are doing something wrong on our end.

Any help is appreciated.

Below is a lot of extra information that I hope will be useful in understanding what we're facing with this issue.

Many thanks

Steps

I don't really have steps to consistently reproduce the issue. This is simply what I observe sometimes.

  1. I compiled and then re-deployed my SF project from Visual Studio (Debug -> Start Without Debugging)
  2. Visual Studio says it successfully deployed the project
  3. Service Fabric Explorer shows all of my services as Healthy, including Data-Binding
  4. The SF project in question has 2 actors that are part of a single EXE. Service Fabric Explorer shows each of these actors running on different nodes.
  5. Windows Task Manager shows two running copies of the EXE, which makes sense since there are two nodes running the EXE.

Likewise, our QA experiences the issue after deploying to Azure using PowerShell directly. (He doesn't deploy from Visual Studio.)

To recap

  • Visual Studio says the deployment was successful
  • Service Fabric Explorer shows that everything is healthy
  • Task Manager shows two running copies of the EXE

When I See The Failure

I have one SF Service calling another SF Service using the ServiceProxy or ActorProxy classes. We do this throughout our solution with a combination of 13 different applications and about 25 different Services & Actors. It has worked successfully since we started working with Service Fabric SDK v1.3 in November 2015.

Now, after upgrading to 2.3.301, we periodically see a random Actor or Service get into a state where it fails to respond to a method call made through ServiceProxy or ActorProxy. After 5 minutes of hanging, we receive a System.TimeoutException with the following message:

This can happen if message is dropped when service is busy or its long running operation and taking more time than configured Operation Timeout.

Note that the service is NOT busy, nor is it performing a long-running operation. As an actor, the service doesn’t do any ongoing operations at all. It simply exposes public methods that other services can consume. It fails from the very first call.

In fact, tracing shows us that even the first line of the method in the actor never gets called. It's as if the Service Fabric communication infrastructure fails to deliver the message.
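
While diagnosing a hang like this, it can help to stop waiting long before the 5-minute remoting timeout expires. The following is a minimal sketch in plain .NET, not a Service Fabric API: `FailFast` and `WithTimeout` are hypothetical names, and `call` stands for any ActorProxy or ServiceProxy invocation. It only stops the caller from waiting; the underlying request is not cancelled.

```csharp
using System;
using System.Threading.Tasks;

public static class FailFast
{
    // Awaits 'call' but gives up waiting after 'limit'.
    // The underlying call keeps running until the remoting layer's own timeout fires.
    public static async Task<T> WithTimeout<T>(Func<Task<T>> call, TimeSpan limit)
    {
        Task<T> work = call();
        Task first = await Task.WhenAny(work, Task.Delay(limit));
        if (first != work)
        {
            throw new TimeoutException("No reply within " + limit.TotalSeconds + " s.");
        }
        return await work; // returns the result or rethrows the call's own exception
    }
}
```

A call site might look like `await FailFast.WithTimeout(() => proxy.RenderAsync(request), TimeSpan.FromSeconds(30))`, where `proxy`, `RenderAsync` and `request` are placeholders for your own types.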

When This Started

In the past 12 months, we had never seen this issue.

Now, we are seeing this issue frequently and under a variety of conditions since upgrading Service Fabric last week.

We upgraded to Service Fabric SDK 2.3.301.9590 and Service Fabric 5.3.301.9590.

At first, each developer on the team encountered the issue independently and each thought it was a transient issue with just their own machine. Service Fabric does have some issues, so we just accepted this and moved on. But then we started to complain to each other and realized that we were all seeing it. Even our QAs are seeing it in the cloud, on our environment that is soon to be production.

Again, this only started when we upgraded to the latest version of Service Fabric last week.

Previously, we were running Service Fabric SDK 2.0.135.

We upgraded our codebase by installing SDK v 2.3.301, opening each of our solutions and allowing Visual Studio to conduct the upgrade.

The Environment

I’m running a fresh install of Windows 10 Enterprise (installed it less than 2 weeks ago) on an i7 with 16 gigs of RAM. I have a fresh install of Visual Studio 2015 Update 3 and SF 2.3.301.9590. I installed everything clean. No upgrades.

This is also happening on all of my colleagues' machines (of varying ages, configurations and “freshnesses”). It happens sporadically to each of us.

Most critically, this is also happening on our Service Fabric VMs on Azure. These are machines that our QA created about a month ago using the standard templates for Service Fabric VMs on Azure. It had 5.3.301.9590 pre-installed. He did not manually install any updates to Service Fabric. Our SF-based application did not encounter this problem on Azure (or our own dev machines) until after the developers upgraded to the new version.

This is not a "my machine" thing, nor is it isolated to just the development environment. The only consistent change for all of us is the update of the SF version.

The Cause

We have no idea what causes it.

It usually happens immediately after deploying a new SF application. Yes, we do wait for the usual 2 or 3 minutes it takes for SF to "figure itself out" after deploying. We have left it for an hour or more and it just never works.

Anecdotally, I think I've had a SF Service that was working fine and then suddenly stopped working but this was before we realized there was an issue so I wasn't looking for it. I can't be certain.

The Work-Arounds

Once we have a SF service in that “inaccessible” state, Service Fabric will not get itself back out of that state again. The application is completely unusable. With varying degrees of success, we do the following:

  • Re-deploy the inaccessible SF application
  • Restart the nodes (through Service Fabric Explorer by going to the node, clicking the ellipsis button and clicking the “Restart” option) that host the inaccessible SF services & actors
  • Restart the entire SF cluster (Stop then Start)
  • Restart all of the machines running a SF node
  • Reset the entire cluster and re-deploy everything (last resort but it has been necessary a few times)

Interestingly, what does not help is using Task Manager to kill the offending processes. If I kill the offending process, Service Fabric restarts it (as expected) but it still won't respond to messages.

Thus, the issue seems to be with Service Fabric itself and not with the EXEs.

Of course, these aren’t “solutions” at all because they leave our entire application inaccessible until SF can restart/rebalance. Even restarting a few of the nodes knocks a bunch of stuff off-line.

Essentially, this is a show-stopper for us. We can’t possibly put our application into production (or even beta) with Service Fabric behaving like this.

The C# Exception when Using the Service Proxy or Actor Proxy:

JSON rendering of the exception thrown by ActorProxy or ServiceProxy:

"exception": {
    "ClassName": "System.TimeoutException",
    "Message": "This can happen if message is dropped when service is busy or its long running operation and taking more time than configured Operation Timeout.",
    "Data": null,
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   at Microsoft.ServiceFabric.Services.Communication.Client.ServicePartitionClient`1.<InvokeWithRetryAsync>d__7`1.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Client.ServiceRemotingPartitionClient.<InvokeAsync>d__8.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Builder.ProxyBase.<InvokeAsync>d__0.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.ServiceFabric.Services.Remoting.Builder.ProxyBase.<ContinueWithResult>d__7`1.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()\r\n   at RenderingCachingEngine.RenderingCachingEngine.<Render>d__10.MoveNext() in C:\\Code\\Ink\\Dev\\Current\\Source\\Rendering Service Fabric\\RenderingCachingEngine\\RenderingCachingEngine.cs:line 381",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": "8\nMoveNext\nMicrosoft.ServiceFabric.Services, Version=5.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35\nMicrosoft.ServiceFabric.Services.Communication.Client.ServicePartitionClient`1+<InvokeWithRetryAsync>d__7`1\nVoid MoveNext()",
    "HResult": -2146233083,
    "Source": "Microsoft.ServiceFabric.Services",
    "WatsonBuckets": null
  }

Here is a JSON rendering of the Service Fabric Info:

  "serviceFabricInfo": {
    "serviceFabricServiceName": "fabric:/Rendering/RenderingCachingEngine",
    "serviceFabricServiceTypeName": "RenderingCachingEngineType",
    "serviceFabricReplicaId": 131225099453058851,
    "serviceFabricPartitionId": "e400087d-8a08-4dab-bcdd-1f5ce82f374f",
    "serviceFabricApplicationName": "fabric:/Rendering",
    "serviceFabricApplicationTypeName": "RenderingType",
    "serviceFabricNodeName": "_Node_4"
  }

The Event Viewer Logs When Re-Deploying

Windows Event Viewer does show some noteworthy logs under “Applications and Services Logs -> Microsoft-Service Fabric -> Admin”.

The following logs happened while I was re-deploying an updated version of my application (note that DataBinding.exe is the name of the EXE containing my two SF actors):

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:53 PM
Event ID:      256
Task Category: Common
Level:         Error
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
WriteNode failed. HRESULT=-2147467259, Output=CustomOutput
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>256</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>1</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:53.678587200Z" />
    <EventRecordID>7620</EventRecordID>
    <Correlation />
    <Execution ProcessID="4440" ThreadID="7360" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">XmlLiteWriter</Data>
    <Data Name="text">WriteNode failed. HRESULT=-2147467259, Output=CustomOutput</Data>
  </EventData>
</Event>

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:54 PM
Event ID:      23073
Task Category: Hosting
Level:         Warning
Keywords:      Default
User:          SYSTEM
Computer:      shayward10.ovx.local
Description:
ServiceHostProcess: DataBinding.exe for ApplicationId 805915c7-456c-49d3-af95-62cc44650664 terminated unexpectedly with exit code 3221225786 on node id bf865279ba277deb864a976fbf4c200e
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>23073</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>90</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:54.820567800Z" />
    <EventRecordID>7621</EventRecordID>
    <Correlation />
    <Execution ProcessID="6944" ThreadID="3812" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="id">bf865279ba277deb864a976fbf4c200e</Data>
    <Data Name="AppId">805915c7-456c-49d3-af95-62cc44650664</Data>
    <Data Name="ReturnCode">3221225786</Data>
    <Data Name="ProcessName">DataBinding.exe</Data>
  </EventData>
</Event>

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:38:56 PM
Event ID:      256
Task Category: Common
Level:         Error
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
WriteNode failed. HRESULT=-2147467259, Output=CustomOutput
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>256</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>1</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:38:56.261857600Z" />
    <EventRecordID>7627</EventRecordID>
    <Correlation />
    <Execution ProcessID="4440" ThreadID="8564" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">XmlLiteWriter</Data>
    <Data Name="text">WriteNode failed. HRESULT=-2147467259, Output=CustomOutput</Data>
  </EventData>
</Event>
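
The numeric codes in these logs are easier to look up in hexadecimal. Here is a minimal sketch of the conversion; `CodeDecoder` and `Hex` are hypothetical names. The symbolic names in the comments are the standard Windows and .NET constants for those values. Reading 0xC000013A as a sign that the host process was stopped by a Ctrl+C-style signal during redeployment is an interpretation, not something the logs state.

```csharp
using System;

public static class CodeDecoder
{
    // Reinterprets a signed 32-bit HRESULT as its unsigned bit pattern, in hex.
    public static string Hex(int hresult)
    {
        return "0x" + unchecked((uint)hresult).ToString("X8");
    }

    // Process exit codes in the log are already unsigned; they only need hex formatting.
    public static string Hex(uint exitCode)
    {
        return "0x" + exitCode.ToString("X8");
    }

    public static void Main()
    {
        Console.WriteLine(Hex(-2147467259)); // 0x80004005 (E_FAIL), from the WriteNode events
        Console.WriteLine(Hex(3221225786u)); // 0xC000013A (STATUS_CONTROL_C_EXIT), the DataBinding.exe exit code
        Console.WriteLine(Hex(-2146233083)); // 0x80131505 (COR_E_TIMEOUT), the HResult in the exception JSON
    }
}
```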

The Event Viewer Logs when it Times Out

Once the service is in an inaccessible state, trying to call it yields the following log on each request (after waiting for 5 minutes):

Log Name:      Microsoft-ServiceFabric/Admin
Source:        Microsoft-ServiceFabric
Date:          11/2/2016 2:44:55 PM
Event ID:      44289
Task Category: FabricTransport
Level:         Warning
Keywords:      Default
User:          NETWORK SERVICE
Computer:      shayward10.ovx.local
Description:
Error While Sending Message : FABRIC_E_TIMEOUT
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-ServiceFabric" Guid="{CBD93BC2-71E5-4566-B3A7-595D8EECA6E8}" />
    <EventID>44289</EventID>
    <Version>0</Version>
    <Level>3</Level>
    <Task>173</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000001</Keywords>
    <TimeCreated SystemTime="2016-11-02T18:44:55.349048200Z" />
    <EventRecordID>7629</EventRecordID>
    <Correlation />
    <Execution ProcessID="18600" ThreadID="8076" />
    <Channel>Microsoft-ServiceFabric/Admin</Channel>
    <Computer>shayward10.ovx.local</Computer>
    <Security UserID="S-1-5-20" />
  </System>
  <EventData>
    <Data Name="id">
    </Data>
    <Data Name="type">ServiceCommunicationClient</Data>
    <Data Name="text">Error While Sending Message : FABRIC_E_TIMEOUT</Data>
  </EventData>
</Event>

Answer 1:

This issue can happen in 2 scenarios.

  1. If your ActorService method takes longer to process than the default timeout, you need to change the OperationTimeout value. By default it is 5 minutes. You can change it by adding the assembly-level FabricTransportServiceRemotingProviderAttribute to your client assembly.

https://msdn.microsoft.com/en-us/library/microsoft.servicefabric.services.remoting.fabrictransport.fabrictransportserviceremotingproviderattribute.aspx

  2. If the first scenario does not apply, you can try the mitigation below for a known bug.
    • Specify Port 0 in the Service Manifest for the ActorService endpoint. By default, the actor endpoint is listed in the ServiceManifest but without a port.

This is how it will look for the ActorService after you make the change.

<Endpoint Name="Actor1ActorServiceEndpoint" Port="0" />

We are aware of the problem and a fix is on the way.
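
For context, here is a sketch of where that element sits in ServiceManifest.xml, assuming the default layout generated for an actor project. The endpoint name is the one from the example above and will differ in your project.

```xml
<Resources>
  <Endpoints>
    <!-- Port="0" is the mitigation described above; the name is project-specific -->
    <Endpoint Name="Actor1ActorServiceEndpoint" Port="0" />
  </Endpoints>
</Resources>
```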



Answer 2:

In case it helps anyone, we were seeing these timeouts on long-running (over 5 minute) operations. Following Suchi's hint about the FabricTransportServiceRemotingProviderAttribute, we added the following lines to our SF projects' AssemblyInfo.cs to increase the timeout to 1 hour.

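using Microsoft.ServiceFabric.Actors.Remoting.FabricTransport;
using Microsoft.ServiceFabric.Services.Remoting.FabricTransport;
[assembly: FabricTransportServiceRemotingProvider(OperationTimeoutInSeconds = 3600)]
[assembly: FabricTransportActorRemotingProvider(OperationTimeoutInSeconds = 3600)]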

(Also note that if you're using Azure Service Bus, the maximum lock time is 5 minutes, so you'll have to implement some lock renewal code to support long-running operations.)
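
One possible shape for such renewal code is sketched below. It is deliberately independent of any Service Bus client: `LockKeeper` and `RunWithRenewal` are hypothetical names, and `renewLock` is assumed to be a delegate wrapping whatever renewal call your client library exposes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LockKeeper
{
    // Runs 'work' while calling 'renewLock' every 'interval' until the work completes.
    public static async Task RunWithRenewal(Func<Task> work, Func<Task> renewLock, TimeSpan interval)
    {
        using (var cts = new CancellationTokenSource())
        {
            Task renewals = RenewLoop(renewLock, interval, cts.Token);
            try
            {
                await work();
            }
            finally
            {
                cts.Cancel(); // stop renewing once the work has finished or failed
                try { await renewals; }
                catch (OperationCanceledException) { /* expected when the loop is cancelled */ }
                // Any other exception from a failed renewal is rethrown here.
            }
        }
    }

    private static async Task RenewLoop(Func<Task> renewLock, TimeSpan interval, CancellationToken token)
    {
        while (true)
        {
            await Task.Delay(interval, token); // throws when 'token' is cancelled
            await renewLock();
        }
    }
}
```

The interval should be comfortably shorter than the lock duration, for example one minute against a five-minute lock.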