Is there anything aside from setting Secondaries=1 in the cluster configuration needed to enable High Availability, specifically in the cache client configuration?
Our configuration:
- Cache Cluster (3 Windows Enterprise hosts using a SQL configuration provider):
- Cache Clients
With the above configuration, we see primary and secondary regions created on the three hosts. However, when one of the hosts is stopped, the following exceptions occur:
ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.
An existing connection was forcibly closed by the remote host
No connection could be made because the target machine actively refused it 192.22.0.34:22233
An existing connection was forcibly closed by the remote host
Isn't the point of High Availability to be able to handle hosts going down without interrupting service? We are using a named region. Does this break High Availability? I read somewhere that named regions can only exist on one host (I did verify that a secondary does exist on another host). I feel like we're missing something in the cache client configuration that would enable High Availability. Any insight on the matter would be greatly appreciated.
After opening a ticket with Microsoft we narrowed it down to having a static DataCacheFactory object. Looking at the tracelogs from AppFabric, the clients are still trying to connect to all the hosts without handling hosts going down. Resetting IIS on the clients forces a new DataCacheFactory to be created (in our App_Start) and stops the exceptions. The MS engineers agreed that this approach was the best practices way (we also found several articles about this: see link and link).
They are continuing to investigate a solution for us. In the meantime we have come up with the following temporary workaround, where we force a new DataCacheFactory object to be created in the event that we encounter one of the above exceptions. Will update this thread when we learn more.
High Availability is about protecting the data, not making it available every second (hence the retry exceptions). When a cache host goes down, you get an exception and are supposed to retry. During that time, access to HA caches may throw a retry exception back to you while the cluster is busy shuffling data around and creating an extra copy. Regions complicate this further, since they cause a larger chunk to have to be copied before the cache is HA again.
Also, the client keeps a connection to all cache hosts, so when one goes down it throws an exception to signal that something happened.
Basically, when one host goes down, AppFabric freaks out until two copies of all data exist again in the HA caches. We created a small layer in front of it to handle this logic and dropped the servers one at a time to make sure it handled all scenarios, so that our app kept working and was just a tad slower.
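A small layer like that could look roughly like the sketch below. It is written in Python for illustration only and does not use the AppFabric API: `RetryLater` and `with_retry` are hypothetical names, the attempt count and delay are arbitrary, and falling back to another data source once the retries run out is an assumption about how the app keeps working while the cache recovers.

```python
import time


class RetryLater(Exception):
    """Hypothetical stand-in for a 'retry later' cache exception."""


def with_retry(operation, attempts=3, delay=0.5, fallback=None, sleep=time.sleep):
    """Run operation(); on RetryLater wait and try again, then fall back."""
    for attempt in range(attempts):
        try:
            return operation()
        except RetryLater:
            # Pause between attempts, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay)
    # The cache is still unavailable: use the fallback if one was given
    # (assumed here to be a slower source such as the database).
    return fallback() if fallback is not None else None
```

A call site would wrap each cache read, for example `with_retry(lambda: cache.get(key), fallback=lambda: load_from_db(key))`, where `cache` and `load_from_db` are placeholders for your own objects.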