StackExchange.Redis: Transient fault tolerance and implicit retry support discussion

Created on 28 Apr 2016  ·  5 Comments  ·  Source: StackExchange/StackExchange.Redis

The Problem

We're a heavy user of Azure Redis Cache, and the platform will sometimes (e.g., once a month) reboot the underlying host OS for platform updates, causing our primary Redis cache instance to go down. The secondary instance that Azure runs takes over, but only after a brief window in which several commands fail.

When these events happen, sockets are disconnected, commands fail, and timeouts momentarily occur, and SE.Redis rightfully has to throw those exceptions.

Here's an example of the exceptions we may see during this time:

[RedisConnectionException: SocketFailure on SMEMBERS]
...
Message: [IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.]
System.Net.Security._SslStream.EndRead(IAsyncResult asyncResult):174
StackExchange.Redis.PhysicalConnection.EndReading(IAsyncResult result):17
Message: [SocketException: An existing connection was forcibly closed by the remote host]
System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult):99

And:

Message: [RedisConnectionException: No connection is available to service this operation: EXEC]
StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T](TaskCompletionSource`1 source, Exception unthrownException):0
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw():12
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task):41
System.Runtime.CompilerServices.TaskAwaiter`1.GetResult():11
... app code ...

And:

Message: [TimeoutException: Timeout performing EXEC, inst: 1, mgr: Inactive, err: never, queue: 102, qu: 0, qs: 102, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=3,Free=32764,Min=4,Max=32767), clientName: RD0003FFAD174E]
StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImplT:34
StackExchange.Redis.RedisTransaction.Execute(CommandFlags flags):14

Solution proposal

Azure, Amazon and other cloud services all take the line that applications need to handle transient faults. SQL Azure drivers handle this with built-in retry support on the app-code side, but the recommended driver (SE.Redis) has no command retry support to shield app code from these transient faults.

It's not an easy problem to solve: not all commands are safe to retry, because some are not idempotent (e.g., INCR) and a failed attempt may or may not have been applied on the server.

I am wondering what the thoughts are about how to best approach a solution to this problem.

Driver-level support in SE.Redis would be a pretty good initial solution: a "command retry" option that blindly retries all commands (or optionally only "idempotent commands" like SET or GET) upon any form of connection/timeout failure, for up to X retries.
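As a rough illustration of what such an option could look like from the application side, here is a minimal sketch. CommandRetry, its allowlist, and the isTransient predicate are hypothetical names invented for this example; nothing here is part of StackExchange.Redis, and the set of "idempotent" commands is an assumption that would need review for a real workload.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not a StackExchange.Redis API.
public static class CommandRetry
{
    // Assumed allowlist of commands that are safe to repeat. INCR is deliberately absent.
    private static readonly HashSet<string> Idempotent =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "GET", "SET", "SMEMBERS", "EXISTS" };

    public static bool IsRetryable(string command) => Idempotent.Contains(command);

    // Runs 'operation' once, then retries up to 'maxRetries' more times, but only when
    // the command is on the allowlist and 'isTransient' accepts the exception.
    // Any other failure, or the last failed attempt, propagates to the caller.
    public static T Execute<T>(
        string command, Func<T> operation, int maxRetries, Func<Exception, bool> isTransient)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex) when (attempt < maxRetries && IsRetryable(command) && isTransient(ex))
            {
                // Swallow the transient failure; the next loop iteration retries the command.
            }
        }
    }
}
```

With SE.Redis, a call might look like CommandRetry.Execute("GET", () => db.StringGet(key), 3, ex => ex is RedisConnectionException || ex is TimeoutException), where db is an IDatabase and the exception types are the ones shown in the traces above. Immediate retries with no pause may simply fail again while the secondary is still taking over, so a delay between attempts is probably needed in practice.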

Ideally, there would be a Redis-level "This server is shutting down" notification that warns the driver to pause sending commands for a few moments while the underlying secondary takes over. That would be better, but it is a more coordinated solution that also involves the Redis team.

Thoughts?

Labels: connection, enhancement, Azure, timeout

All 5 comments

There is a retry mechanism in SE.Redis; we can use the options below in the connection string:

abortConnect=false
connectRetry=3000
connectTimeout=600000
syncTimeout=600000

Here is a sample connection string:

"contoso.redis.cache.windows.net,abortConnect=false,connectRetry=3000,connectTimeout=600000,syncTimeout=600000,ssl=true,password=weweweweweZNw1L4bIo0DgPxD9ytdwewe="

Using: StackExchange.Redis and ServiceStack.Redis

For documentation about the "connectRetry" parameter, I found this:

// RetryTimeout = 3000, (default 3000ms)
// To improve the resilience of client connections, RedisClient will transparently retry failed Redis operations due to Socket and I/O Exceptions in an exponential backoff starting from 10ms up until the RetryTimeout of 3000ms. These defaults can be tweaked with:
RedisConfig.DefaultRetryTimeout = 3000;
RedisConfig.BackOffMultiplier = 10;

I have hit the same issue at times, but that was with the default value of 3000 ms.

_This will not solve the problem described here (it is not a transient-failure mechanism), but it could help a little._
I was wondering about the impact of changing that default value to a higher one (5 seconds, for instance). I don't want it to decrease performance if this is happening at a higher rate than expected.

connectRetry specifies the number of connect attempts during the initial connect and is not a time value; abortConnect specifies whether retries should happen at all; connectTimeout and syncTimeout are the timeouts for connect and sync operations, respectively.

Also, you are using Azure Redis Cache, for which the recommended client is StackExchange.Redis, but the documentation you pasted above seems to be from ServiceStack.Redis. Can you please confirm which Redis client you are using?
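To make the distinction concrete, here is a small configuration sketch that treats connectRetry as a count and the two timeouts as milliseconds. The specific values are illustrative assumptions, not recommendations, and the host name is the placeholder from the sample connection string above. It needs the StackExchange.Redis package to compile.

```csharp
using StackExchange.Redis;

// Illustrative values only: connectRetry is a number of attempts,
// connectTimeout and syncTimeout are in milliseconds.
var options = ConfigurationOptions.Parse(
    "contoso.redis.cache.windows.net,abortConnect=false,connectRetry=3," +
    "connectTimeout=5000,syncTimeout=1000,ssl=true");

// The parsed values are exposed as typed properties.
int attempts = options.ConnectRetry;        // 3 attempts during the initial connect
int connectMs = options.ConnectTimeout;     // 5000 ms per connect attempt
int syncMs = options.SyncTimeout;           // 1000 ms for synchronous operations
bool abortOnFail = options.AbortOnConnectFail; // false
```

None of these settings retries an individual command such as GET or SET; they only govern connecting and how long a synchronous call waits.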

The original discussion was about retrying operations, not connections. It is my understanding that connectRetry, connectTimeout and abortConnect relate to retrying the actual connection, not the operation. Retrying the get/set operations would be extremely helpful and I'm currently looking for solutions before building one from scratch.

I can't find anything about syncTimeout.

Bump for this question.
Looking for guidance on retrying SET or GET operations in the event there is a temporary network issue that causes a SocketException.

We are migrating from ServiceStack to the StackExchange client. The code we are replacing, which used ServiceStack, caught exceptions and retried operations after a short Thread.Sleep. On most occasions the retry would work.

If there is a network issue that causes a System.Net.SocketException such as "An established connection was aborted by the software in your host machine" or "An existing connection was forcibly closed by the remote host", does StackExchange.Redis automatically retry up until the syncTimeout time has elapsed?

If not, are there any suggested steps that should happen between the initial failure and a retry in our code? Such as:

  • recreating the multiplexer? (I'd guess not)
  • waiting a short amount of time?
  • calling Close() and Configure()?

Just for clarification, I am talking about network issues when attempting StringSet or StringGet, not when initially connecting to the Redis server.
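For reference, the catch-and-wait pattern described above could be sketched roughly as follows, with an exponentially growing pause instead of a fixed sleep. TransientRetry and its parameters are hypothetical names made up for this sketch, not a StackExchange.Redis API. It assumes the existing multiplexer is reused and left to re-establish its connection by itself, which is an assumption this thread does not confirm.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical application-level helper for illustration; not a StackExchange.Redis API.
public static class TransientRetry
{
    // Retries 'operation' with a pause that doubles after each failed attempt.
    // 'maxAttempts' includes the first try. Errors rejected by 'isTransient',
    // and the error from the final attempt, propagate to the caller.
    public static async Task<T> RunAsync<T>(
        Func<Task<T>> operation, int maxAttempts, TimeSpan initialDelay,
        Func<Exception, bool> isTransient)
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Transient failure with attempts remaining: fall through to the pause below.
            }

            // Wait before the next attempt, then double the pause.
            await Task.Delay(delay).ConfigureAwait(false);
            delay = TimeSpan.FromTicks(delay.Ticks * 2);
        }
    }
}
```

A StringGet call might then be wrapped as TransientRetry.RunAsync(() => db.StringGetAsync(key), 4, TimeSpan.FromMilliseconds(50), ex => ex is RedisConnectionException || ex is TimeoutException), where the attempt count and delay are placeholders to tune. The same wrapper around StringSet is only safe if writing the same value twice is acceptable.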
