Background
I'm using Spring cloud Brixton.RC2, with Zuul and Eureka.
I have one gateway service with @EnableZuulProxy
and a book-service
with a status
method. Via configuration I can emulate work on the status
method by sleeping a defined amount of time.
The Zuul route is simple
zuul.routes.foos.path=/foos/**
zuul.routes.foos.serviceId=reservation-service
I run two instances of the book-service
. When I set the sleeping time below the Hystrix timeout threshold (1000ms) I can see requests going to both instance of the book services. This works well.
Problem
I understand that if the Hystrix command fails, it should be possible for Ribbon to retry the command on a different server. This should make the failure transparent to the client.
I read the Ribbon configuration and added the following configuration in Zuul:
zuul.routes.reservation-service.retryable=true //not sure which one to try
zuul.routes.foos.retryable=true //not sure which one to try
ribbon.MaxAutoRetries=0 // I don't want to retry on the same host, I also tried with 1 it doesn't work either
ribbon.MaxAutoRetriesNextServer=2
ribbon.OkToRetryOnAllOperations=true
Now I update the configuration so that only one service sleeps for more than 1s, which means that I have one health service, and one bad one.
When I call the gateways the calls get send to both instances, and half the calls returns a 500. In the gateway I see the Hystrix timeout:
com.netflix.zuul.exception.ZuulException: Forwarding error
[...]
Caused by: com.netflix.hystrix.exception.HystrixRuntimeException: reservation-service timed-out and no fallback available.
[...]
Caused by: java.util.concurrent.TimeoutException: null
Why isn't Ribbon retrying the call on the other instance?
Am I Missing something here?
References
- Relates to this question (not solved)
- Ribbon configuration
- According to this commit Zuul should support retries for Ribbon
By default Zuul uses SEMAPHORE isolation strategy which doesn't allow to set a timeout. I haven't been able to use load balancing with this stragegy. What worked for me was (following your example):
1) Changing Zuul's isolation to THREAD:
hystrix:
command:
reservation-service:
execution:
isolation:
strategy: THREAD
thread:
timeoutInMilliseconds: 100000
IMPORTANT: timeoutInMilliseconds= 100000 is like saying no HystrixTimeout. Why? Because if Hystrix times out there won't be any load balancing (I just tested it playing with timeoutInMilliseconds)
Then, configure Ribbon's ReadTimeout to the desired value:
reservation-service:
ribbon:
ReadTimeout: 800
ConnectTimeout: 250
OkToRetryOnAllOperations: true
MaxAutoRetriesNextServer: 2
MaxAutoRetries: 0
In this case after the 1sec service times out in Ribbon it'll retry with the 500ms service
Below you have the log I got in my zuul instance:
o.s.web.servlet.DispatcherServlet : DispatcherServlet with name 'dispatcherServlet' processing GET request for [/api/stories]
o.s.web.servlet.DispatcherServlet : Last-Modified value for [/api/stories] is: -1
c.n.zuul.http.HttpServletRequestWrapper : Path = null
c.n.zuul.http.HttpServletRequestWrapper : Transfer-Encoding = null
c.n.zuul.http.HttpServletRequestWrapper : Content-Encoding = null
c.n.zuul.http.HttpServletRequestWrapper : Content-Length header = -1
c.n.loadbalancer.ZoneAwareLoadBalancer : Zone aware logic disabled or there is only one zone
c.n.loadbalancer.LoadBalancerContext : storyteller-api using LB returned Server: localhost:7799 for request /api/stories
---> ATTEMPTING THE SLOW SERVICE
com.netflix.niws.client.http.RestClient : RestClient sending new Request(GET: ) http://localhost:7799/api/stories
c.n.http4.MonitoredConnectionManager : Get connection: {}->http://localhost:7799, timeout = 250
com.netflix.http4.NamedConnectionPool : [{}->http://localhost:7799] total kept alive: 1, total issued: 0, total allocated: 1 out of 200
com.netflix.http4.NamedConnectionPool : No free connections [{}->http://localhost:7799][null]
com.netflix.http4.NamedConnectionPool : Available capacity: 50 out of 50 [{}->http://localhost:7799][null]
com.netflix.http4.NamedConnectionPool : Creating new connection [{}->http://localhost:7799]
com.netflix.http4.NFHttpClient : Attempt 1 to execute request
com.netflix.http4.NFHttpClient : Closing the connection.
c.n.http4.MonitoredConnectionManager : Released connection is not reusable.
com.netflix.http4.NamedConnectionPool : Releasing connection [{}->http://localhost:7799][null]
com.netflix.http4.NamedConnectionPool : Notifying no-one, there are no waiting threads
--- HERE'S RIBBON'S TIMEOUT
c.n.l.reactive.LoadBalancerCommand : Got error com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out when executed on server localhost:7799
c.n.loadbalancer.ZoneAwareLoadBalancer : Zone aware logic disabled or there is only one zone
c.n.loadbalancer.LoadBalancerContext : storyteller-api using LB returned Server: localhost:9977 for request /api/stories
---> HERE IT RETRIES
com.netflix.niws.client.http.RestClient : RestClient sending new Request(GET: ) http://localhost:9977/api/stories
c.n.http4.MonitoredConnectionManager : Get connection: {}->http://localhost:9977, timeout = 250
com.netflix.http4.NamedConnectionPool : [{}->http://localhost:9977] total kept alive: 1, total issued: 0, total allocated: 1 out of 200
com.netflix.http4.NamedConnectionPool : Getting free connection [{}->http://localhost:9977][null]
com.netflix.http4.NFHttpClient : Stale connection check
com.netflix.http4.NFHttpClient : Attempt 1 to execute request
com.netflix.http4.NFHttpClient : Connection can be kept alive indefinitely
c.n.http4.MonitoredConnectionManager : Released connection is reusable.
com.netflix.http4.NamedConnectionPool : Releasing connection [{}->http://localhost:9977][null]
com.netflix.http4.NamedConnectionPool : Pooling connection [{}->http://localhost:9977][null]; keep alive indefinitely
com.netflix.http4.NamedConnectionPool : Notifying no-one, there are no waiting threads
o.s.web.servlet.DispatcherServlet : Null ModelAndView returned to DispatcherServlet with name 'dispatcherServlet': assuming HandlerAdapter completed request handling
o.s.web.servlet.DispatcherServlet : Successfully completed request
o.s.web.servlet.DispatcherServlet : DispatcherServlet with name 'dispatcherServlet' processing GET request for [/favicon.ico]
o.s.w.s.handler.SimpleUrlHandlerMapping : Matching patterns for request [/favicon.ico] are [/**/favicon.ico]
o.s.w.s.handler.SimpleUrlHandlerMapping : URI Template variables for request [/favicon.ico] are {}
o.s.w.s.handler.SimpleUrlHandlerMapping : Mapping [/favicon.ico] to HandlerExecutionChain with handler [ResourceHttpRequestHandler [locations=[ServletContext resource [/], class path resource [META-INF/resources/], class path resource [resources/], class path resource [static/], class path resource [public/], class path resource []], resolvers=[org.springframework.web.servlet.resource.PathResourceResolver@a0d875d]]] and 1 interceptor
o.s.web.servlet.DispatcherServlet : Last-Modified value for [/favicon.ico] is: -1
o.s.web.servlet.DispatcherServlet : Null ModelAndView returned to DispatcherServlet with name 'dispatcherServlet': assuming HandlerAdapter completed request handling
o.s.web.servlet.DispatcherServlet : Successfully completed request