Istio Envoy 504 gateway timeouts after 15 seconds

Posted 2020-02-11 17:41

Question:

Background:

I am using Istio 1.2.5.

I deployed Istio using the default Helm profile from the Istio documentation, enabling tracing, Kiali, and setting the proxy logLevel to "debug".
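For reference, the install command looked roughly like this (a sketch; the release name is a placeholder, and the tracing.enabled, kiali.enabled, and global.proxy.logLevel values follow the Istio 1.2 Helm chart):

# Helm 2 style install of the Istio 1.2 chart with tracing, Kiali, and debug proxy logs
helm install install/kubernetes/helm/istio --name istio --namespace istio-system \
  --set tracing.enabled=true \
  --set kiali.enabled=true \
  --set global.proxy.logLevel=debug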

My pods and services in the istio-system namespace look like this:

(⎈ |cluster-dev:default)➜  istio-1.2.5 git:(master) ✗ k get pods -n istio-system
NAME                                      READY     STATUS    RESTARTS   AGE
grafana-97fb6966d-cv5fq                   1/1       Running   0          1d
istio-citadel-76f9586b8b-4bbcx            1/1       Running   0          1d
istio-galley-78f65c8469-v5cmn             1/1       Running   0          1d
istio-ingressgateway-5d5487c666-jjhb7     1/1       Running   0          1d
istio-pilot-65cb5648bf-4nfl7              2/2       Running   0          1d
istio-policy-8596cc6554-7sgzt             2/2       Running   0          1d
istio-sidecar-injector-76f487845d-ktf6p   1/1       Running   0          1d
istio-telemetry-5c6b6d59f6-lppqt          2/2       Running   0          1d
istio-tracing-595796cf54-zsnvj            1/1       Running   0          1d
kiali-55fcfc86cc-p2jrk                    1/1       Running   0          1d
prometheus-5679cb4dcd-h7qsj               1/1       Running   0          1d

(⎈ |cluster-dev:default)➜  istio-1.2.5 git:(master) ✗ k get svc -n istio-system
NAME                     TYPE           CLUSTER-IP       EXTERNAL-IP        PORT(S)                                                                                                                                      AGE
grafana                  ClusterIP      172.20.116.246   <none>             3000/TCP                                                                                                                                     1d
istio-citadel            ClusterIP      172.20.177.97    <none>             8060/TCP,15014/TCP                                                                                                                           1d
istio-galley             ClusterIP      172.20.162.16    <none>             443/TCP,15014/TCP,9901/TCP                                                                                                                   1d
istio-ingressgateway     LoadBalancer   172.20.199.160   xxxxxxxxxxxxx...   15020:31334/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:30200/TCP,15030:32111/TCP,15031:32286/TCP,15032:32720/TCP,15443:30857/TCP   1d
istio-pilot              ClusterIP      172.20.137.21    <none>             15010/TCP,15011/TCP,8080/TCP,15014/TCP                                                                                                       1d
istio-policy             ClusterIP      172.20.188.114   <none>             9091/TCP,15004/TCP,15014/TCP                                                                                                                 1d
istio-sidecar-injector   ClusterIP      172.20.47.238    <none>             443/TCP                                                                                                                                      1d
istio-telemetry          ClusterIP      172.20.77.52     <none>             9091/TCP,15004/TCP,15014/TCP,42422/TCP                                                                                                       1d
jaeger-agent             ClusterIP      None             <none>             5775/UDP,6831/UDP,6832/UDP                                                                                                                   1d
jaeger-collector         ClusterIP      172.20.225.255   <none>             14267/TCP,14268/TCP                                                                                                                          1d
jaeger-query             ClusterIP      172.20.181.245   <none>             16686/TCP                                                                                                                                    1d
kiali                    ClusterIP      172.20.72.227    <none>             20001/TCP                                                                                                                                    1d
prometheus               ClusterIP      172.20.25.75     <none>             9090/TCP                                                                                                                                     1d
tracing                  ClusterIP      172.20.211.135   <none>             80/TCP                                                                                                                                       1d
zipkin                   ClusterIP      172.20.204.123   <none>             9411/TCP                                                                                                                                     1d

I haven't used any egress gateway, and my outboundTrafficPolicy mode is ALLOW_ANY, so I am assuming I don't need any ServiceEntry.
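(For context, ALLOW_ANY is what routes unknown external hosts through Envoy's PassthroughCluster. A minimal sketch of where that mode lives, assuming the default Istio 1.2 Helm chart layout:)

# values.yaml fragment (Istio 1.2 Helm chart)
global:
  outboundTrafficPolicy:
    mode: ALLOW_ANY   # sidecars pass traffic to unknown external hosts as-is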

I am using the nginx ingress controller as the entry point to my cluster and am not using the Istio ingress gateway.

Issue:

I have a microservice in my cluster that calls an external URL (a legacy system outside my cluster) with an HTTP POST request. That system normally responds to my microservice within about 25 seconds and has a hard timeout of 30 seconds on its side.

When I am not using the Istio sidecar, my microservice responds normally. But after deploying Istio with the sidecar, I get a 504 gateway timeout every time after exactly 15 seconds.

Logs of the microservice as well as the istio-proxy:

Microservice logs without Istio (search response took 21.957 seconds in the logs):

2019-09-06 19:42:20.113  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-4] c.s.t.s.impl.common.PrepareRequest       : Start Preparing search request
2019-09-06 19:42:20.117  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-4] c.s.t.s.impl.common.PrepareRequest       : Done Preparing search request
2019-09-06 19:42:42.162  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-8] c.s.t.service.impl.TVSearchServiceImpl   : xxxx search response took 21.957 Seconds
2019-09-06 19:42:42.292  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-8] c.s.t.service.impl.common.Transformer    : Doing transformation of supplier response into our response
2019-09-06 19:42:42.296  INFO [xx-xx-adapter,9b32565791541300,9b32565791541300,false] 1 --- [or-http-epoll-8] c.s.t.service.impl.common.Transformer    : Transformer: Parsing completed in 3 mSeconds

Microservice logs with Istio (response took 15.009 seconds in the logs):

2019-09-06 19:40:00.048  INFO [xxx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-3] c.s.t.s.impl.common.PrepareRequest       : Start Preparing search request
2019-09-06 19:40:00.048  INFO [xxx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-3] c.s.t.s.impl.common.PrepareRequest       : Done Preparing search request

2019-09-06 19:40:15.058  INFO [xx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] c.s.t.service.impl.xxxx   : xxx Search Request {"rqst":{"Request":{"__type":"xx","CheckIn":"/Date(1569628800000+0000)/","CheckOut":"/Date(1569801600000+0000)/","DetailLevel":9,"ExcludeHotelDetails":false,"GeoLocationInfo":{"Latitude":25.204849,"Longitude":55.270782},"Nights":0,"RadiusInMeters":25000,"Rooms":[{"AdultsCount":2,"KidsAges":[2]}],"DesiredResultCurrency":"EUR","Residency":"GB","TimeoutSeconds":25,"ClientIP":"127.0.0.1"},"RequestType":1,"TypeOfService":2,"Credentials":{"UserName":"xxxx","Password":"xx"}}}
2019-09-06 19:40:15.058 ERROR [xx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] c.s.t.service.impl.xxxx   : xxx Search request failed 504 GATEWAY_TIMEOUT

2019-09-06 19:40:15.058  INFO [xxx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] c.s.t.service.impl.xxxx   : xx search response took 15.009 Seconds

2019-09-06 19:40:15.059 ERROR [xxx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] a.w.r.e.AbstractErrorWebExceptionHandler : [79d38e2f] 500 Server Error for HTTP POST "/search/geo-location"

java.lang.RuntimeException: Error occurred, We did not receive proper search response from xx please check internal logs for more info

<Java Stack trace >

2019-09-06 19:40:15.061 ERROR [xxx-xx-adapter,32c55821a507d6f3,32c55821a507d6f3,false] 1 --- [or-http-epoll-7] c.s.t.service.impl.xxxx   : xxx search response upstream request timeout
2019-09-06 19:41:16.081  INFO [xxx-xx--adapter,,,] 1 --- [      Thread-22] o.s.s.concurrent.ThreadPoolTaskExecutor  : Shutting down ExecutorService 'executor'

Envoy sidecar proxy logs (masked). If I read the Envoy access-log format correctly, the UT response flag below means upstream request timeout, and the request duration is 14997 ms:

[2019-09-06T20:32:15.418Z] "POST /xxxxxx/xxxxx.svc/xx/ServiceRequest HTTP/1.1" 504 UT "-" "-" 517 24 14997 - "-" "ReactorNetty/0.8.10.RELEASE" "c273fac1-8xxa-xxxx-xxxx-xxxxxx" "testdomain.testurl.com" "40.67.217.71:80" PassthroughCluster - 1.7.17.71:80 1.20.21.25:42386 -

[2019-09-06 20:39:01.719][34][debug][router] [external/envoy/source/common/router/router.cc:332] [C57][S17104382791712695742] cluster 'PassthroughCluster' match for URL '/xxxxx/xxxx.svc/json/ServiceRequest'
[2019-09-06 20:39:01.719][34][debug][upstream] [external/envoy/source/common/upstream/original_dst_cluster.cc:87] Created host 40.67.217.71:80.
[2019-09-06 20:39:01.719][34][debug][router] [external/envoy/source/common/router/router.cc:393] [C57][S17104382791712695742] router decoding headers:
':authority', 'x.x.com'
':path', '/xxxxx/xxxxx.svc/json/ServiceRequest'
':method', 'POST'
':scheme', 'http'
'user-agent', 'ReactorNetty/0.8.10.RELEASE'
'accept', 'application/json'
'accept-encoding', 'gzip, deflate'
'x-newrelic-transaction', 'PxRSBVVQXAdVUgNTUgcPUQUBFB8EBw8RVU4aB1wLB1YHAA8DAAQFWlNXB0NKQV5XCVVQAQcGFTs='
'x-newrelic-id', 'VgUDWFVaARADUFNWAgQHV1A='
'content-type', 'application/json;charset=UTF-8'
'content-length', '517'
'x-forwarded-proto', 'http'
'x-request-id', '750f4fdb-83f9-409c-9ecf-e0a1fdacbb65'
'x-istio-attributes', 'CloKCnNvdXJjZS51aWQSTBJKa3ViZXJuZXRlczovL2hvdGVscy10di1hZGFwdGVyLXNlcnZpY2UtZGVwbG95bWVudC03ZjQ0ZDljNjVjLWhweHo3LmRlZmF1bHQ='
'x-envoy-expected-rq-timeout-ms', '15000'
'x-b3-traceid', '971ac547c63fa66e'
'x-b3-spanid', '58c12e7da54ae50f'
'x-b3-parentspanid', 'dc7bda5b98d522bf'
'x-b3-sampled', '0'

[2019-09-06 20:39:01.719][34][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:88] creating a new connection
[2019-09-06 20:39:01.719][34][debug][client] [external/envoy/source/common/http/codec_client.cc:26] [C278] connecting
[2019-09-06 20:39:01.719][35][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:973] membership update for TLS cluster PassthroughCluster added 1 removed 0
[2019-09-06 20:39:01.719][28][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:973] membership update for TLS cluster PassthroughCluster added 1 removed 0
[2019-09-06 20:39:01.719][34][debug][connection] [external/envoy/source/common/network/connection_impl.cc:704] [C278] connecting to 40.67.217.71:80
[2019-09-06 20:39:01.720][35][debug][upstream] [external/envoy/source/common/upstream/original_dst_cluster.cc:41] Adding host 40.67.217.71:80.
[2019-09-06 20:39:01.720][28][debug][upstream] [external/envoy/source/common/upstream/original_dst_cluster.cc:41] Adding host 40.67.217.71:80.
[2019-09-06 20:39:01.720][34][debug][connection] [external/envoy/source/common/network/connection_impl.cc:713] [C278] connection in progress
[2019-09-06 20:39:01.720][34][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:20] queueing request due to no available connections
[2019-09-06 20:39:01.720][34][debug][filter] [src/envoy/http/mixer/filter.cc:102] Called Mixer::Filter : decodeData (517, false)
[2019-09-06 20:39:01.720][34][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1079] [C57][S17104382791712695742] request end stream
[2019-09-06 20:39:01.720][34][debug][filter] [src/envoy/http/mixer/filter.cc:102] Called Mixer::Filter : decodeData (0, true)
[2019-09-06 20:39:01.720][34][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:973] membership update for TLS cluster PassthroughCluster added 1 removed 0
[2019-09-06 20:39:01.748][34][debug][connection] [external/envoy/source/common/network/connection_impl.cc:552] [C278] connected
[2019-09-06 20:39:01.748][34][debug][client] [external/envoy/source/common/http/codec_client.cc:64] [C278] connected
[2019-09-06 20:39:01.748][34][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:245] [C278] attaching to next request
[2019-09-06 20:39:01.748][34][debug][router] [external/envoy/source/common/router/router.cc:1210] [C57][S17104382791712695742] pool ready
[2019-09-06 20:39:02.431][35][debug][filter] [external/envoy/source/extensions/filters/listener/original_dst/original_dst.cc:18] original_dst: New connection accepted
[2019-09-06 20:39:02.431][35][debug][filter] [external/envoy/source/extensions/filters/listener/tls_inspector/tls_inspector.cc:72] tls inspector: new connection accepted
[2019-09-06 20:39:02.431][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:30] Called tcp filter: Filter
[2019-09-06 20:39:02.431][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:40] Called tcp filter: initializeReadFilterCallbacks
[2019-09-06 20:39:02.431][35][debug][filter] [external/envoy/source/common/tcp_proxy/tcp_proxy.cc:200] [C279] new tcp proxy session
[2019-09-06 20:39:02.431][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:135] [C279] Called tcp filter onNewConnection: remote 1.210.4.2:39482, local 1.10.1.5:80
[2019-09-06 20:39:02.431][35][debug][filter] [external/envoy/source/common/tcp_proxy/tcp_proxy.cc:343] [C279] Creating connection to cluster inbound|80|xx-xx-adapter-service|xx-xx-adapter-service.default.svc.cluster.local
[2019-09-06 20:39:02.431][35][debug][pool] [external/envoy/source/common/tcp/conn_pool.cc:80] creating a new connection
[2019-09-06 20:39:02.431][35][debug][pool] [external/envoy/source/common/tcp/conn_pool.cc:372] [C280] connecting
[2019-09-06 20:39:02.431][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:704] [C280] connecting to 127.0.0.1:80
[2019-09-06 20:39:02.431][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:713] [C280] connection in progress
[2019-09-06 20:39:02.431][35][debug][pool] [external/envoy/source/common/tcp/conn_pool.cc:106] queueing request due to no available connections
[2019-09-06 20:39:02.431][35][debug][main] [external/envoy/source/server/connection_handler_impl.cc:257] [C279] new connection
[2019-09-06 20:39:02.431][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:552] [C280] connected
[2019-09-06 20:39:02.431][35][debug][pool] [external/envoy/source/common/tcp/conn_pool.cc:293] [C280] assigning connection
[2019-09-06 20:39:02.431][35][debug][filter] [external/envoy/source/common/tcp_proxy/tcp_proxy.cc:542] TCP:onUpstreamEvent(), requestedServerName:
[2019-09-06 20:39:02.431][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:143] Called tcp filter completeCheck: OK
[2019-09-06 20:39:02.432][35][debug][filter] [src/istio/control/client_context_base.cc:140] Report attributes: attributes {
  key: "connection.event"
  value {
    string_value: "open"
  }
}
attributes {
  key: "connection.id"
  value {
    string_value: "82e869af-aec6-406a-8a52-4168a19eb1f0-279"

[2019-09-06 20:39:02.432][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:102] [C279] Called tcp filter onRead bytes: 130
[2019-09-06 20:39:02.435][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:125] [C279] Called tcp filter onWrite bytes: 147
[2019-09-06 20:39:02.435][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:102] [C279] Called tcp filter onRead bytes: 0
[2019-09-06 20:39:02.436][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:125] [C279] Called tcp filter onWrite bytes: 0
[2019-09-06 20:39:02.436][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:520] [C280] remote close
[2019-09-06 20:39:02.436][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:190] [C280] closing socket: 0
[2019-09-06 20:39:02.436][35][debug][pool] [external/envoy/source/common/tcp/conn_pool.cc:121] [C280] client disconnected
[2019-09-06 20:39:02.436][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C279] closing data_to_write=0 type=0
[2019-09-06 20:39:02.436][35][debug][connection] [external/envoy/source/common/network/connection_impl.cc:190] [C279] closing socket: 1
[2019-09-06 20:39:02.436][35][debug][filter] [src/envoy/tcp/mixer/filter.cc:174] [C279] Called tcp filter onEvent: 1 upstream 127.0.0.1:80
[2019-09-06 20:39:02.436][35][debug][filter] [src/istio/control/client_context_base.cc:140] Report attributes: attributes {
  key: "connection.duration"
  value {
    duration_value {
      nanos: 4358000
    }
  }
}
attributes {
  key: "connection.event"
  value {
    string_value: "close"
  }
}
at
[2019-09-06 20:39:02.436][35][debug][main] [external/envoy/source/server/connection_handler_impl.cc:68] [C279] adding to cleanup list
[2019-09-06 20:39:02.436][35][debug][pool] [external/envoy/source/common/tcp/conn_pool.cc:246] [C280] connection destroyed

I tried creating a VirtualService over my microservice with a timeout of 30 seconds, but no luck; a sketch of what I tried is below.
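(Names here are placeholders for the actual service; this only attaches the timeout to the in-mesh host, which in hindsight is why it did not affect the outbound call:)

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: xx-xx-adapter-vs            # placeholder name
spec:
  hosts:
  - xx-xx-adapter-service           # in-mesh service host (placeholder)
  http:
  - timeout: 30s                    # had no effect on the external call
    route:
    - destination:
        host: xx-xx-adapter-service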

I am not sure what I am missing; any help is appreciated.

Crux: Internet (or a pod console) --> microservice --> the microservice calls a third-party legacy URL internally in code and times out at exactly 15 seconds every time.

Edit: adding more details.

Just to test the delay, I tried "curl http://slowwly.robertomurray.co.uk/delay/17000/url/http://www.google.com" (17000 = 17 seconds) from one of the microservice pods that has the istio-proxy sidecar, and it always times out at 15 seconds. I am not sure where to change this 15-second Envoy setting.
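(To make the timing visible, the same test can be run with curl's time reporting; the -w format specifiers are standard curl:)

# run from inside an istio-injected pod; -w prints status and total time
curl -s -o /dev/null -w "%{http_code} after %{time_total}s\n" \
  http://slowwly.robertomurray.co.uk/delay/17000/url/http://www.google.com
# -> 504 after ~15.0s with the sidecar in place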

Answer 1:

After a lot of trial and error, we got this working.

Reference: https://github.com/istio/istio/issues/16915#issuecomment-529210672

For any outbound traffic, whether or not you have an egress gateway, the default timeout is 15 seconds. If you want to increase it, you need a ServiceEntry that registers the external host plus a VirtualService that defines the timeout.

ServiceEntry

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-se-test
spec:
  hosts:
  - slowwly.robertomurray.co.uk
  location: MESH_EXTERNAL
  ports:
  - number: 80
    name: example-http
    protocol: HTTP
  resolution: DNS

VirtualService

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: slow-ext
spec:
  hosts:
    - slowwly.robertomurray.co.uk
  http:
  - timeout: 30s
    route:
      - destination:
          host: slowwly.robertomurray.co.uk
        weight: 100
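Once both are applied (the file names below are placeholders), the earlier curl test should run past 15 seconds:

kubectl apply -f service-entry.yaml
kubectl apply -f virtual-service.yaml
# re-test from the pod; with timeout: 30s the 17-second delay now succeeds
curl -s -o /dev/null -w "%{http_code} after %{time_total}s\n" \
  http://slowwly.robertomurray.co.uk/delay/17000/url/http://www.google.com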


Answer 2:

It seems 15 seconds is the default timeout value (your proxy logs even show the header 'x-envoy-expected-rq-timeout-ms', '15000'). Whether it is Istio or Envoy that sets it, I have yet to read further.

But there are a couple of reported issues, such as #1888 (Istio 0.2.12 and Kubernetes 1.8.1) and #6860, which were discussed as being very similar to your issue:

INGRESS > PUBLICSERVICE (Timeout 60 works)

PUBLICSERVICE > BACKENDSERVICE (Timeout 60 works)

but

INGRESS > PUBLICSERVICE > BACKENDSERVICE (i get timeout after default 15seconds)

You can read through the discussion on #6860, but in summary, a fix was introduced starting with 1.0.0 so that the timeout can be set greater than 15 seconds. To do that, you need to set the timeout in your VirtualServices; for example, set 60 seconds on the internal service and 40 seconds on the external service so that there is enough allowance. A rough sketch follows.
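(A minimal sketch with placeholder names, using the same v1alpha3 API as the manifests above; the outer hop's timeout must exceed the inner one:)

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: public-service-vs       # placeholder
spec:
  hosts:
  - public-service              # in-mesh host (placeholder)
  http:
  - timeout: 60s                # > the 40s allowed for the backend hop
    route:
    - destination:
        host: public-service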

Please check and see if it helps.