Currently I am running a load test using JMeter against our system, which is built on Grails 3 and runs on Tomcat. After sending 20k requests per second I get “no live upstreams while connecting to upstream client” in the nginx error log. Our application is multi-tenant, so I need to test under high load. Here is my nginx configuration:
worker_processes 16;
worker_rlimit_nofile 262144;
error_log /var/log/nginx/error.log;

events {
    worker_connections 24576;
    use epoll;
    multi_accept on;
}

http {
    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    keepalive_timeout 600;
    keepalive_requests 100000;
    access_log off;
    server_names_hash_max_size 4096;
    underscores_in_headers on;
    client_max_body_size 8192m;
    log_format vhost '$remote_addr - $remote_user [$time_local] $status "$request" $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for"';
    proxy_connect_timeout 120;
    proxy_send_timeout 120;
    proxy_read_timeout 120;
    gzip on;
    gzip_types text/plain application/xml text/css text/js text/xml application/x-javascript text/javascript application/json application/xml+rss image application/javascript;
    gzip_min_length 1000;
    gzip_static on;
    gzip_vary on;
    gzip_buffers 16 8k;
    gzip_comp_level 6;
    gzip_proxied any;
    gzip_disable "msie6";
    proxy_intercept_errors on;
    recursive_error_pages on;
    ssl_prefer_server_ciphers on;
    ssl_ciphers ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-RC4-SHA:ECDHE-RSA-AES256-SHA:RC4-SHA;
    include /etc/nginx/conf.d/*.conf;
}
How should I configure nginx for this level of concurrent load?
For me, the issue was with my proxy_pass entry. I had:

    location / {
        ...
        proxy_pass http://localhost:5001;
    }
This caused the upstream request to use the IPv4 loopback address or the IPv6 loopback address, but every now and again it would use the bare localhost name without the port number, resulting in the upstream error seen in the logs below.
[27/Sep/2018:16:23:37 +0100] <request IP> - - - <requested URI> to: [::1]:5001: GET /api/hc response_status 200
[27/Sep/2018:16:24:37 +0100] <request IP> - - - <requested URI> to: 127.0.0.1:5001: GET /api/hc response_status 200
[27/Sep/2018:16:25:38 +0100] <request IP> - - - <requested URI> to: localhost: GET /api/hc response_status 502
[27/Sep/2018:16:26:37 +0100] <request IP> - - - <requested URI> to: 127.0.0.1:5001: GET /api/hc response_status 200
[27/Sep/2018:16:27:37 +0100] <request IP> - - - <requested URI> to: [::1]:5001: GET /api/hc response_status 200
As you can see, I get a 502 status whenever the upstream is logged as bare "localhost:".
Changing my proxy_pass to http://127.0.0.1:5001 means that all requests now go over IPv4 with an explicit port.
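The corrected block (same location as above, just with the loopback IP spelled out):

    location / {
        ...
        proxy_pass http://127.0.0.1:5001;
    }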
This StackOverflow response was a big help in finding the issue, as it detailed changing the log format to make the problem visible.
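For reference, a log_format along these lines reproduces the diagnostic output above. This is a sketch, not the exact format from that answer; it relies on the standard $upstream_addr and $upstream_status variables to record which address each request was actually proxied to:

    log_format upstream_log '[$time_local] $remote_addr - $remote_user - $server_name '
                            'to: $upstream_addr: $request response_status $upstream_status';
    access_log /var/log/nginx/upstream.log upstream_log;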
I have seen this behavior many times during performance tests.
Under a heavy workload, your upstream server(s) may not be able to keep up, and the upstream module may mark them as unavailable.
The relevant parameters (on the server directive inside an upstream block) are:
max_fails=number
sets the number of unsuccessful attempts to communicate with the server that should happen in the duration set by the fail_timeout parameter to consider the server unavailable for a duration also set by the fail_timeout parameter. By default, the number of unsuccessful attempts is set to 1. The zero value disables the accounting of attempts. What is considered an unsuccessful attempt is defined by the proxy_next_upstream directive.
fail_timeout=time
sets:
- the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable;
- and the period of time the server will be considered unavailable.
By default, the parameter is set to 10 seconds.
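So for a load test it can make sense to relax these thresholds. A minimal sketch, assuming a hypothetical upstream named backend with placeholder addresses (swap in your real Tomcat hosts and ports):

    upstream backend {
        # max_fails=0 disables failure accounting entirely, so nginx never
        # marks this server unavailable during the test...
        server 10.0.0.1:8080 max_fails=0;
        # ...or keep accounting but tolerate more failures per window.
        server 10.0.0.2:8080 max_fails=10 fail_timeout=30s;
    }

    server {
        location / {
            proxy_pass http://backend;
        }
    }

Note that with max_fails=0 nginx keeps sending traffic to a struggling backend, so use it mainly to confirm whether the "no live upstreams" error comes from failure accounting rather than from the backend being genuinely down.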