Currently I am running a load test using JMeter against our system, which is built on Grails 3 and runs on Tomcat. After sending 20k requests per second I get “no live upstreams while connecting to upstream client” in the nginx error log. Our application is multi-tenant, so I need to test under high load. Here is my nginx configuration:
worker_processes 16;
worker_rlimit_nofile 262144;
error_log /var/log/nginx/error.log;

events {
    worker_connections 24576;
    use epoll;
    multi_accept on;
}

http {
    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    keepalive_timeout 600;
    keepalive_requests 100000;
    access_log off;
    server_names_hash_max_size 4096;
    underscores_in_headers on;
    client_max_body_size 8192m;
    log_format vhost '$remote_addr - $remote_user [$time_local] $status "$request" $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for"';
    proxy_connect_timeout 120;
    proxy_send_timeout 120;
    proxy_read_timeout 120;
    gzip on;
    gzip_types text/plain application/xml text/css text/js text/xml application/x-javascript text/javascript application/json application/xml+rss image application/javascript;
    gzip_min_length 1000;
    gzip_static on;
    gzip_vary on;
    gzip_buffers 16 8k;
    gzip_comp_level 6;
    gzip_proxied any;
    gzip_disable "msie6";
    proxy_intercept_errors on;
    recursive_error_pages on;
    ssl_prefer_server_ciphers on;
    ssl_ciphers ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-RC4-SHA:ECDHE-RSA-AES256-SHA:RC4-SHA;
    include /etc/nginx/conf.d/*.conf;
}
How should I configure nginx for this level of concurrent load?
For me, the issue was with my proxy_pass entry. I had:

    location / {
        ...
        proxy_pass http://localhost:5001;
    }
This caused the upstream request to use the IPv4 loopback address or the IPv6 loopback address, but every now and again it would use the bare localhost name without the port number, resulting in the upstream error seen in the logs below.
[27/Sep/2018:16:23:37 +0100] <request IP> - - - <requested URI> to: [::1]:5001: GET /api/hc response_status 200
[27/Sep/2018:16:24:37 +0100] <request IP> - - - <requested URI> to: 127.0.0.1:5001: GET /api/hc response_status 200
[27/Sep/2018:16:25:38 +0100] <request IP> - - - <requested URI> to: localhost: GET /api/hc response_status 502
[27/Sep/2018:16:26:37 +0100] <request IP> - - - <requested URI> to: 127.0.0.1:5001: GET /api/hc response_status 200
[27/Sep/2018:16:27:37 +0100] <request IP> - - - <requested URI> to: [::1]:5001: GET /api/hc response_status 200
As you can see, I get a 502 status whenever the upstream is logged as bare "localhost:".
Changing my proxy_pass to http://127.0.0.1:5001 means that all requests now go over IPv4 with an explicit port.
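The corrected block (same location as above, just with the loopback IP spelled out):

    location / {
        ...
        proxy_pass http://127.0.0.1:5001;
    }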
This StackOverflow response was a big help in finding the issue, as it detailed changing the log format to make the problem visible.
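For reference, a log_format along these lines reproduces the diagnostic output above. This is a sketch, not the exact format from that answer; it relies on the standard $upstream_addr and $upstream_status variables to record which address each request was actually proxied to:

    log_format upstream_log '[$time_local] $remote_addr - $remote_user - $server_name '
                            'to: $upstream_addr: $request response_status $upstream_status';
    access_log /var/log/nginx/upstream.log upstream_log;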
I have seen this behavior many times during performance tests.
Under a heavy workload, your upstream server(s) may not be able to keep up, and the upstream module may mark them as unavailable.
The relevant parameters (on the server directive inside an upstream block) are:
max_fails=number
sets the number of unsuccessful attempts to communicate with the server that should happen in the duration set by the fail_timeout parameter to consider the server unavailable for a duration also set by the fail_timeout parameter. By default, the number of unsuccessful attempts is set to 1. The zero value disables the accounting of attempts. What is considered an unsuccessful attempt is defined by the proxy_next_upstream directive.
fail_timeout=time
sets:
- the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable;
- and the period of time the server will be considered unavailable.
By default, the parameter is set to 10 seconds.
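So for a load test it can make sense to relax these thresholds. A minimal sketch, assuming a hypothetical upstream named backend with placeholder addresses (swap in your real Tomcat hosts and ports):

    upstream backend {
        # max_fails=0 disables failure accounting entirely, so nginx never
        # marks this server unavailable during the test...
        server 10.0.0.1:8080 max_fails=0;
        # ...or keep accounting but tolerate more failures per window.
        server 10.0.0.2:8080 max_fails=10 fail_timeout=30s;
    }

    server {
        location / {
            proxy_pass http://backend;
        }
    }

Note that with max_fails=0 nginx keeps sending traffic to a struggling backend, so use it mainly to confirm whether the "no live upstreams" error comes from failure accounting rather than from the backend being genuinely down.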