Nginx Load Balancing Best Practices

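The following is a concise, production-oriented guide to Nginx load balancing that can be applied directly.

It summarizes practical experience: caveats, pitfalls, and tuning advice.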

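1. Define the goals: high availability, high performance, scalability

The core purposes of load balancing:

Distribute traffic so that no single machine becomes a bottleneck
Prevent cascading failures (one failed node stalling everything)
Keep the service scalable
Improve throughput and stability

**Remember:** a load balancer is more than simple round robin; it is a core component of overall service stability.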

2. Recommended upstream structure (keepalive + passive health checks + retry)

upstream backend {
    least_conn;

    server 10.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8000 max_fails=3 fail_timeout=30s;

    keepalive 50;  # recommended: reuse idle connections to cut backend connection overhead
}

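Recommended algorithm:

Prefer least_conn, which sends each request to the server with the fewest active connections, so a slow node does not drag down the whole system.

Why not the default round robin?

-> It ignores how many requests each server is still processing, so a node that is stalling keeps receiving its full share of traffic.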

3. Always enable proxy_next_upstream (retry) to avoid cascading failures

proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 3;

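What it does:

When a node fails -> the request is automatically passed to the next server
A single stalled node no longer takes the whole site down
Essential for API workloads

Note that by default Nginx does not retry non-idempotent requests (POST, LOCK, PATCH) once they have been sent to a backend, and that proxy_next_upstream_tries counts the first attempt as well as the retries.

In practice this is one of the most valuable settings for production resilience.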
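
Retries are safer when they also have an overall time budget. The following is a minimal sketch rather than a definitive setup: the /api/ path and the 10s budget are assumptions chosen for illustration, and the upstream name backend is taken from the earlier example.

```nginx
location /api/ {
    proxy_pass http://backend;

    # Pass the request to the next server on connection errors,
    # timeouts, and 5xx responses from the backend.
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504;

    # At most 3 attempts in total (the first try is included).
    proxy_next_upstream_tries 3;

    # Stop passing the request to further servers once 10 seconds
    # have elapsed (assumed budget; 0, the default, means no limit).
    proxy_next_upstream_timeout 10s;
}
```

Without a time budget, three attempts that each wait for a 30s read timeout could hold a client for well over a minute.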

4. Keep timeouts aggressive

proxy_connect_timeout 3s;
proxy_send_timeout    30s;
proxy_read_timeout    30s;

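Best practice:

Keep the connect timeout short (a backend that cannot accept a connection within 3 seconds should be treated as failed)
Read and send timeouts can be longer, according to business needs

These values directly determine how quickly a bad node is detected and taken out of rotation.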
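
When one endpoint legitimately needs more time, the timeout can be raised for that location only instead of loosening the defaults. A sketch under assumptions: the /export/ path and the 120s value are hypothetical.

```nginx
server {
    listen 80;

    # Defaults inherited by every location below.
    proxy_connect_timeout 3s;
    proxy_send_timeout    30s;
    proxy_read_timeout    30s;

    location / {
        proxy_pass http://backend;
    }

    # Hypothetical slow endpoint: allow a longer gap between reads.
    location /export/ {
        proxy_pass http://backend;
        proxy_read_timeout 120s;
    }
}
```

proxy_read_timeout limits the gap between two successive reads from the backend, not the total response time, so a slow but steadily streaming response is not cut off.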

5. Enable HTTP/1.1 and clear the Connection header (large performance gain)

proxy_http_version 1.1;
proxy_set_header Connection "";

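Why it is required:

No Connection: close header is sent to the backend
The TCP connection stays open and is reused
Throughput can improve by 2-5x (especially for short-lived API requests), depending on the workload

Both settings are recommended in the official Nginx documentation for upstream keepalive.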

6. Use weight to send more traffic to stronger nodes

server 10.0.0.1 weight=5;
server 10.0.0.2 weight=1;

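Machines with more CPU and memory -> give them a higher weight.

Do not treat a weak node as an equal, or it becomes the bottleneck for the whole cluster.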
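
For context, the weighted servers belong inside an upstream block. A minimal sketch, assuming the same port 8000 and failure parameters as the earlier examples (a server line without a port defaults to port 80):

```nginx
upstream backend {
    # With weights 5 and 1, out of every 6 requests about 5 go to
    # 10.0.0.1 and 1 goes to 10.0.0.2.
    server 10.0.0.1:8000 weight=5 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8000 weight=1 max_fails=3 fail_timeout=30s;
}
```

Weights are also taken into account by least_conn, so the two settings can be combined.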

7. Use backup nodes for disaster recovery

server 10.0.0.3 backup;

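A backup server receives traffic only when all primary servers are unavailable.

Suitable for:

Lower-spec standby machines
Temporary maintenance
Automatic failover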
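
Placed in a full upstream block, the backup server might look like the sketch below. Port 8000 for the backup node is an assumption made to match the other examples.

```nginx
upstream backend {
    server 10.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8000 max_fails=3 fail_timeout=30s;

    # Standby node: receives requests only while both servers
    # above are marked unavailable.
    server 10.0.0.3:8000 backup;
}
```

The backup parameter cannot be combined with the hash, ip_hash, or random balancing methods; it works with round robin and least_conn.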

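8. Avoid ip_hash for session stickiness (mobile networks are unstable)

Avoid:

ip_hash;

Reasons:

Mobile users change IP addresses frequently
Sessions are lost whenever the address changes
It is a poor fit for most public-facing services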

8.1 Use sticky (cookie); note that the sticky directive is available only in the commercial version

sticky cookie srv_id expires=1h path=/;
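
The sticky directive belongs to the commercial subscription. In the open-source build, a rough substitute is to hash on a cookie that the application already sets. This is a sketch under assumptions, not an equivalent of sticky: the cookie name sessionid is hypothetical, and Nginx does not issue the cookie itself.

```nginx
upstream backend {
    # Consistent hashing on the session cookie keeps one session on
    # one server and limits remapping when servers are added or removed.
    hash $cookie_sessionid consistent;

    server 10.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8000 max_fails=3 fail_timeout=30s;
}
```

Requests that arrive without the cookie all share the same empty key and therefore land on the same server, so this approach works best when the cookie is set on the first response.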

8.2 Or use Redis sessions / JWT (most recommended)

Decoupling the session from the node leaves the load balancer free to route each request to any server.

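9. Protect freshly started nodes: enable slow_start (commercial version)

Commercial version (slow_start is a parameter of the server directive):

server 10.0.0.1:8000 slow_start=30s;

A newly added or recovered node takes over requests gradually,
so a fresh instance is not overwhelmed by traffic right after it starts.

Open-source version: slow_start is not available; the weight has to be adjusted manually.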
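
A manual ramp-up in the open-source build could be sketched as follows. The address 10.0.0.4 and the weight steps are hypothetical; the idea is simply to start the new node with a small share and raise it in stages.

```nginx
upstream backend {
    least_conn;

    server 10.0.0.1:8000 weight=10 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8000 weight=10 max_fails=3 fail_timeout=30s;

    # Newly deployed node: begin with a small weight, then raise it
    # in steps (for example 1 -> 5 -> 10), running "nginx -s reload"
    # after each change.
    server 10.0.0.4:8000 weight=1 max_fails=3 fail_timeout=30s;
}
```

Unlike slow_start, this does nothing automatically when a failed node recovers; it only covers planned rollouts.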

10. Do not expose upstream error pages to clients

proxy_intercept_errors on;
error_page 500 502 503 504 /50x.html;

This avoids:

Leaking backend server details
Showing users raw 502/504 pages
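
The error_page target also needs a location that serves the file. A minimal sketch, assuming the page is stored under /usr/share/nginx/html (adjust the path to your installation):

```nginx
server {
    listen 80;

    proxy_intercept_errors on;
    error_page 500 502 503 504 /50x.html;

    location / {
        proxy_pass http://backend;
    }

    # Serve the static error page from local disk.
    # "internal" prevents clients from requesting it directly.
    location = /50x.html {
        root /usr/share/nginx/html;  # assumed path
        internal;
    }
}
```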

11. Set keepalive on every upstream (mandatory)

keepalive 50;

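With keepalive enabled:

Backend load drops sharply, because far fewer connections are opened and closed
Latency can drop by tens of milliseconds, since requests skip the TCP handshake
Throughput improves noticeably
Especially useful for short-lived API requests (FastAPI, Go, Node)

The value is the maximum number of idle connections kept open per worker process, not a limit on the total number of connections.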
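
The lifetime of pooled connections can be tuned alongside the pool size. The sketch below uses illustrative values, not recommendations from the original text; keepalive_timeout and keepalive_requests in the upstream context require Nginx 1.15.3 or later.

```nginx
upstream backend {
    least_conn;

    server 10.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8000 max_fails=3 fail_timeout=30s;

    keepalive 50;             # idle connections cached per worker process
    keepalive_timeout 60s;    # close a backend connection idle for 60s
    keepalive_requests 1000;  # recycle a connection after 1000 requests
}
```

Keeping keepalive_timeout below the backend's own idle timeout reduces the chance of Nginx reusing a connection the backend has already closed.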

12. Multiple upstreams -> deploy by module and avoid one giant upstream

Avoid this:

one huge upstream with 30 nodes

Best practice: split by business domain:

upstream user_api { ... }
upstream order_api { ... }
upstream file_api { ... }

Advantages:

Fault isolation
Performance isolation
Easier maintenance
Supports gray (canary) releases
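
Routing to the split upstreams is then a matter of one location per module. A sketch with hypothetical addresses and URL prefixes:

```nginx
upstream user_api {
    server 10.0.1.1:8000;
    server 10.0.1.2:8000;
}

upstream order_api {
    server 10.0.2.1:8000;
    server 10.0.2.2:8000;
}

upstream file_api {
    server 10.0.3.1:8000;
    server 10.0.3.2:8000;
}

server {
    listen 80;

    # Each URL prefix is served by its own pool, so an overloaded
    # file service cannot exhaust the user or order backends.
    location /user/  { proxy_pass http://user_api; }
    location /order/ { proxy_pass http://order_api; }
    location /file/  { proxy_pass http://file_api; }
}
```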

13. Enable gray release (controlled production rollout)

map $cookie_gray $upstream {
    1 backend_gray;
    default backend;
}

location / {
    proxy_pass http://$upstream;
}

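Effect:

Gray users (cookie gray=1) are routed to the gray upstream
Regular users are routed to the production upstream
Only a subset of traffic is affected
The rollout stays under control in production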
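
When the gray group should be a percentage of traffic rather than users holding a cookie, the split_clients module offers an alternative. A sketch, assuming upstreams named backend and backend_gray are defined elsewhere; the variable name $upstream_pool and the 5% share are illustrative.

```nginx
# Placed in the http context.
# Hash the client address and send about 5% of clients to the gray pool.
split_clients "${remote_addr}" $upstream_pool {
    5%  backend_gray;
    *   backend;
}

server {
    listen 80;

    location / {
        proxy_pass http://$upstream_pool;
    }
}
```

Because the split is a hash of the client address, a given client stays in the same group for as long as its address and the percentages are unchanged.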

14. Use the Nginx stream module for layer-4 load balancing (TCP/UDP)

Suitable for:

MySQL
Redis
PostgreSQL
MQTT
Custom TCP protocols

Example:

stream {
    upstream mysql_cluster {
        server 10.0.0.1:3306 max_fails=3 fail_timeout=30s;
        server 10.0.0.2:3306 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 3306;
        proxy_pass mysql_cluster;
    }
}
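
The same timeout discipline applies at layer 4. The variant below is a sketch with assumed values; the stream module is not built by default and has to be enabled with --with-stream (or loaded as a dynamic module, depending on the package).

```nginx
stream {
    upstream mysql_cluster {
        least_conn;
        server 10.0.0.1:3306 max_fails=3 fail_timeout=30s;
        server 10.0.0.2:3306 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 3306;

        proxy_connect_timeout 3s;  # fail fast when a backend is unreachable
        proxy_timeout 10m;         # close sessions with no traffic for 10 minutes
        proxy_pass mysql_cluster;
    }
}
```

The stream block sits at the top level of nginx.conf, next to the http block rather than inside it.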

15. Overview: recommended production combination (general-purpose)

upstream backend {
    least_conn;
    server 10.0.0.1:8000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8000 max_fails=3 fail_timeout=30s;
    keepalive 50;
}

server {
    listen 80;

    proxy_http_version 1.1;
    proxy_set_header Connection "";

    proxy_next_upstream         error timeout http_500 http_502 http_503 http_504;
    proxy_next_upstream_tries   3;

    proxy_connect_timeout   3s;
    proxy_send_timeout     30s;
    proxy_read_timeout     30s;

    location / {
        proxy_pass http://backend;
    }
}

This is a solid general-purpose baseline for a production load balancer; tune the values for your own traffic.

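16. Summary: the practices that matter

Prefer least_conn over round robin
Enable keepalive
Use an aggressive connect timeout (3s)
Retry automatically (proxy_next_upstream)
Detect failures passively (max_fails / fail_timeout)
Avoid ip_hash (mobile network issues)
Keep sessions in Redis or use JWT
Use separate upstreams for different workloads
Use backup servers to improve availability
Use custom error pages to hide backend errors
Split upstreams by module in production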