PostgreSQL 故障排查
下面是一份 真正生产可用、体系化的 PostgreSQL 故障排查指南。
覆盖 连接问题、性能问题、锁等待、死锁、WAL、磁盘、内存、复制、表膨胀、数据损坏、配置问题 等全部常见故障。
分为下面 10 大类故障:
- 数据库无法连接
- CPU、内存、磁盘暴涨
- 查询突然变慢
- 锁等待 / 死锁问题
- Autovacuum / 表膨胀问题
- WAL / checkpoint 写入变慢或卡死
- 磁盘空间不足 / WAL 堆积
- 复制集群问题(主从延迟、复制中断)
- 配置错误导致启动失败
- 数据损坏 / 索引损坏
每一类都包含:
- 症状
- 排查步骤
- 解决方案
常见表现
- psql / 应用连接失败
- “connection refused”
- “too many connections”
- “could not connect to server”
排查步骤
pg_ctl status
systemctl status postgresql
ss -lnt | grep 5432
SELECT count(*) FROM pg_stat_activity;
SHOW max_connections;
ufw status
iptables -L
定位最耗资源的 SQL
SELECT pid, usename, state, query_start, query
FROM pg_stat_activity
ORDER BY query_start ASC;
查看 CPU 等待 IO(IO wait 高则查询变慢)
iostat -x 1
查 PostgreSQL 自己的统计
SELECT * FROM pg_stat_bgwriter;
解决方案
- 加索引
- 调参(work_mem / shared_buffers / effective_cache_size)
- 优化 SQL
- 使用 PgBouncer 限制连接
排查顺序(必须记住):
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;
SELECT * FROM pg_stat_bgwriter;
SELECT * FROM pg_stat_activity WHERE xact_start < now() - interval '5 minutes';
SELECT * FROM pg_locks;
查阻塞链
SELECT blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks blocked_l ON blocked.pid = blocked_l.pid
JOIN pg_locks blocking_l
ON blocking_l.locktype = blocked_l.locktype
AND blocking_l.DATABASE = blocked_l.DATABASE
AND blocking_l.relation = blocked_l.relation
JOIN pg_stat_activity blocking
ON blocking.pid = blocking_l.pid
WHERE NOT blocked_l.granted;
SELECT pg_terminate_backend(<pid>);
开启:
log_lock_waits = on
log_min_duration_statement = 500ms
表膨胀检查(pgstattuple)
SELECT * FROM pgstattuple('your_table');
Autovacuum 是否落后
SELECT relname, last_vacuum, last_autovacuum
FROM pg_stat_user_tables;
自动清理不工作常见原因
- 表太大 -> autovacuum 跑不完
- 长事务阻塞 autovacuum
- autovacuum 参数太保守(默认不适合生产)
推荐参数
autovacuum_vacuum_scale_factor = 0.1
autovacuum_analyze_scale_factor = 0.05
symptom
- 写入突然卡死
- WAL 目录暴涨
查看 checkpoint 次数
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;
查看 WAL 大小
du -sh $PGDATA/pg_wal
建议配置
checkpoint_timeout = 15min
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_compression = on
查看数据目录大小
du -sh $PGDATA
查看最大的表
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
WAL 堆积常见原因
- 流复制断开 -> WAL 无法清除
- archive_command 失败
- checkpoint 太慢
解决:
- 修复复制
- 确保 archive_command 返回 0
- 增加 max_wal_size
查看复制延迟
SELECT * FROM pg_stat_replication;
字段:
- write_lag
- flush_lag
- replay_lag
常见复制问题
| 错误 | 原因 |
|---|---|
| “requested WAL segment has been removed” | 主库清理 WAL 前从库没读到 |
| “replication terminated by primary server” | 网络断 |
| “FATAL: could not receive data from WAL stream” | 网络抖动 |
解决方案
- 增大 wal_keep_size
- 增大备库硬件性能
- 使用 pg_basebackup 重建从库
执行:
journalctl -u postgresql
常见原因:
| 错误 | 原因 | 解决 |
|---|---|---|
| FATAL: lock file already exists | 上次异常退出 | 删除 postmaster.pid |
| FATAL: could not create shared memory | 参数过大 | 调整 shared_buffers |
| fatal: corrupt xlog | WAL 损坏 | pg_resetwal(危险) |
检查索引是否损坏
REINDEX TABLE table;
检查所有索引
REINDEX DATABASE dbname;
表损坏检查
SELECT * FROM pg_catalog.pg_stats WHERE null_frac > 0.9;
最后手段
pg_dump -> dropdb -> restore
大部分 PostgreSQL 故障都来自:
- 长事务
- 锁等待
- autovacuum 不够
- 表膨胀
- checkpoint 过于频繁
- 连接数过高
- 索引缺失或损坏
日常排查顺序:
SELECT * FROM pg_stat_activity;
SELECT * FROM pg_locks;
top / iostat / vmstat
查看日志
pg_stat_user_tables
pg_stat_bgwriter
du -sh pg_wal/