Linux 运维实战手册#
面向运维工程师的实战参考手册,涵盖监控、日志、备份、自动化、安全加固、高可用、性能调优、故障排查、信创系统等核心运维领域。与 Linux-使用手册.md 互补,侧重运维场景与操作实践。
第一章:运维体系概述#
1.1 运维核心职责#
| 领域 | 职责 | 关键指标 |
| --- | --- | --- |
| 监控 | 7×24 系统状态感知 | MTTR, MTTD |
| 变更管理 | 可控的系统变更流程 | 变更成功率 |
| 容量规划 | 资源趋势分析与扩容 | 资源利用率 |
| 故障处理 | 快速定位与恢复 | SLA, RTO, RPO |
| 安全合规 | 系统加固与审计 | 漏洞修复时效 |
| 自动化 | 减少人工操作 | 自动化覆盖率 |
1.2 运维 SLA 指标#
可用性 = (总时间 - 故障时间) / 总时间 × 100%
99%     → 允许停机 87.6 小时/年 (两个9)
99.9%   → 允许停机 8.76 小时/年 (三个9)
99.99%  → 允许停机 52.56 分钟/年 (四个9)
99.999% → 允许停机 5.26 分钟/年 (五个9)
RTO (Recovery Time Objective): 恢复时间目标, 业务从中断到恢复可用的最长时间
RPO (Recovery Point Objective): 恢复点目标, 可容忍丢失的数据时间窗口
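上面"几个9"的停机时间换算可以用一行 awk 直接验证(演示脚本, 非固定运维工具):

```bash
#!/bin/sh
# 按可用性百分比计算每年允许的停机时间 (分钟)
downtime_minutes() {
    awk -v a="$1" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}

downtime_minutes 99.9    # 525.60 (约 8.76 小时)
downtime_minutes 99.99   # 52.56
downtime_minutes 99.999  # 5.26
```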
1.3 运维工具链全景#
┌─────────────────────────────────────────────────────────────────┐
│ 运维工具矩阵 │
├──────────────┬──────────────────┬───────────────────────────────┤
│ 监控告警 │ Prometheus │ Grafana, Zabbix, Nagios │
│ 日志管理 │ ELK, Loki │ Splunk, Graylog │
│ 自动化 │ Ansible │ SaltStack, Puppet, Chef │
│ CI/CD │ Jenkins, GitLab │ ArgoCD, Tekton │
│ 配置管理 │ Ansible, Terraform│ Pulumi │
│ 容器编排 │ K8s, K3s │ Nomad, Docker Swarm │
│ 备份恢复 │ restic, Borg │ Bacula, Veeam │
│ 安全扫描 │ Trivy, ClamAV │ OpenSCAP, Lynis │
│ 网络诊断 │ tcpdump, nmap │ Wireshark, mtr │
│ 压力测试 │ wrk, ab, sysbench│ JMeter, Locust │
│ 信创/国产 │ 麒麟, 统信UOS │ 欧拉, Anolis OS │
└──────────────┴──────────────────┴───────────────────────────────┘
第二章:监控体系建设#
2.1 Prometheus 监控栈#
2.1.1 架构概览#
┌──────────┐ ┌──────────┐ ┌──────────┐
│ node_ex │ │ mysql_ex │ │ nginx_ex │ ← Exporters
│ porter │ │ porter │ │ porter │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────┐
│ Prometheus Server │
│ (Pull metrics / 存储时序数据 / 告警判定) │
└──────────┬────────────────────┬─────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Grafana │ │ AlertManager │
│ (可视化) │ │ (告警管理) │
└──────────────┘ └──────┬───────┘
│
▼
┌──────────────┐
│ Webhook/邮件 │
│ 微信/钉钉/飞书│
└──────────────┘
2.1.2 Prometheus 安装配置#
# === 下载安装 ===
cd /opt
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xzf prometheus-2.52.0.linux-amd64.tar.gz
ln -s prometheus-2.52.0.linux-amd64 prometheus
# === 创建 systemd 服务 ===
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
--config.file=/opt/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus \
--storage.tsdb.retention.time=30d \
--web.enable-lifecycle \
--web.external-url=http://prometheus.example.com
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
useradd -r -s /sbin/nologin prometheus
mkdir -p /data/prometheus
chown -R prometheus:prometheus /opt/prometheus /data/prometheus
systemctl daemon-reload && systemctl enable --now prometheus
2.1.3 prometheus.yml 核心配置#
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
datacenter: 'bj-idc-01'
env: 'production'
# 告警规则文件
rule_files:
- 'rules/*.yml'
# 告警管理器
alerting:
alertmanagers:
- static_configs:
- targets: ['localhost:9093']
# 采集目标
scrape_configs:
# Prometheus 自身
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter (系统指标)
- job_name: 'node'
scrape_interval: 30s
static_configs:
- targets:
- '192.168.1.11:9100'
- '192.168.1.12:9100'
- '192.168.1.13:9100'
labels:
env: 'production'
# 基于文件的动态发现
file_sd_configs:
- files:
- '/opt/prometheus/targets/node/*.json'
refresh_interval: 5m
# MySQL Exporter
- job_name: 'mysql'
static_configs:
- targets: ['192.168.1.21:9104']
labels:
instance: 'mysql-master'
# Redis Exporter
- job_name: 'redis'
static_configs:
- targets: ['192.168.1.31:9121']
# Nginx Exporter (需 nginx-module-vts)
- job_name: 'nginx'
static_configs:
- targets: ['192.168.1.41:9113']
2.1.4 node_exporter 部署#
# 安装 node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
tar xzf node_exporter-1.8.0.linux-amd64.tar.gz
mv node_exporter-1.8.0.linux-amd64/node_exporter /usr/local/bin/
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.tcpstat \
--collector.filesystem.mount-points-exclude='^/(dev|proc|sys|run|var/lib/docker/.+|var/lib/kubelet/.+)' \
--web.listen-address=:9100
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now node_exporter
2.1.5 常用 Exporters 速查#
| Exporter | 端口 | 用途 |
| --- | --- | --- |
| node_exporter | 9100 | 系统 CPU/内存/磁盘/网络 |
| mysqld_exporter | 9104 | MySQL/MariaDB |
| redis_exporter | 9121 | Redis |
| postgres_exporter | 9187 | PostgreSQL |
| nginx-prometheus-exporter | 9113 | Nginx |
| blackbox_exporter | 9115 | HTTP/TCP/ICMP 探测 |
| process-exporter | 9256 | 进程监控 |
| kafka_exporter | 9308 | Kafka |
| elasticsearch_exporter | 9114 | Elasticsearch |
2.1.6 Grafana 部署与配置#
# Ubuntu/Debian
apt-get install -y software-properties-common
add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | apt-key add -
apt-get update && apt-get install -y grafana
# CentOS/RHEL 7/8
cat > /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
yum install -y grafana
systemctl enable --now grafana-server
# 重置管理员密码
grafana-cli admin reset-admin-password newpassword
# 安装常用插件
grafana-cli plugins install grafana-piechart-panel
grafana-cli plugins install grafana-clock-panel
grafana-cli plugins install vonage-status-panel
systemctl restart grafana-server
2.1.7 重要 Grafana Dashboard ID#
| Dashboard ID | 名称 | 适用场景 |
| --- | --- | --- |
| 1860 | Node Exporter Full | 服务器全量指标 |
| 16098 | Node Exporter / nodes | 新版服务器监控 |
| 7362 | MySQL Overview | MySQL 监控 |
| 763 | Redis Dashboard | Redis 监控 |
| 9628 | PostgreSQL Database | PostgreSQL |
| 12708 | Nginx Overview | Nginx 监控 |
| 11159 | Docker Host & Container | Docker 监控 |
2.1.8 告警规则示例#
# /opt/prometheus/rules/node_alerts.yml
groups:
- name: node_alerts
interval: 30s
rules:
# 主机宕机
- alert: NodeDown
expr: up == 0
for: 2m
labels:
severity: critical
annotations:
summary: "主机 {{ $labels.instance }} 宕机"
description: "主机 {{ $labels.instance }} 已超过 2 分钟不可达"
# CPU 使用率过高
- alert: HighCPUUsage
expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 10m
labels:
severity: warning
annotations:
summary: "CPU 使用率 > 90%: {{ $labels.instance }}"
description: "当前值: {{ $value | humanize }}%"
# 内存使用率
- alert: HighMemoryUsage
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "内存使用率 > 90%: {{ $labels.instance }}"
# 磁盘使用率
- alert: HighDiskUsage
expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "磁盘使用率 > 85%: {{ $labels.instance }} /"
# 磁盘预计填满
- alert: DiskWillFillIn4Hours
expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
for: 10m
labels:
severity: critical
annotations:
summary: "磁盘预计 4 小时内填满: {{ $labels.instance }}"
# 系统负载过高
- alert: HighSystemLoad
expr: node_load15 / count without(cpu, mode)(node_cpu_seconds_total{mode="idle"}) > 1.5
for: 10m
labels:
severity: warning
annotations:
summary: "负载过高 load15/cores > 1.5: {{ $labels.instance }}"
# 磁盘 IO 饱和
- alert: DiskIOSaturation
expr: rate(node_disk_io_time_seconds_total{device=~"sd[a-z]+"}[5m]) > 0.9
for: 10m
labels:
severity: warning
annotations:
summary: "磁盘 IO 饱和: {{ $labels.instance }} {{ $labels.device }}"
# 网络错误
- alert: NetworkErrors
expr: rate(node_network_receive_errors_total[5m]) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "网络接口错误: {{ $labels.instance }} {{ $labels.device }}"
# 内存即将耗尽
- alert: OutOfMemorySoon
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "内存即将耗尽 (< 5%): {{ $labels.instance }}"
# inode 使用率
- alert: HighInodeUsage
expr: (1 - node_filesystem_files_free{mountpoint="/"} / node_filesystem_files{mountpoint="/"}) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "inode 使用率 > 85%: {{ $labels.instance }}"
2.1.9 AlertManager 配置#
# /opt/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
# 邮件配置
smtp_smarthost: 'smtp.example.com:587'
smtp_from: '[email protected]'
smtp_auth_username: '[email protected]'
smtp_auth_password: 'password'
smtp_require_tls: true
# 告警路由
route:
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical'
continue: true
- match:
severity: warning
receiver: 'warning'
# 接收器
receivers:
- name: 'default'
email_configs:
- to: '[email protected]'
- name: 'critical'
email_configs:
- to: '[email protected]'
webhook_configs:
# 钉钉
- url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
send_resolved: true
# 企业微信
wechat_configs:
- corp_id: 'wwxxx'
to_party: '1'
agent_id: '1000001'
api_secret: 'xxx'
send_resolved: true
- name: 'warning'
email_configs:
- to: '[email protected]'
# 抑制规则 (避免告警风暴)
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['instance']
2.2 PromQL 常用查询#
# === CPU ===
# CPU 使用率 (%)
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# 各 CPU 模式占比
avg by(cpu,mode)(rate(node_cpu_seconds_total[5m])) * 100
# === 内存 ===
# 可用内存百分比
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# 内存使用率
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# === 磁盘 ===
# 磁盘使用率
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
# 磁盘读写速率 MB/s
rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
rate(node_disk_written_bytes_total[5m]) / 1024 / 1024
# 磁盘 IOPS
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])
# disk_io_time (IO 繁忙度)
rate(node_disk_io_time_seconds_total[5m]) * 100
# === 网络 ===
# 网络流量 bytes/sec
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
# 网络连接状态
node_netstat_Tcp_CurrEstab
# TCP 重传率
rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) * 100
# === 进程 ===
# 打开文件描述符数量
process_open_fds
# === 系统 ===
# 运行时间 (秒)
node_boot_time_seconds
# 预测磁盘空间
predict_linear(node_filesystem_avail_bytes[1h], 24*3600) < 0
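上面 TCP 重传率查询使用的 OutSegs/RetransSegs 计数器来自内核的 /proc/net/snmp, 可以用 shell 直接核对口径(示例脚本, 按表头动态定位字段; 输出为开机以来的累计比例, 而 PromQL 的 rate() 取的是 5 分钟窗口):

```bash
#!/bin/sh
# 从 /proc/net/snmp 计算累计 TCP 重传比例 (%)
awk '/^Tcp:/ {
    if (!seen) {                         # 第一行 Tcp: 是字段名
        for (i = 1; i <= NF; i++) col[$i] = i
        seen = 1
    } else {                             # 第二行 Tcp: 是对应数值
        out = $col["OutSegs"]; re = $col["RetransSegs"]
        printf "%.4f\n", (out > 0 ? re / out * 100 : 0)
    }
}' /proc/net/snmp
```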
2.3 Grafana 告警 (内置)#
当不需要 AlertManager 时,Grafana 内置告警可直接使用:
# grafana.ini 配置
[smtp]
enabled = true
host = smtp.example.com:587
user = [email protected]
password = password
from_address = [email protected]
[alerting]
enabled = true
execute_alerts = true
告警通知渠道支持:Email, Slack, PagerDuty, Webhook, 钉钉(通过插件), 企业微信(通过插件)。
2.4 轻量监控方案#
2.4.1 Netdata (单机实时监控)#
# 一键安装
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
# 仅本机访问
sed -i 's/bind to = \*/bind to = 127.0.0.1/g' /etc/netdata/netdata.conf
systemctl restart netdata
# 访问 http://localhost:19999
# 特点: 零配置、极低资源占用、1秒粒度、数千指标自动采集
2.4.2 自定义脚本监控 (最简方案)#
#!/bin/bash
# /opt/scripts/monitor.sh
# crontab: */5 * * * * /opt/scripts/monitor.sh
HOSTNAME=$(hostname)
ALERT_WEBHOOK="https://hooks.slack.com/services/xxx"
# CPU: 用 100 - idle 计算总使用率 (原写法的 $2 只含 user 态; $8 的字段位置随 top 版本可能不同)
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print 100 - $8}')
if (( $(echo "$CPU_USAGE > 90" | bc -l) )); then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[${HOSTNAME}] CPU 使用率: ${CPU_USAGE}%\"}" \
$ALERT_WEBHOOK
fi
# 内存
MEM_AVAIL=$(free -m | awk 'NR==2{print $7}')
if [ "$MEM_AVAIL" -lt 512 ]; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[${HOSTNAME}] 可用内存不足: ${MEM_AVAIL}MB\"}" \
$ALERT_WEBHOOK
fi
# 磁盘
DISK_USAGE=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[${HOSTNAME}] 磁盘使用率: ${DISK_USAGE}%\"}" \
$ALERT_WEBHOOK
fi
# 关键进程检查
for proc in nginx mysqld sshd; do
if ! pgrep -x "$proc" > /dev/null; then
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[${HOSTNAME}] 进程 $proc 未运行!\"}" \
$ALERT_WEBHOOK
fi
done
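上面脚本的浮点比较依赖 bc, 而精简安装的系统常常没有 bc; 下面是用 awk 实现的等价写法(示例, 可直接替换脚本中 `bc -l` 的那一行判断):

```bash
#!/bin/sh
# awk 浮点比较: a > b 时退出码为 0 (即 shell 的"真")
float_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a > b) }'
}

CPU_USAGE="93.5"
if float_gt "$CPU_USAGE" 90; then
    echo "CPU 超过阈值: ${CPU_USAGE}%"
fi
```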
第三章:日志管理#
3.1 rsyslog 配置#
3.1.1 基础架构#
应用程序 → syslog() → rsyslog → /var/log/messages
→ /var/log/secure
→ 远程 syslog 服务器
→ 管道/程序
3.1.2 rsyslog.conf 配置#
# /etc/rsyslog.conf
# === 模块加载 ===
module(load="imuxsock") # 本地 socket
module(load="imklog") # 内核日志
module(load="imtcp") # TCP 接收
module(load="imudp") # UDP 接收
module(load="impstats"
interval="300"
severity="7"
log.file="/var/log/rsyslog-stats.log"
Ruleset="stats") # 性能统计
# === 全局配置 ===
global(
workDirectory="/var/lib/rsyslog"
maxMessageSize="64k"
defaultTemplate="RSYSLOG_TraditionalFileFormat"
privDropToUser="syslog"
privDropToGroup="syslog"
)
# === 日志格式模板 ===
template(name="RemoteLogs" type="string"
string="%TIMESTAMP% %HOSTNAME% %syslogtag% %msg%\n")
template(name="JsonFormat" type="list") {
constant(value="{")
constant(value="\"timestamp\":\"") property(name="timereported" dateFormat="rfc3339")
constant(value="\",\"host\":\"") property(name="hostname")
constant(value="\",\"severity\":\"") property(name="syslogseverity")
constant(value="\",\"facility\":\"") property(name="syslogfacility")
constant(value="\",\"tag\":\"") property(name="syslogtag" format="json")
constant(value="\",\"message\":\"") property(name="msg" format="json")
constant(value="\"}\n")
}
# === 日志规则 ===
# 认证日志
auth,authpriv.* /var/log/secure
# 系统日志
*.info;mail.none;authpriv.none /var/log/messages
# Cron 日志
cron.* /var/log/cron
# 内核日志
kern.* /var/log/kern.log
# 邮件日志
mail.* /var/log/maillog
# 仅丢弃 debug 级别 (注意: *.debug 表示 debug 及以上全部级别, 会丢弃所有日志)
*.=debug stop
# 紧急日志发送给所有登录用户
*.emerg :omusrmsg:*
# === 转发到远程 ===
# TCP 转发 (@@ 为 TCP, @ 为 UDP; 以下三种方式按需选一, 避免重复转发)
*.* @@192.168.1.100:514
# UDP 转发
# *.* @192.168.1.100:514
# 条件转发 (仅 err 及以上级别)
# *.err @@192.168.1.100:514
# === 作为日志服务器接收 ===
input(type="imtcp" port="514" Ruleset="remote")
input(type="imudp" port="514" Ruleset="remote")
ruleset(name="remote") {
# 按主机名分文件
$template RemotePath,"/data/logs/%HOSTNAME%/%$YEAR%-%$MONTH%-%$DAY%.log"
action(type="omfile" dynaFile="RemotePath")
# 也可输出为 JSON
# action(type="omfile" dynaFile="RemotePath" template="JsonFormat")
}
3.1.3 应用日志配置示例#
# Nginx rsyslog 配置
# /etc/rsyslog.d/nginx.conf
$ModLoad imfile
$InputFileName /var/log/nginx/access.log
$InputFileTag nginx-access:
$InputFileStateFile stat-nginx-access
$InputFileSeverity info
$InputFileFacility local7
$InputRunFileMonitor
$InputFileName /var/log/nginx/error.log
$InputFileTag nginx-error:
$InputFileStateFile stat-nginx-error
$InputFileSeverity error
$InputFileFacility local7
$InputRunFileMonitor
local7.* @@192.168.1.100:514
3.2 journald (systemd 日志)#
# === 查看日志 ===
journalctl # 所有日志
journalctl -n 100 # 最近 100 行
journalctl -f # tail -f 模式
journalctl -k # 内核日志
journalctl -u nginx # 指定服务
journalctl -u nginx --since today # 今天的日志
journalctl -u nginx --since "2024-01-01" --until "2024-01-02"
journalctl -p err # 仅错误级别以上
journalctl -p emerg..err # emerg 到 err
journalctl _PID=1234 # 按 PID
journalctl _UID=0 # 按 UID (root)
journalctl -o json-pretty # JSON 输出
journalctl --disk-usage # 日志占用空间
journalctl -u sshd | grep "Failed" # 配合 grep
# === journald 配置 ===
# /etc/systemd/journald.conf
[Journal]
Storage=persistent # 持久化到磁盘
Compress=yes # 压缩
Seal=yes # 防篡改密封
SystemMaxUse=4G # 最多使用 4G
SystemMaxFileSize=100M # 单文件最大
MaxRetentionSec=2week # 最多保留 2 周
RuntimeMaxUse=1G # /run 下最大使用
ForwardToSyslog=no # 是否转发到 syslog
ForwardToConsole=no
systemctl restart systemd-journald
3.3 logrotate 日志轮转#
# /etc/logrotate.conf (全局)
weekly
rotate 12
create
dateext
compress
include /etc/logrotate.d
# /etc/logrotate.d/nginx (应用级)
/var/log/nginx/*.log {
daily # 每天轮转
missingok # 日志不存在不报错
rotate 30 # 保留 30 天
compress # 压缩旧日志
delaycompress # 延迟一个周期压缩
notifempty # 空文件不轮转
create 640 nginx adm # 创建新文件权限
sharedscripts # 轮转完后执行一次脚本
postrotate
[ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
endscript
dateext # 日期后缀
dateformat -%Y%m%d
maxsize 500M # 超过 500M 强制轮转
}
# /etc/logrotate.d/syslog
/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
{
weekly
rotate 12
compress
dateext
missingok
sharedscripts
postrotate
/bin/kill -HUP $(cat /var/run/syslogd.pid 2>/dev/null) 2>/dev/null || true
endscript
}
# 手动执行轮转
logrotate -f /etc/logrotate.conf
logrotate -d /etc/logrotate.d/nginx # 调试模式 (不实际轮转)
# Crontab 中每日执行
# 0 0 * * * /usr/sbin/logrotate /etc/logrotate.conf
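maxsize 只在 logrotate 实际被调起时生效, 两次 cron 调度之间日志仍可能暴涨; 可以用 find -size 主动巡检超限日志(演示脚本, 阈值 500M 与上面配置同口径, 用稀疏文件验证筛选语义):

```bash
#!/bin/sh
# 生产巡检: find /var/log -type f -name "*.log" -size +500M -exec ls -lh {} \;
# 下面用临时稀疏文件验证 -size +500M 的筛选行为
tmp=$(mktemp -d)
truncate -s 600M "$tmp/huge.log"     # 稀疏文件, 不占用真实磁盘
truncate -s 100M "$tmp/small.log"
find "$tmp" -type f -size +500M      # 只应列出 huge.log
rm -rf "$tmp"
```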
3.4 ELK Stack (Elasticsearch + Logstash + Kibana)#
3.4.1 架构#
应用日志 → Filebeat → Logstash → Elasticsearch → Kibana
↕
数据节点集群 (3+)
3.4.2 Filebeat 配置#
# /etc/filebeat/filebeat.yml
filebeat.inputs:
- type: log
enabled: true
paths:
- /var/log/nginx/access.log
fields:
app: nginx
type: access
fields_under_root: true
json.keys_under_root: true
json.add_error_key: true
- type: log
enabled: true
paths:
- /var/log/nginx/error.log
fields:
app: nginx
type: error
fields_under_root: true
multiline.pattern: '^\d{4}/\d{2}/\d{2}'
multiline.negate: true
multiline.match: after
- type: log
enabled: true
paths:
- /var/log/messages
fields:
app: system
type: syslog
# === 输出到 Logstash ===
output.logstash:
hosts: ["192.168.1.101:5044"]
loadbalance: true
compression_level: 3
# === 或直接输出到 Elasticsearch ===
# output.elasticsearch:
# hosts: ["192.168.1.101:9200", "192.168.1.102:9200"]
# index: "filebeat-%{[agent.version]}-%{+yyyy.MM.dd}"
# === 处理器 ===
processors:
- add_host_metadata: ~
- add_cloud_metadata: ~
- drop_fields:
fields: ["agent.ephemeral_id", "agent.id"]
# === 日志本身 ===
logging.level: info
logging.to_files: true
logging.files:
path: /var/log/filebeat
name: filebeat.log
keepfiles: 7
3.4.3 Logstash 配置#
# /etc/logstash/conf.d/pipeline.conf
input {
beats {
port => 5044
client_inactivity_timeout => 3600
}
}
filter {
if [type] == "access" {
grok {
match => {
"message" => '%{IPORHOST:client_ip} - %{DATA:remote_user} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:body_bytes_sent} "%{DATA:http_referer}" "%{DATA:http_user_agent}"'
}
}
date {
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
target => "@timestamp"
}
geoip {
source => "client_ip"
}
useragent {
source => "http_user_agent"
target => "user_agent"
}
}
if [type] == "syslog" {
grok {
match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
}
}
}
output {
elasticsearch {
hosts => ["http://192.168.1.101:9200", "http://192.168.1.102:9200"]
index => "logstash-%{[app]}-%{+YYYY.MM.dd}"
manage_template => false
}
# 如果 ES 不可用,存磁盘队列
# dead_letter_queue {
# path => "/data/logstash/dead_letter_queue"
# max_bytes => "1gb"
# }
}
3.5 Grafana Loki (轻量日志方案)#
Loki 是 Grafana 生态的日志方案,类似 Prometheus 但用于日志:
# promtail (日志采集器) 配置
# /etc/promtail/config.yml
server:
http_listen_port: 9080
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets: [localhost]
labels:
job: varlogs
host: ${HOSTNAME}
__path__: /var/log/*.log
- job_name: nginx
static_configs:
- targets: [localhost]
labels:
job: nginx
host: ${HOSTNAME}
__path__: /var/log/nginx/*.log
pipeline_stages:
- match:
selector: '{job="nginx"} |= "/var/log/nginx/error.log"'
stages:
- regex:
expression: '^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\]'
- labels:
level:
# loki 配置 (单机)
# /etc/loki/loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
chunk_idle_period: 30m
max_chunk_age: 1h
chunk_target_size: 1536000
chunk_retain_period: 30s
schema_config:
configs:
- from: 2024-01-01
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /data/loki/index
cache_location: /data/loki/cache
filesystem:
directory: /data/loki/chunks
limits_config:
enforce_metric_name: false
max_entries_limit_per_query: 5000
retention_period: 30d
compactor:
working_directory: /data/loki/compactor
shared_store: filesystem
第四章:备份与灾难恢复#
4.1 备份策略 3-2-1 原则#
3 份数据副本 (生产 + 2 备份)
2 种不同存储介质 (本地磁盘 + 磁带/云存储)
1 份异地备份 (不同数据中心/区域)
4.2 rsync 备份方案#
# === 基础用法 ===
# 本地同步
rsync -avz --delete /data/ /backup/
# 远程同步 (SSH)
rsync -avz -e "ssh -p 22" /data/ [email protected]:/backup/
# 远程拉取
rsync -avz [email protected]:/data/ /backup/
# === 生产级备份脚本 ===
#!/bin/bash
# /opt/scripts/backup.sh
BACKUP_SRC="/data/app"
BACKUP_DST="/backup"
REMOTE_HOST="[email protected]"
REMOTE_PATH="/backup/$(hostname)"
LOG_FILE="/var/log/backup.log"
EXCLUDE_FILE="/opt/scripts/backup_exclude.txt"
LOCK_FILE="/var/run/backup.lock"
log() {
echo "[$(date '+%F %T')] $*" | tee -a "$LOG_FILE"
}
# 防止并发执行
exec 200>"$LOCK_FILE"
flock -n 200 || { log "备份已在运行,退出"; exit 1; }
log "=== 开始备份 ==="
# 1. 本地每日快照 (保留 7 天)
rsync -avz \
--delete \
--exclude-from="$EXCLUDE_FILE" \
--link-dest="$BACKUP_DST/latest" \
"$BACKUP_SRC/" \
"$BACKUP_DST/$(date +%Y%m%d)/"
# 2. 更新 latest 符号链接
rm -f "$BACKUP_DST/latest"
ln -s "$BACKUP_DST/$(date +%Y%m%d)" "$BACKUP_DST/latest"
# 3. 远程同步
rsync -avz --delete \
-e "ssh -p 22 -i /root/.ssh/backup_key" \
"$BACKUP_DST/" "$REMOTE_HOST:$REMOTE_PATH/"
# 4. 清理旧备份 (超过 30 天的远程备份; -mindepth 1 防止误删 $REMOTE_PATH 自身)
ssh -i /root/.ssh/backup_key "$REMOTE_HOST" \
"find $REMOTE_PATH -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +"
log "=== 备份完成 ==="
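脚本第 4 步按 mtime 删除旧备份的逻辑, 建议先用 `-print` 预演再换成 `-exec rm -rf`; 下面用 touch 伪造旧目录演示筛选结果(示例, 依赖 GNU touch 的 -d 语法):

```bash
#!/bin/sh
# 演示: -mtime +30 只命中 30 天前的目录, -mindepth 1 保护父目录本身
tmp=$(mktemp -d)
mkdir "$tmp/20240101" "$tmp/$(date +%Y%m%d)"
touch -d "40 days ago" "$tmp/20240101"          # 伪造 40 天前的修改时间
# 确认输出符合预期后, 再把 -print 换成 -exec rm -rf {} +
find "$tmp" -mindepth 1 -maxdepth 1 -type d -mtime +30 -print
rm -rf "$tmp"
```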
4.3 restic (现代 Go 备份工具)#
# === 安装 ===
wget https://github.com/restic/restic/releases/download/v0.16.4/restic_0.16.4_linux_amd64.bz2
bunzip2 restic_0.16.4_linux_amd64.bz2
mv restic_0.16.4_linux_amd64 /usr/local/bin/restic
chmod +x /usr/local/bin/restic
# === 初始化仓库 ===
export RESTIC_REPOSITORY=/backup/restic
export RESTIC_PASSWORD=your_strong_password
restic init
# === 远程仓库 ===
export RESTIC_REPOSITORY=sftp:backup@storage:/backup/restic
export RESTIC_PASSWORD=your_strong_password
restic init
# S3 兼容 (MinIO / AWS S3)
export AWS_ACCESS_KEY_ID=xxx
export AWS_SECRET_ACCESS_KEY=yyy
export RESTIC_REPOSITORY=s3:s3.amazonaws.com/bucket-name/restic
restic init
# === 备份 ===
# 注意: 备份路径含 /var/log 时不要再排除 "*.log", 否则日志实际不会进入快照
restic backup /data /etc /var/log --exclude "*.tmp"
# === 快照管理 ===
restic snapshots # 列出快照
restic diff <snapshot1> <snapshot2> # 比较快照差异
restic stats # 仓库统计
# === 恢复 ===
restic restore latest --target /restore/path/ # 恢复最新
restic restore <snapshot_id> --target /restore/ # 恢复指定快照
restic restore latest --target /restore/ --include "/data/app/*"
# === 清理 ===
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --keep-yearly 2
restic prune # 删除未引用的数据块
restic check # 验证仓库完整性
# === 自动备份脚本 ===
#!/bin/bash
export RESTIC_REPOSITORY=/backup/restic
export RESTIC_PASSWORD_FILE=/root/.restic-password
BACKUP_SRC="/data /etc /var/log"
restic backup $BACKUP_SRC \
--exclude "*.tmp" \
--exclude "*.bak" \
--tag "$(date +%Y%m%d)" \
--host "$(hostname)"
restic forget \
--keep-daily 7 \
--keep-weekly 4 \
--keep-monthly 12 \
--keep-yearly 2 \
--prune
restic check --read-data-subset=2%
4.4 数据库备份#
MySQL#
#!/bin/bash
# MySQL 全量 + binlog 备份
DB_USER="backup"
DB_PASS="password"
BACKUP_DIR="/backup/mysql"
RETENTION_DAYS=7
# 全量备份 (mysqldump)
mysqldump -u$DB_USER -p$DB_PASS --all-databases \
--single-transaction \
--routines --triggers --events \
--master-data=2 \
--set-gtid-purged=OFF \
| gzip > "$BACKUP_DIR/full_$(date +%Y%m%d_%H%M).sql.gz"
# 或使用 xtrabackup (物理备份,大库推荐)
xtrabackup --backup \
--user=$DB_USER --password=$DB_PASS \
--target-dir="$BACKUP_DIR/xtra_$(date +%Y%m%d_%H%M)" \
--compress --compress-threads=4
# 清理旧备份
find "$BACKUP_DIR" -mtime +$RETENTION_DAYS -delete
PostgreSQL#
#!/bin/bash
# PostgreSQL 备份
BACKUP_DIR="/backup/postgres"
export PGPASSWORD="password"
# 逻辑备份
pg_dumpall -U postgres -h localhost | gzip > "$BACKUP_DIR/pg_all_$(date +%Y%m%d).sql.gz"
# 单库并行备份 (-j 并行仅支持 -Fd 目录格式, -Fc 自定义格式不支持并行导出)
pg_dump -U postgres -h localhost -Fd -j 4 -f "$BACKUP_DIR/mydb_$(date +%Y%m%d).dir" mydb
# WAL 归档配置 (postgresql.conf)
# wal_level = replica
# archive_mode = on
# archive_command = 'test ! -f /backup/pg_wal/%f && cp %p /backup/pg_wal/%f'
4.5 系统级备份#
# === 分区镜像备份 (dd) ===
dd if=/dev/sda1 of=/backup/sda1_$(date +%Y%m%d).img bs=4M status=progress
# 压缩
dd if=/dev/sda1 bs=4M | gzip > /backup/sda1.img.gz
# 恢复
gunzip -c /backup/sda1.img.gz | dd of=/dev/sda1 bs=4M status=progress
# === tar 系统备份 ===
tar -cvpzf /backup/system_$(date +%Y%m%d).tar.gz \
--exclude=/proc \
--exclude=/tmp \
--exclude=/sys \
--exclude=/dev \
--exclude=/run \
--exclude=/mnt \
--exclude=/media \
--exclude=/backup \
--exclude=/lost+found \
/
# 恢复
tar -xvpzf /backup/system_20240101.tar.gz -C /
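归档生成后应先用 `tar -t` 验证可读、确认排除规则生效, 再作为恢复依据(最小演示, 目录与文件名均为临时示例):

```bash
#!/bin/sh
# 演示: --exclude 生效且归档完整可列出
src=$(mktemp -d)
mkdir -p "$src/etc" "$src/tmp"
echo conf > "$src/etc/app.conf"
echo junk > "$src/tmp/cache"

tar -czf "$src.tar.gz" --exclude='./tmp' -C "$src" .

tar -tzf "$src.tar.gz"                            # 损坏的归档会在这里报错
tar -tzf "$src.tar.gz" | grep -q 'etc/app.conf' && echo "etc 已打包"
tar -tzf "$src.tar.gz" | grep -q 'tmp/cache' || echo "tmp 已排除"
rm -rf "$src" "$src.tar.gz"
```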
4.6 灾难恢复演练#
#!/bin/bash
# DR 演练检查清单脚本
echo "=== 灾难恢复检查清单 ==="
echo "日期: $(date)"
# 1. 验证备份完整性
echo -e "\n[1/6] 验证备份完整性..."
BACKUP_DIR="/backup"
LATEST=$(ls -t "$BACKUP_DIR" | head -1)
if [ -n "$LATEST" ]; then
echo " ✓ 最新备份: $LATEST"
else
echo " ✗ 未找到备份!"
fi
# 2. 验证备份可恢复性 (抽检)
echo -e "\n[2/6] 验证备份可恢复性..."
if restic check --read-data-subset=1%; then
echo " ✓ restic 仓库完整性验证通过"
fi
# 3. 验证数据库备份
echo -e "\n[3/6] 验证数据库备份..."
LATEST_SQL=$(ls -t /backup/mysql/*.sql.gz 2>/dev/null | head -1)
if [ -n "$LATEST_SQL" ] && gzip -t "$LATEST_SQL" 2>/dev/null; then
echo " ✓ MySQL 备份文件完整性验证通过"
else
echo " ✗ MySQL 备份缺失或损坏!"
fi
# 4. 验证远程同步
echo -e "\n[4/6] 验证远程备份..."
if rsync -azn --delete /backup/ [email protected]:/backup/; then
echo " ✓ 远程连接正常"
fi
# 5. 验证恢复文档
echo -e "\n[5/6] 验证恢复文档..."
if [ -f /opt/docs/recovery_procedure.md ]; then
echo " ✓ 恢复文档存在"
else
echo " ✗ 缺少恢复文档!"
fi
# 6. 验证恢复时间
echo -e "\n[6/6] 恢复时间估算..."
echo " 上次全量恢复耗时: ~30分钟 (记录于 2024-01-01)"
echo " RTO 目标: 2小时"
echo " RPO 目标: < 1小时 (binlog 实时同步)"
echo -e "\n=== 检查完成 ==="
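清单第 3 步的 `gzip -t` 是只读 CRC 校验, 不会解压出文件; 下面构造一个完好和一个截断的 .gz 演示其行为(临时文件为示例):

```bash
#!/bin/sh
# gzip -t: 校验通过退出码为 0, 损坏/截断则非 0
echo "backup data" | gzip > /tmp/good.sql.gz
head -c 12 /tmp/good.sql.gz > /tmp/bad.sql.gz    # 截断模拟传输中断

gzip -t /tmp/good.sql.gz && echo "good: 校验通过"
gzip -t /tmp/bad.sql.gz 2>/dev/null || echo "bad: 校验失败"
rm -f /tmp/good.sql.gz /tmp/bad.sql.gz
```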
第五章:自动化运维#
5.1 Ansible 基础#
5.1.1 核心概念#
┌─────────────┐
│ 控制节点 │ (Ansible 安装在此)
│ playbook │ 无需 agent,通过 SSH 管理
└──────┬──────┘
│ SSH
┌────┼────┐
▼ ▼ ▼
┌───┐┌───┐┌───┐
│ N1││ N2││ N3│ 被管理节点 (只需 Python)
└───┘└───┘└───┘
5.1.2 安装与配置#
# 安装
yum install -y ansible # CentOS/RHEL
apt-get install -y ansible # Ubuntu/Debian
pip install ansible # pip (最新版)
# 验证
ansible --version
# === ansible.cfg 配置 ===
# /etc/ansible/ansible.cfg 或 ./ansible.cfg
[defaults]
inventory = ./hosts
host_key_checking = False
remote_user = root
private_key_file = /root/.ssh/id_rsa
forks = 20
timeout = 30
log_path = /var/log/ansible.log
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_cache
fact_caching_timeout = 3600
retry_files_enabled = False
callback_whitelist = timer, profile_tasks
stdout_callback = yaml
[privilege_escalation]
become = True
become_method = sudo
become_user = root
[ssh_connection]
pipelining = True
control_path = /tmp/ansible-%%h-%%p-%%r
5.1.3 Inventory 主机清单#
# hosts (静态清单)
[webservers]
web01 ansible_host=192.168.1.11
web02 ansible_host=192.168.1.12
web03 ansible_host=192.168.1.13
[dbservers]
db01 ansible_host=192.168.1.21 ansible_user=dbadmin
db02 ansible_host=192.168.1.22
[appservers]
app[01:05].example.com # 范围: app01 ~ app05
[production:children] # 分组嵌套
webservers
dbservers
[production:vars] # 组变量
ansible_user=root
ntp_server=ntp.prod.example.com
[all:vars] # 全局变量
ansible_port=22
# hosts.yml (YAML 清单)
all:
hosts:
bastion:
ansible_host: 1.2.3.4
children:
production:
hosts:
web[01:03].example.com:
vars:
env: production
staging:
hosts:
web-stg.example.com:
vars:
env: staging
5.1.4 常用 Ad-Hoc 命令#
# 基本语法: ansible <pattern> -m <module> -a "<arguments>"
# === 信息收集 ===
ansible all -m ping # 存活检测
ansible all -m setup # 收集 facts
ansible all -m setup -a "filter=ansible_memory_mb" # 过滤 facts
ansible all -m shell -a "hostname; uptime" # 执行 shell
# === 文件操作 ===
ansible all -m copy -a "src=/tmp/file dest=/tmp/file" # 拷贝文件
ansible all -m fetch -a "src=/etc/hosts dest=/tmp/" # 拉取文件
ansible all -m file -a "path=/data state=directory mode=0755" # 创建目录
ansible all -m replace -a "path=/etc/nginx/nginx.conf regexp='worker_processes.*' replace='worker_processes 8;'" # 文件内容替换
# === 包管理 ===
ansible all -m yum -a "name=nginx state=latest" # 安装 (RHEL)
ansible all -m apt -a "name=nginx state=latest update_cache=yes" # 安装 (Debian)
# === 服务管理 ===
ansible all -m systemd -a "name=nginx state=restarted" # 重启服务
ansible all -m systemd -a "name=nginx enabled=yes" # 开机启动
# === 用户管理 ===
ansible all -m user -a "name=app password={{ 'mypass' | password_hash('sha512') }} groups=wheel state=present"
# === 内核参数 ===
ansible all -m sysctl -a "name=net.ipv4.tcp_tw_reuse value=1 sysctl_set=yes reload=yes"
# === 计划任务 ===
ansible all -m cron -a "name='log cleanup' hour=2 job='/opt/scripts/cleanup.sh'"
# === 防火墙 ===
ansible all -m firewalld -a "port=80/tcp permanent=yes state=enabled immediate=yes"
5.1.5 Playbook 编写#
# deploy_webapp.yml
---
- name: 部署 Web 应用
hosts: webservers
become: yes
vars:
app_name: myapp
app_port: 8080
app_version: "1.2.3"
nginx_worker_processes: "{{ ansible_processor_vcpus }}"
vars_files:
- vars/secrets.yml # 加密的敏感变量
pre_tasks:
- name: 更新 yum 缓存
yum:
update_cache: yes
name: '*'
state: latest
when: ansible_os_family == "RedHat"
tags: [update]
- name: 检查磁盘空间
shell: df -h /data | awk 'NR==2{print $5}' | sed 's/%//'
register: disk_usage
failed_when: disk_usage.stdout|int > 90
tasks:
- name: 安装基础包
package:
name: "{{ item }}"
state: present
loop:
- nginx
- supervisor
- python3
tags: [packages]
- name: 创建应用用户
user:
name: "{{ app_name }}"
system: yes
shell: /sbin/nologin
create_home: no
tags: [user]
- name: 创建目录结构
file:
path: "{{ item }}"
state: directory
owner: "{{ app_name }}"
mode: '0755'
loop:
- /opt/{{ app_name }}
- /opt/{{ app_name }}/config
- /data/{{ app_name }}
- /var/log/{{ app_name }}
tags: [dirs]
- name: 部署应用文件
copy:
src: files/{{ app_name }}-{{ app_version }}.jar
dest: /opt/{{ app_name }}/{{ app_name }}.jar
owner: "{{ app_name }}"
mode: '0644'
notify: restart app # 触发 handler
tags: [deploy]
- name: 配置 Nginx 反向代理
template:
src: templates/nginx.conf.j2
dest: /etc/nginx/conf.d/{{ app_name }}.conf
validate: nginx -t -c %s # 注意: conf.d 片段单独校验会因缺少主配置上下文而失败, 必要时去掉 validate
notify: reload nginx
tags: [nginx]
- name: 配置 systemd 服务
template:
src: templates/app.service.j2
dest: /etc/systemd/system/{{ app_name }}.service
notify: restart app
tags: [service]
- name: 启动服务
systemd:
name: "{{ app_name }}"
state: started
enabled: yes
tags: [service]
handlers:
- name: reload nginx
systemd:
name: nginx
state: reloaded
- name: restart app
systemd:
name: "{{ app_name }}"
state: restarted
post_tasks:
- name: 健康检查
uri:
url: "http://localhost:{{ app_port }}/health"
status_code: 200
retries: 10
delay: 3
until: result.status == 200
register: result
tags: [verify]
- name: 发送部署通知
slack:
token: "{{ slack_token }}"
msg: "{{ app_name }} v{{ app_version }} 部署完成 [{{ ansible_hostname }}]"
tags: [notify]
5.1.6 Jinja2 模板示例#
{# templates/nginx.conf.j2 #}
upstream {{ app_name }}_backend {
{% for host in groups['appservers'] %}
server {{ hostvars[host]['ansible_host'] }}:{{ app_port }} weight=1 max_fails=3 fail_timeout=30s;
{% endfor %}
keepalive 32;
}
server {
listen 80;
server_name {{ app_name }}.example.com;
access_log /var/log/nginx/{{ app_name }}_access.log;
error_log /var/log/nginx/{{ app_name }}_error.log;
location / {
proxy_pass http://{{ app_name }}_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# 基于环境控制超时
{% if env == 'production' %}
proxy_read_timeout 60s;
{% else %}
proxy_read_timeout 300s;
{% endif %}
}
# 仅生产环境启用 SSL
{% if env == 'production' %}
listen 443 ssl;
ssl_certificate /etc/ssl/certs/{{ app_name }}.crt;
ssl_certificate_key /etc/ssl/private/{{ app_name }}.key;
if ($scheme != "https") {
return 301 https://$host$request_uri;
}
{% endif %}
}
5.1.7 Ansible Vault (加密敏感数据)#
# 创建加密文件
ansible-vault create vars/secrets.yml
# 编辑加密文件
ansible-vault edit vars/secrets.yml
# 加密已有文件
ansible-vault encrypt vars/secrets.yml
# 使用密码文件 (生产环境, 注意收紧文件权限)
echo "my_vault_password" > ~/.vault_pass && chmod 600 ~/.vault_pass
ansible-vault encrypt vars/secrets.yml --vault-password-file ~/.vault_pass
# 运行 playbook 时解密
ansible-playbook deploy.yml --vault-password-file ~/.vault_pass
ansible-playbook deploy.yml --ask-vault-pass # 交互输入
# 多环境密码 (vault-id)
ansible-vault encrypt --vault-id prod@prompt vars/prod/secrets.yml
ansible-playbook deploy.yml --vault-id prod@prompt
5.1.8 Ansible Roles#
roles/
├── nginx/
│ ├── tasks/
│ │ ├── main.yml # 主任务入口
│ │ ├── install.yml # 安装
│ │ └── configure.yml # 配置
│ ├── handlers/
│ │ └── main.yml # Handler
│ ├── templates/
│ │ └── nginx.conf.j2
│ ├── files/
│ │ └── index.html
│ ├── vars/
│ │ └── main.yml # 角色变量 (高优先级)
│ ├── defaults/
│ │ └── main.yml # 低优先级变量 (可覆盖)
│ ├── meta/
│ │ └── main.yml # 依赖声明
│ └── tests/
│ └── test.yml
# playbook 中使用 roles
---
- hosts: webservers
roles:
- role: nginx
nginx_port: 8080 # 覆盖默认变量
tags: [nginx]
- role: app
tags: [app]
5.2 其他自动化工具对比#
| 特性 | Ansible | SaltStack | Puppet | Chef |
| --- | --- | --- | --- | --- |
| 架构 | 无 agent (SSH) | Agent + Master | Agent + Master | Agent + Master |
| 配置语言 | YAML | YAML + Python | 自定义 DSL | Ruby DSL |
| 学习曲线 | 低 | 中 | 高 | 高 |
| 实时性 | 推送模型 | 事件驱动 (快) | 拉取 (默认 30min) | 拉取 (默认 30min) |
| 社区 | 最大 | 中型 | 大型 | 中型 |
| 适用场景 | 通用/中小规模 | 大规模/实时 | 大规模合规 | 大规模/复杂 |
第六章:安全加固#
6.1 系统基础安全#
# === 1. SSH 加固 ===
# /etc/ssh/sshd_config
Port 2222 # 修改默认端口
Protocol 2 # 仅 SSHv2
PermitRootLogin no # 禁止 root 登录
PasswordAuthentication no # 禁用密码认证
PubkeyAuthentication yes # 仅密钥认证
MaxAuthTries 3 # 最大尝试次数
ClientAliveInterval 300 # 客户端心跳
ClientAliveCountMax 2 # 无响应探测上限, 超过即断开
AllowUsers admin@192.168.1.* # 限制用户和来源 IP (用户名/网段按实际替换)
X11Forwarding no # 禁止 X11 转发
MaxSessions 5 # 单连接最大会话数
LoginGraceTime 30 # 认证超时
MaxStartups 10:30:60 # 未认证连接限制
systemctl restart sshd
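重启前可以先用脚本核对关键加固项是否已写入(示意函数,核对清单取自上文节选;正式环境还应先 `sshd -t` 校验语法):

```shell
# check_sshd_hardening - 核对 sshd_config 关键加固项 (示意, 清单可按需扩充)
check_sshd_hardening() {
    local conf="${1:-/etc/ssh/sshd_config}" fail=0 key want got
    while read -r key want; do
        # 取配置文件中最后一次出现的取值 (忽略注释行)
        got=$(awk -v k="$key" 'tolower($1) == tolower(k) {v = $2} END {print v}' "$conf")
        if [ "$got" = "$want" ]; then
            echo "OK   $key = $got"
        else
            echo "FAIL $key = ${got:-未设置} (期望 $want)"
            fail=1
        fi
    done <<'EOF'
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
MaxAuthTries 3
X11Forwarding no
EOF
    return "$fail"
}
# 用法: check_sshd_hardening && systemctl restart sshd
```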
# === 2. 密码策略 ===
# /etc/login.defs
PASS_MAX_DAYS 90 # 密码 90 天过期
PASS_MIN_DAYS 7 # 修改后 7 天内不可再改
PASS_MIN_LEN 12 # 最小 12 位
PASS_WARN_AGE 14 # 过期前 14 天警告
# /etc/security/pwquality.conf
minlen = 12
dcredit = -1 # 至少 1 个数字
ucredit = -1 # 至少 1 个大写
lcredit = -1 # 至少 1 个小写
ocredit = -1 # 至少 1 个特殊字符
minclass = 4 # 至少 4 种字符类别
maxrepeat = 3 # 最多连续重复 3 次
maxclassrepeat = 3 # 同类字符最多连续 3 个
difok = 5 # 新密码与旧密码至少不同 5 个字符
enforce_for_root # root 也适用
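这套 pwquality 规则可以用一个纯 shell 函数做粗略预检(仅示意 minlen 与 minclass 两项,并非 libpwquality 的完整实现):

```shell
# pw_check - 按上述策略粗检密码强度 (示意: 长度 >= 12, 字符类别 >= 4)
pw_check() {
    local pw="$1" classes=0
    [ "${#pw}" -ge 12 ] || { echo "太短: 不足 12 位"; return 1; }
    case "$pw" in *[0-9]*)        classes=$((classes + 1)) ;; esac   # 数字
    case "$pw" in *[a-z]*)        classes=$((classes + 1)) ;; esac   # 小写
    case "$pw" in *[A-Z]*)        classes=$((classes + 1)) ;; esac   # 大写
    case "$pw" in *[^a-zA-Z0-9]*) classes=$((classes + 1)) ;; esac   # 特殊字符
    [ "$classes" -ge 4 ] || { echo "字符类别不足 4 种"; return 1; }
    echo "OK"
}
```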
# === 3. 账户锁定 ===
# /etc/pam.d/sshd
# 连续 5 次失败后锁定 600 秒
auth required pam_tally2.so deny=5 unlock_time=600 onerr=fail audit
# 注: pam_tally2 在 RHEL 8+/Ubuntu 20.04+ 已被 pam_faillock 取代:
# auth required pam_faillock.so preauth deny=5 unlock_time=600
# === 4. 会话超时 ===
echo "TMOUT=600" >> /etc/profile
echo "readonly TMOUT" >> /etc/profile
echo "export TMOUT" >> /etc/profile
# === 5. 历史命令限制 ===
echo 'HISTSIZE=500' >> /etc/profile
echo 'HISTFILESIZE=500' >> /etc/profile
echo "readonly HISTSIZE HISTFILESIZE" >> /etc/profile
echo 'export HISTTIMEFORMAT="%F %T "' >> /etc/profile
# === 6. 限制 su/sudo ===
# 仅 wheel 组可 su
# /etc/pam.d/su
auth required pam_wheel.so use_uid
# sudo 日志审计
# /etc/sudoers
Defaults logfile=/var/log/sudo.log
Defaults log_input,log_output # 记录输入输出 (需 sudo 1.9+)
Defaults requiretty # 必须有 tty
6.2 防火墙管理#
firewalld (RHEL/CentOS 7+)#
# === 基础操作 ===
systemctl start firewalld
systemctl enable firewalld
firewall-cmd --state
# 查看
firewall-cmd --list-all # 当前区域详情
firewall-cmd --get-default-zone # 默认区域
firewall-cmd --get-active-zones # 活动区域
firewall-cmd --list-services # 已允许服务
firewall-cmd --list-ports # 已允许端口
# 规则管理
firewall-cmd --add-port=8080/tcp --permanent # 永久开放端口
firewall-cmd --add-service=http --permanent # 开放服务
firewall-cmd --add-rich-rule='rule family="ipv4" source address="192.168.1.0/24" port port="22" protocol="tcp" accept' --permanent # 仅允许特定 IP 段访问 SSH
firewall-cmd --remove-port=8080/tcp --permanent # 删除
# 重载
firewall-cmd --reload
# 区域切换
firewall-cmd --set-default-zone=dmz
firewall-cmd --change-interface=ens33 --zone=trusted --permanent
# === 生产级规则示例 ===
# 默认拒绝并只开放必要端口
firewall-cmd --set-default-zone=drop
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="22" protocol="tcp" accept'
firewall-cmd --permanent --add-service=http
firewall-cmd --permanent --add-service=https
firewall-cmd --permanent --add-port=9100/tcp # node_exporter
firewall-cmd --reload
# === 端口转发 (NAT) ===
firewall-cmd --permanent --add-masquerade
firewall-cmd --permanent --add-forward-port=port=80:proto=tcp:toport=8080:toaddr=192.168.1.100
firewall-cmd --reload
iptables (传统/通用)#
# === 默认策略 ===
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT ACCEPT
# === 允许回环 ===
iptables -A INPUT -i lo -j ACCEPT
# === 允许已建立连接 ===
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# === 允许 SSH ===
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT
# === 允许 Web ===
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# === 允许 ICMP (ping) ===
iptables -A INPUT -p icmp --icmp-type echo-request -j ACCEPT
# === 防 DDoS ===
# 限制 SYN 包速率
iptables -A INPUT -p tcp --syn -m limit --limit 10/s --limit-burst 20 -j ACCEPT
iptables -A INPUT -p tcp --syn -j DROP
# 限制单个 IP 并发连接
iptables -A INPUT -p tcp --dport 80 -m connlimit --connlimit-above 50 -j DROP
# === 端口转发 ===
iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.1.100:8080
iptables -t nat -A POSTROUTING -j MASQUERADE
# === 保存规则 ===
iptables-save > /etc/sysconfig/iptables # CentOS 6
iptables-save > /etc/iptables/rules.v4 # Debian/Ubuntu
netfilter-persistent save # iptables-persistent
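若系统没有 iptables-services / iptables-persistent,也可以自建一个最小的 systemd 单元在开机时恢复规则(示意配置,规则文件路径按发行版调整):

```ini
# /etc/systemd/system/iptables-restore.service (示意)
[Unit]
Description=Restore iptables rules
Before=network-pre.target
Wants=network-pre.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/iptables-restore /etc/iptables/rules.v4
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

启用: `systemctl daemon-reload && systemctl enable iptables-restore.service`。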
6.3 SELinux 管理#
# === 状态查看 ===
getenforce # 查看模式
sestatus -v # 详细状态
# === 模式切换 ===
setenforce 0 # 临时切换为 Permissive
setenforce 1 # 启用 Enforcing
# 永久配置
# /etc/selinux/config
# SELINUX=enforcing|permissive|disabled
# === 上下文管理 ===
ls -Z /var/www/html/ # 查看文件上下文
ps -Z # 查看进程上下文
chcon -R -t httpd_sys_content_t /var/www/html/ # 修改上下文
restorecon -Rv /var/www/html/ # 恢复默认上下文
semanage fcontext -a -t httpd_sys_content_t "/web(/.*)?"
restorecon -Rv /web
# === 布尔值管理 ===
getsebool -a # 列出所有布尔值
setsebool -P httpd_can_network_connect on # 允许 Apache 网络连接
setsebool -P httpd_enable_homedirs on # 允许用户目录
# === 端口管理 ===
semanage port -l # 列出所有端口
semanage port -a -t http_port_t -p tcp 8080 # 添加端口到类型
# === 审计日志排错 ===
ausearch -m avc -ts recent # 查看最近的 AVC 拒绝
sealert -a /var/log/audit/audit.log # 分析审计日志
audit2allow -a -M mypol # 从审计日志生成策略模块
semodule -i mypol.pp # 安装自定义模块
# === SELinux 排错流程 ===
# 1. 查看审计日志
ausearch -m avc -ts today | grep denied
# 2. 分析并提供建议
audit2why < /var/log/audit/audit.log
# 3. 临时切换 permissive 排查
setenforce 0
# 4. 测试应用
# 5. 查看生成的 AVC
ausearch -m avc -ts recent
# 6. 创建自定义策略
grep denied /var/log/audit/audit.log | audit2allow -M custom_policy
semodule -i custom_policy.pp
# 7. 恢复 enforcing
setenforce 1
6.4 fail2ban (防暴力破解)#
# 安装
yum install -y fail2ban # CentOS
apt-get install -y fail2ban # Ubuntu
# /etc/fail2ban/jail.local
[DEFAULT]
ignoreip = 127.0.0.1/8 10.0.0.0/8 192.168.0.0/16
bantime = 3600 # 封禁时间 (秒)
findtime = 600 # 统计窗口 (秒)
maxretry = 5 # 最大失败次数
destemail = admin@example.com
action = %(action_mw)s # 封禁 + whois + 邮件
banaction = iptables-multiport
[sshd]
enabled = true
port = ssh,2222
logpath = %(sshd_log)s
maxretry = 3
[nginx-http-auth]
enabled = true
port = http,https
logpath = /var/log/nginx/error.log
maxretry = 5
[nginx-botsearch]
enabled = true
port = http,https
logpath = /var/log/nginx/access.log
maxretry = 3
findtime = 300
[mysqld-auth]
enabled = true
port = 3306
logpath = /var/log/mysql/error.log
maxretry = 5
# 管理命令
fail2ban-client status # 查看状态
fail2ban-client status sshd # 查看 sshd jail
fail2ban-client set sshd unbanip 1.2.3.4 # 手动解封
fail2ban-client set sshd banip 1.2.3.4 # 手动封禁
fail2ban-client reload # 重载配置
# 查看封禁日志
grep "Ban" /var/log/fail2ban.log
6.5 系统审计 (auditd)#
# 安装
yum install -y audit # CentOS
apt-get install -y auditd # Ubuntu
systemctl enable --now auditd
# === 审计规则 ===
# /etc/audit/rules.d/audit.rules
# 监控关键文件
-w /etc/passwd -p wa -k identity_changes
-w /etc/shadow -p wa -k identity_changes
-w /etc/group -p wa -k identity_changes
-w /etc/sudoers -p wa -k sudo_changes
-w /etc/ssh/sshd_config -p wa -k sshd_config
-w /etc/crontab -p wa -k cron_changes
# 监控关键命令执行
-a always,exit -F path=/usr/bin/su -F perm=x -k su_exec
-a always,exit -F path=/usr/bin/sudo -F perm=x -k sudo_exec
# 监控系统调用
-a always,exit -F arch=b64 -S execve -k command_exec
# 监控网络配置修改
-a always,exit -F path=/sbin/ifconfig -F perm=x -k net_config
# 监控时间修改
-a always,exit -F arch=b64 -S adjtimex -S settimeofday -k time_change
-a always,exit -F arch=b64 -S clock_settime -k time_change
# === 审计查询 ===
ausearch -k identity_changes # 按 key 查询
ausearch -f /etc/passwd # 按文件查询
ausearch -p 1234 # 按 PID 查询
ausearch -ua root # 按用户查询
ausearch -ts today # 今天的审计记录
ausearch -m USER_LOGIN # 登录事件
# 生成报告
aureport -l # 登录报告
aureport -k # key 汇总
aureport -f # 文件审计报告
aureport --summary # 摘要
# 搜索失败事件
ausearch -m USER_LOGIN --success no
6.6 安全扫描#
# === Lynis (系统安全审计) ===
# 安装
git clone https://github.com/CISOfy/lynis
cd lynis && ./lynis audit system
# 快速审计
lynis audit system --quick
# === ClamAV (病毒扫描) ===
# 安装
yum install -y clamav clamav-update # CentOS
apt-get install -y clamav # Ubuntu
freshclam # 更新病毒库
clamscan -r /data # 递归扫描
clamscan -r --remove /tmp # 扫描并删除
# 定期扫描 crontab
# 0 2 * * 0 clamscan -r /data --log=/var/log/clamav/scan.log
# === AIDE (文件完整性检查) ===
yum install -y aide
aide --init # 初始化数据库
cp /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz
aide --check # 检查变更
aide --update # 更新基线数据库
# === OpenSCAP (合规检查) ===
yum install -y openscap-scanner scap-security-guide
# 检查系统合规性 (CIS 基线)
oscap xccdf eval \
--profile xccdf_org.ssgproject.content_profile_cis \
--results scan-results.xml \
--report scan-report.html \
/usr/share/xml/scap/ssg/content/ssg-rhel8-ds.xml
第七章:性能调优#
7.1 内核参数调优#
# === 核心内核参数 (/etc/sysctl.d/99-tuning.conf) ===
# ===== 网络调优 =====
# TCP 连接复用 (快速回收 TIME_WAIT)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 0 # 4.12+ 已移除,设为 0
# TIME_WAIT 与端口范围
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_max_tw_buckets = 10000
net.ipv4.ip_local_port_range = 1024 65000
# TCP 缓冲区 (高吞吐场景)
net.core.rmem_max = 134217728 # 128MB
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_mem = 50576 64768 98152
# TCP Fast Open
net.ipv4.tcp_fastopen = 3 # 客户端 + 服务端
# BBR 拥塞控制 (4.9+)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
# 连接队列
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192
net.core.netdev_max_backlog = 10000
# Keepalive
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
# SYN Flood 防护
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_syn_retries = 3
# ===== 文件系统与 IO =====
# VM 参数
vm.swappiness = 1 # 尽量不用 swap (SSD 推荐 1)
vm.dirty_ratio = 30
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.vfs_cache_pressure = 50 # 保留更多 inode/dentry 缓存
vm.min_free_kbytes = 131072 # 128MB 最小空闲内存
# 文件描述符
fs.file-max = 655350
fs.nr_open = 1048576
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 1024
fs.aio-max-nr = 1048576
# ===== 内核调度 =====
kernel.pid_max = 4194303
kernel.threads-max = 256000
kernel.msgmax = 65536
kernel.msgmnb = 65536
# 应用配置
sysctl -p /etc/sysctl.d/99-tuning.conf
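参数下发后建议核对实际生效值(个别参数会因内核版本不支持而未生效)。一个直接读 /proc/sys 的核对示意:

```shell
# sysctl_verify - 从 stdin 读 "参数 期望值" 成对输入, 核对 /proc/sys 实际取值 (示意)
sysctl_verify() {
    local key want path got
    while read -r key want; do
        path="/proc/sys/$(echo "$key" | tr . /)"   # net.core.somaxconn -> net/core/somaxconn
        got=$(cat "$path" 2>/dev/null)
        if [ "$got" = "$want" ]; then
            echo "OK   $key = $got"
        else
            echo "DIFF $key = ${got:-不可读} (期望 $want)"
        fi
    done
}
# 用法: printf 'net.core.somaxconn 65535\n' | sysctl_verify
```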
7.2 资源限制#
# /etc/security/limits.conf
# <domain> <type> <item> <value>
* soft nofile 65535
* hard nofile 65535
* soft nproc 65535
* hard nproc 65535
root soft nofile 65535
root hard nofile 65535
nginx soft nofile 100000
nginx hard nofile 100000
mysql soft nofile 100000
mysql hard nofile 100000
# systemd 服务资源限制
# /etc/systemd/system/myservice.service.d/limits.conf
[Service]
LimitNOFILE=65535
LimitNPROC=65535
LimitCORE=infinity
MemoryLimit=2G
CPUQuota=200%
TasksMax=2048
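limits.conf 只对新登录会话生效,systemd 服务则以 Limit* 配置为准;排查时应核对目标进程实际生效的限制(示意函数):

```shell
# check_limits - 查看进程实际生效的 nofile/nproc 限制 (默认看当前进程)
check_limits() {
    local pid="${1:-self}"
    grep -E "Max open files|Max processes" "/proc/$pid/limits"
}
# 用法: check_limits "$(pgrep -o nginx)"   # 以 nginx 主进程为例 (假设已运行)
```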
7.3 CPU 性能优化#
# === CPU 调度策略 ===
# 查看当前策略
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# 设为 performance (服务器推荐)
# RHEL/CentOS: cpupower (kernel-tools 包)
cpupower frequency-set -g performance
# 或直接写 sysfs
for CPU in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
echo performance > "$CPU"
done
# === CPU 亲和性 (IRQ 绑定) ===
# 查看中断
cat /proc/interrupts
# 将网卡中断绑定到特定 CPU
echo 2 > /proc/irq/89/smp_affinity # 绑定到 CPU1
# 使用 irqbalance (自动平衡,一般开启即可)
systemctl enable --now irqbalance
# NUMA 感知
numactl --hardware # 查看 NUMA 拓扑
numactl --cpunodebind=0 --membind=0 nginx # 绑定到 NUMA node 0
# === CPU 隔离 (实时/低延迟场景) ===
# /etc/default/grub
# GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"
# 然后: grub2-mkconfig -o /boot/grub2/grub.cfg
7.4 内存调优#
# === 查看内存状况 ===
free -h
cat /proc/meminfo
vmstat 1 10
# === 查看进程内存详情 ===
# PSS (比例分摊共享内存)
smem -r -s pss
# 大页内存 (HugePages)
# 适合大内存数据库
echo "vm.nr_hugepages = 1024" >> /etc/sysctl.d/99-hugepages.conf
sysctl -p /etc/sysctl.d/99-hugepages.conf
# 查看大页使用
cat /proc/meminfo | grep Huge
# transparent hugepage (数据库通常建议关闭)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# === OOM Killer 控制 ===
# 保护关键进程
echo -1000 > /proc/$(pgrep sshd)/oom_score_adj # 永不 kill (范围 -1000 ~ 1000)
# /etc/systemd/system/mysqld.service.d/oom.conf
[Service]
OOMScoreAdjust=-800
# === 内存泄漏排查 ===
# 监控进程内存增长
while true; do
ps -eo pid,ppid,cmd,%mem,%cpu,rss --sort=-rss | head -20
sleep 10
done
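上面的循环只看整体排序,更直接的方式是对单个进程采样 VmRSS 并计算增量(纯 /proc 读取,示意):

```shell
# rss_watch - 采样进程 VmRSS, 粗判是否持续增长 (示意)
rss_watch() {
    local pid="$1" n="${2:-5}" interval="${3:-5}" first="" last="" i
    for ((i = 0; i < n; i++)); do
        last=$(awk '/^VmRSS:/{print $2}' "/proc/$pid/status" 2>/dev/null) || return 1
        [ -z "$first" ] && first="$last"
        echo "sample $i: ${last} kB"
        sleep "$interval"
    done
    echo "delta: $((last - first)) kB"   # 持续为正且不收敛 => 疑似泄漏
}
# 用法: rss_watch <PID> [采样次数] [间隔秒]
```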
7.5 磁盘 IO 调优#
# === IO 调度器 ===
# 查看当前调度器
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none
# 设置 (SSD 推荐 none / mq-deadline)
echo none > /sys/block/sda/queue/scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler
# 永久设置 (grub, 仅旧内核支持 elevator= 参数)
# GRUB_CMDLINE_LINUX="elevator=noop"
# 注: 5.x blk-mq 内核已移除 elevator=, 需改用 udev 规则写 queue/scheduler
# === 磁盘队列深度 ===
cat /sys/block/sda/queue/nr_requests
echo 1024 > /sys/block/sda/queue/nr_requests
# === 预读大小 ===
blockdev --getra /dev/sda
blockdev --setra 8192 /dev/sda # 设为 4MB (8192 个扇区)
# === 文件系统挂载选项 ===
# SSD 优化
# /etc/fstab
UUID=xxx /data ext4 defaults,noatime,nodiratime,discard 0 0
# noatime : 不记录访问时间
# nodiratime : 不记录目录访问时间
# discard : 启用 TRIM (或使用 fstrim)
# nobarrier : 关闭写屏障 (有电池的 RAID 卡)
# === fstrim (SSD TRIM) ===
fstrim -v /data # 手动 TRIM
systemctl enable fstrim.timer # 启动定时 TRIM
# === IO 性能测试 ===
fio --name=randwrite --ioengine=libaio --iodepth=32 --rw=randwrite \
--bs=4k --size=2G --numjobs=4 --runtime=60 --group_reporting \
--filename=/data/test --direct=1
fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread \
--bs=4k --size=2G --numjobs=4 --runtime=60 --group_reporting \
--filename=/data/test --direct=1
# === iotop (实时 IO 监控) ===
iotop -o # 仅显示有 IO 的进程
iotop -oP # 进程级别
iotop -b -n 3 # 批处理模式,3 次
7.6 性能分析工具速查#
| 工具 | 用途 | 典型命令 |
| --- | --- | --- |
| top/htop | 进程监控 | `htop -u mysql` |
| vmstat | 内存/IO/CPU | `vmstat 1 10` |
| iostat | 磁盘 IO | `iostat -xz 1` |
| sar | 系统活动报告 | `sar -n DEV 1` |
| mpstat | CPU 统计 | `mpstat -P ALL 1` |
| pidstat | 进程性能 | `pidstat -d 1` |
| perf | 性能采样 | `perf top -g` |
| strace | 系统调用追踪 | `strace -c -p PID` |
| ltrace | 库调用追踪 | `ltrace -p PID` |
| bpftrace | 动态追踪 | `bpftrace -e 'kprobe:vfs_read { @[comm]=count(); }'` |
| dstat | 综合系统资源 | `dstat -tcmdns` |
| nethogs | 进程网络流量 | `nethogs ens33` |
| iperf3 | 网络带宽测试 | `iperf3 -s` / `iperf3 -c host` |
第八章:高可用与负载均衡#
8.1 Keepalived (VRRP 高可用)#
8.1.1 原理#
VIP: 192.168.1.100 (虚拟 IP)
┌─────────────────┐ ┌─────────────────┐
│ Master │ │ Backup │
│ 192.168.1.11 │────▶│ 192.168.1.12 │
│ priority=100 │ VRRP│ priority=90 │
└─────────────────┘ └─────────────────┘
│ │
└───────────┬───────────┘
│
┌─────┴─────┐
│ 后端服务 │
│ 192.168.1.20│
└───────────┘
8.1.2 Keepalived 配置#
yum install -y keepalived # CentOS
apt-get install -y keepalived # Ubuntu
# /etc/keepalived/keepalived.conf (Master)
global_defs {
router_id web_lb_01
# 通知脚本
notification_email {
admin@example.com
}
notification_email_from keepalived@example.com
smtp_server smtp.example.com
smtp_connect_timeout 30
}
# 健康检查脚本
vrrp_script chk_nginx {
script "/usr/bin/killall -0 nginx" # 检查 nginx 进程
interval 2
weight -20
fall 3 # 连续 3 次失败触发切换
rise 2 # 连续 2 次成功恢复
}
vrrp_instance VI_1 {
state MASTER
interface ens33
virtual_router_id 51
priority 100
advert_int 1
nopreempt # 不抢占 (故障恢复后不自动切回; 严格来说需两端 state 均为 BACKUP 才生效)
authentication {
auth_type PASS
auth_pass your_password
}
virtual_ipaddress {
192.168.1.100/24 dev ens33
}
track_script {
chk_nginx # 关联健康检查
}
# 状态切换通知
notify_master "/opt/scripts/notify.sh master"
notify_backup "/opt/scripts/notify.sh backup"
notify_fault "/opt/scripts/notify.sh fault"
}
# /etc/keepalived/keepalived.conf (Backup)
global_defs {
router_id web_lb_02
}
vrrp_script chk_nginx {
script "/usr/bin/killall -0 nginx"
interval 2
weight -20
fall 3
rise 2
}
vrrp_instance VI_1 {
state BACKUP
interface ens33
virtual_router_id 51
priority 90
advert_int 1
authentication {
auth_type PASS
auth_pass your_password
}
virtual_ipaddress {
192.168.1.100/24 dev ens33
}
track_script {
chk_nginx
}
}
# 允许非本地 IP 绑定 (使得 Backup 也能绑定 VIP)
# echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.d/99-keepalived.conf
# sysctl -p /etc/sysctl.d/99-keepalived.conf
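配置里引用的 /opt/scripts/notify.sh 可以是一个很小的状态通知脚本。下面是其核心逻辑的示意(webhook 地址为假设示例):

```shell
# keepalived_notify - notify_master/backup/fault 引用的通知逻辑 (示意)
# 实际部署时存为 /opt/scripts/notify.sh, 脚本体调用: keepalived_notify "$1"
keepalived_notify() {
    local state="${1:-unknown}"
    local msg="[keepalived] ${HOSTNAME:-unknown-host} -> ${state} ($(date '+%F %T'))"
    command -v logger >/dev/null && logger -t keepalived "$msg"   # 写入 syslog
    # 可选: 推送 IM webhook (URL 为假设示例)
    # curl -s -X POST -H 'Content-Type: application/json' \
    #      -d "{\"text\": \"$msg\"}" https://example.com/webhook
    echo "$msg"
}
```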
8.2 HAProxy 负载均衡#
# 安装
yum install -y haproxy # CentOS
apt-get install -y haproxy # Ubuntu
# === 完整配置 ===
# /etc/haproxy/haproxy.cfg
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
user haproxy
group haproxy
daemon
maxconn 50000
spread-checks 5
stats socket /var/run/haproxy.sock mode 600 level admin
stats timeout 2m
tune.ssl.default-dh-param 2048
defaults
log global
mode http
option httplog
option dontlognull
option redispatch
retries 3
timeout connect 5s
timeout client 50s
timeout server 50s
timeout http-request 10s
timeout http-keep-alive 10s
timeout check 5s
maxconn 5000
# === 前端 ===
frontend web_frontend
bind *:80
bind *:443 ssl crt /etc/haproxy/certs/combined.pem alpn h2,http/1.1
# HTTP 重定向到 HTTPS
redirect scheme https if !{ ssl_fc }
# ACL
acl is_api path_beg /api
acl is_admin path_beg /admin
acl is_static path_end .jpg .png .css .js .woff2
acl blocked_ua hdr_sub(User-Agent) -i curl wget
# 按路径路由
use_backend api_backend if is_api
use_backend admin_backend if is_admin
use_backend static_backend if is_static
default_backend web_backend
# 拒绝特定 User-Agent
http-request deny if blocked_ua
# 限速 (每 IP 每秒 100 请求)
stick-table type ip size 1m expire 10s store http_req_rate(10s)
http-request track-sc0 src
http-request deny if { sc_http_req_rate(0) gt 100 }
# === 后端 ===
backend web_backend
balance roundrobin
option httpchk GET /health
http-check expect status 200
default-server inter 3s rise 2 fall 3 maxconn 1000
server web01 192.168.1.11:8080 check weight 100
server web02 192.168.1.12:8080 check weight 100
server web03 192.168.1.13:8080 check weight 100 backup # 备用节点
# Cookie 会话保持
cookie SERVERID insert indirect nocache
# 长连接
option http-keep-alive
backend api_backend
balance leastconn
option httpchk GET /api/health
http-check expect status 200
default-server inter 2s rise 2 fall 2
server api01 192.168.1.11:8081 check
server api02 192.168.1.12:8081 check
backend static_backend
balance uri
option httpchk HEAD /health
server static01 192.168.1.11:8082 check
server static02 192.168.1.12:8082 check
# === TCP 模式 (MySQL 代理) ===
listen mysql_proxy
bind *:3307
mode tcp
balance leastconn
option mysql-check user haproxy_check
server db01 192.168.1.21:3306 check inter 3s
server db02 192.168.1.22:3306 check inter 3s backup
# === 统计页面 ===
listen stats
bind *:9000
mode http
stats enable
stats uri /stats
stats realm HAProxy\ Statistics
stats auth admin:your_password
stats refresh 10s
stats admin if TRUE
负载均衡算法对比#
| 算法 | 适用场景 | 说明 |
| --- | --- | --- |
| roundrobin | 通用 Web | 轮询,权重越大分配越多 |
| leastconn | 长连接 (DB/WebSocket) | 最少连接优先 |
| source | 需要会话保持 | 源 IP 哈希 |
| uri | 静态文件/Cache | URI 哈希 (配合缓存) |
| url_param | 带参数路由 | URL 参数哈希 |
| hdr | HTTP 头路由 | 基于 HTTP Header |
| first | 顺序填充 | 依次使用第一台未满的服务器 |
8.3 LVS (Linux Virtual Server)#
# === DR 模式 (Direct Routing, 性能最高) ===
# Director 配置 (192.168.1.10)
ipvsadm -A -t 192.168.1.100:80 -s wrr
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.11 -g -w 100
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.12 -g -w 100
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.13 -g -w 80
ipvsadm -Ln # 查看规则
ipvsadm -Sn # 保存规则
# Real Server 配置 (每台)
ifconfig lo:0 192.168.1.100 netmask 255.255.255.255 up
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
# === 调度算法 ===
# rr : 轮询
# wrr : 加权轮询
# lc : 最少连接
# wlc : 加权最少连接 (默认)
# lblc : 基于局部性最少连接
# dh : 目标哈希
# sh : 源地址哈希
8.4 高可用方案对比#
| 方案 | 层次 | 性能 | 复杂度 | 适用场景 |
| --- | --- | --- | --- | --- |
| Keepalived + Nginx | L3/L7 | 高 | 中 | Web 服务 |
| Keepalived + HAProxy | L3/L7 | 高 | 中 | 通用 4/7 层 |
| LVS + Keepalived | L4 | 极高 | 高 | 大规模流量入口 |
| Nginx + Nginx | L7 | 中 | 低 | 中小 Web |
| 云 LB (SLB/ELB) | L4/L7 | 极高 | 低 | 云环境 |
| DNS 轮询 | DNS | 低 | 极低 | 简单分发 |
第九章:网络诊断与排错#
9.1 网络诊断方法论#
应用层 → 检查服务状态、端口监听、应用日志
传输层 → 检查端口连通性、防火墙规则、连接状态
网络层 → 检查路由、IP 配置、ICMP 可达性
链路层 → 检查 ARP、网卡状态、交换机端口
物理层 → 检查网线、光模块、网卡灯
排错顺序 (自底向上)#
- 物理链路 (ethtool, ip link)
- IP 配置 (ip addr, ip route)
- 网关可达性 (ping 网关, traceroute)
- DNS 解析 (dig, nslookup)
- 端口连通性 (telnet, nc, nmap)
- 服务状态 (ss, netstat, 应用日志)
9.2 tcpdump 抓包分析#
# === 基础抓包 ===
tcpdump -i any -nn # 所有接口,不解析主机名和端口
tcpdump -i ens33 -nn host 192.168.1.100 # 过滤主机
tcpdump -i ens33 -nn port 80 # 过滤端口
tcpdump -i ens33 -nn src 192.168.1.100 # 源地址
tcpdump -i ens33 -nn dst port 443 # 目标端口
tcpdump -i ens33 -nn tcp # 仅 TCP
# === 组合过滤 ===
tcpdump -i ens33 -nn \
'(host 192.168.1.100 and port 80) or (host 192.168.1.200 and port 443)'
# === 实战场景 ===
# 抓取 HTTP 请求
tcpdump -i ens33 -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'
# 抓取特定 HTTP 方法
tcpdump -i ens33 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420' # GET
tcpdump -i ens33 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504f5354' # POST
# 抓取 DNS 查询
tcpdump -i ens33 -nn port 53
# 抓取特定标志位
tcpdump -i ens33 -nn 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0' # SYN 或 FIN
tcpdump -i ens33 -nn 'tcp[tcpflags] & tcp-rst != 0' # RST 包
# 保存到文件
tcpdump -i ens33 -w /tmp/capture.pcap -s 0 host 192.168.1.100
tcpdump -r /tmp/capture.pcap -nn # 读取 pcap 文件
# 限制抓包数量
tcpdump -i ens33 -nn -c 100 # 抓 100 个包后停止
TCP 状态分析#
# 三次握手问题分析
tcpdump -i ens33 -nn 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'
# 大量 RST → 端口未监听 / 防火墙 REJECT
tcpdump -i ens33 -nn 'tcp[tcpflags] & tcp-rst != 0'
# 重传统计
tcpdump -i ens33 -nn 'tcp[tcpflags] & tcp-syn != 0 or tcp[tcpflags] & tcp-rst != 0'
9.3 网络故障排查工具#
# === 连通性测试 ===
ping -c 4 -i 0.2 192.168.1.1 # 快速 ping
ping -M do -s 1472 192.168.1.1 # 测试 MTU (禁止分片)
mtr -r -c 10 192.168.1.1 # 路由追踪 + 统计
# === 路由诊断 ===
ip route get 8.8.8.8 # 查看到目标的实际路由
traceroute -n 8.8.8.8 # 路由追踪
tracepath 8.8.8.8 # MTU 发现 + 路由追踪
# === DNS 诊断 ===
dig +short example.com # 简洁输出
dig example.com ANY # 所有记录
dig @8.8.8.8 example.com # 指定 DNS 服务器
dig -x 8.8.8.8 # 反向解析
nslookup example.com # 交互式查询
# === 端口检测 ===
nc -zv 192.168.1.100 80 # TCP 端口扫描
nc -zuv 192.168.1.100 53 # UDP 端口扫描
timeout 3 bash -c '</dev/tcp/192.168.1.100/80 && echo OPEN || echo CLOSED'
# === 扫描工具 ===
nmap -sS 192.168.1.0/24 # SYN 半连接扫描
nmap -sT -p 1-65535 192.168.1.100 # 全端口扫描
nmap -sV -p 80,443 192.168.1.100 # 服务版本探测
nmap -A 192.168.1.100 # 综合扫描 (OS + 服务 + 脚本)
# === 连接状态分析 ===
ss -s # 连接统计摘要
ss -tapn # 所有 TCP 连接
ss -tlnp # 监听端口
ss -tan state time-wait # TIME_WAIT 状态
ss -tan state established # 已建立连接
# 统计各状态连接数
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn
# 统计每个 IP 的连接数
ss -tan | awk 'NR>1{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -20
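上面的统计管道可以封装成一个离线分析函数,既能接 `ss -tan` 的实时输出,也能分析事后保存的快照(示意):

```shell
# conn_stats - 对 ss -tan 输出统计连接状态与对端 IP 分布 (纯文本处理, 示意)
conn_stats() {
    awk 'NR > 1 {
        state[$1]++
        split($5, a, ":"); ip[a[1]]++        # 第 5 列为对端地址, 取 IP 部分
    }
    END {
        for (s in state) printf "state %-12s %d\n", s, state[s]
        for (i in ip)    printf "peer  %-15s %d\n", i, ip[i]
    }'
}
# 用法: ss -tan | conn_stats   或   conn_stats < ss_snapshot.txt
```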
9.4 带宽与延迟分析#
# === iperf3 带宽测试 ===
iperf3 -s # 服务端
iperf3 -c server_ip -t 30 -P 4 # 4 并发, 30 秒
iperf3 -c server_ip -R # 反向 (下载)
iperf3 -c server_ip -u -b 100M # UDP 100Mbps
# === 网卡统计 ===
ethtool -S ens33 # 网卡详细统计
ethtool ens33 # 网卡设置
ethtool -g ens33 # Ring buffer 大小
# === 实时流量 ===
iftop -i ens33 # 实时带宽
nload ens33 # 实时流量图
# === HTTP 延迟分析 ===
curl -w "time_namelookup: %{time_namelookup}\ntime_connect: %{time_connect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" -o /dev/null -s https://example.com
9.5 常见网络问题速查#
| 现象 | 可能原因 | 检查命令 |
| --- | --- | --- |
| ping 通但端口不通 | 防火墙/服务未启动 | `ss -tlnp`, `firewall-cmd --list-ports` |
| 间歇性丢包 | 网卡/交换机/带宽饱和 | `ethtool -S \| grep drop`, `netstat -s` |
| TCP 连接大量 TIME_WAIT | 短连接过多 | `ss -tan state time-wait \| wc -l` |
| DNS 解析慢 | DNS 服务器问题 | `dig +stats` |
| SSH 连接慢 | DNS 反向解析 | `/etc/ssh/sshd_config`: `UseDNS no` |
| 大文件传输慢 | MTU/TCP 窗口 | `tracepath`, 调整 tcp_rmem/wmem |
| 大量 SYN_RECV | SYN Flood / backlog 不够 | `ss -tan state syn-recv`, tcp_max_syn_backlog |
| curl 卡住不动 | 防火墙 DROP (无 RST) | tcpdump 确认是否收到 SYN-ACK |
第十章:故障排查实战#
10.1 CPU 飙升排查#
# 1. 确认高 CPU 进程
top -bn1 -o %CPU | head -20
# 2. 查看进程中的高 CPU 线程
top -H -p <PID>
# 3. 线程 ID 转十六进制 (用于 Java 线程 dump)
printf "%x\n" <TID>
# 4. 查看系统调用
strace -c -p <PID> # 统计系统调用耗时
strace -p <PID> -T # 显示每个调用耗时
# 5. perf 分析
perf top -g -p <PID> # 实时采样
perf record -g -p <PID> -- sleep 30 # 记录 30 秒
perf report # 查看报告
# 6. Java 应用
jstack <PID> # 线程 dump
jstack <PID> | grep -A 20 "0x$(printf "%x" <TID>)"
10.2 内存问题排查#
# 1. 概览
free -h
cat /proc/meminfo
# 2. 进程内存排序
ps aux --sort=-%mem | head -20
ps -eo pid,ppid,cmd,%mem,%cpu,rss --sort=-rss | head -20
# 3. 进程内存详情
cat /proc/<PID>/smaps | grep -E "^(Rss|Pss|Swap):" | awk '{sum+=$2} END {print sum/1024" MB"}'
cat /proc/<PID>/status | grep -E "Vm|Threads"
# 4. 检查是否有内存泄漏
for i in {1..10}; do
cat /proc/<PID>/status | grep VmRSS
sleep 5
done
# 5. slab 内存 (内核)
slabtop -s c
# 6. 查看 OOM 历史
dmesg | grep -i "out of memory"
grep -i "killed process" /var/log/messages
journalctl -k | grep -i oom
# 7. OOM Killer 保护关键进程
echo -1000 > /proc/$(pgrep sshd)/oom_score_adj
10.3 磁盘空间问题#
# === 磁盘满排查流程 ===
# 1. 确认磁盘使用
df -h
# 2. 哪个目录占用大
du -sh /* 2>/dev/null | sort -rh | head -20
du -sh /var/* 2>/dev/null | sort -rh | head -10
# 3. 大文件查找
find / -type f -size +500M -exec ls -lh {} \; 2>/dev/null
find / -type f -size +1G 2>/dev/null
# 4. 已删除但未释放的文件 (进程仍持有)
lsof | grep deleted | awk '{print $1,$2,$7}' | sort -u
lsof +L1 | grep deleted
# 5. inode 耗尽检查
df -i
find / -xdev -type f | cut -d/ -f2 | sort | uniq -c | sort -rn | head -20
# 6. 快速清理
journalctl --vacuum-size=500M
find /var/log -type f -name "*.log" -mtime +30 -delete
yum clean all || apt-get clean
find /tmp -type f -mtime +7 -delete
docker system prune -af 2>/dev/null
10.4 服务无法启动排查#
# === 排查流程 ===
# 1. 查看服务日志
journalctl -u <service> -n 100 --no-pager
systemctl status <service> -l
# 2. 查看系统日志
tail -n 200 /var/log/messages
dmesg | tail -50
# 3. 检查端口冲突
ss -tlnp | grep <PORT>
# 4. 检查文件权限
ls -la /path/to/app/
ls -laZ /path/to/app/ # SELinux 上下文
# 5. 检查依赖
ldd /path/to/binary # 库依赖
# 6. 手动启动排查
sudo -u <user> /path/to/binary # 看报错信息
# 7. SELinux 排查
ausearch -m avc -ts recent
grep denied /var/log/audit/audit.log | tail -20
setenforce 0 # 临时禁用测试
# 测试后恢复
setenforce 1
# 8. 资源限制检查
cat /proc/<PID>/limits
ulimit -a
10.5 网站访问慢排查#
# 1. DNS 解析
dig +stats example.com
# 2. TCP 连接
time nc -zv example.com 443
# 3. HTTP 全链路耗时
curl -w "time_namelookup: %{time_namelookup}\ntime_connect: %{time_connect}\ntime_appconnect: %{time_appconnect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" -o /dev/null -s https://example.com
# 4. SSL 握手
echo | openssl s_client -connect example.com:443 -servername example.com 2>&1 | grep -E "Verify|time|session"
# 5. 后端耗时分析
tail -1000 /var/log/nginx/access.log | awk '{print $NF}' | sort -rn | head -20
# 6. 数据库慢查询
mysql -e "SHOW FULL PROCESSLIST;"
psql -c "SELECT pid, now() - query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;"
# 7. 系统资源瓶颈
top -bn1 | head -5
iostat -xz 1 5
sar -n DEV 1 5
10.6 应急诊断脚本#
#!/bin/bash
# emergency_diag.sh - 应急诊断,收集关键信息
# 用法: bash emergency_diag.sh > diag_$(date +%Y%m%d_%H%M).txt
echo "========== 诊断开始: $(date) =========="
echo "主机: $(hostname)"
echo
echo "=== 系统负载 ==="
uptime
echo
echo "=== CPU TOP 10 ==="
ps aux --sort=-%cpu | head -11
echo
echo "=== 内存 TOP 10 ==="
ps aux --sort=-%mem | head -11
echo
echo "=== 内存概览 ==="
free -h
echo
echo "=== 磁盘使用 ==="
df -h
echo
echo "=== inode 使用 ==="
df -i
echo
echo "=== IO 统计 ==="
iostat -xz 1 3
echo
echo "=== 网络监听 ==="
ss -tlnp
echo
echo "=== 连接统计 ==="
ss -s
echo
echo "=== TIME_WAIT 数量 ==="
ss -tan state time-wait | wc -l
echo
echo "=== 各状态连接数 ==="
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn
echo
echo "=== 最近系统日志 (error) ==="
journalctl -p err -n 50 --no-pager
echo
echo "=== 内核日志 (最近) ==="
dmesg | tail -30
echo
echo "=== OOM 记录 ==="
dmesg | grep -i "out of memory"
echo
echo "===== 诊断结束: $(date) ====="
第十一章:容器化运维#
11.1 Docker 运维要点#
11.1.1 Docker 资源限制#
# === 内存限制 ===
docker run -d --memory="512m" --memory-swap="1g" nginx
# === CPU 限制 ===
docker run -d --cpus="1.5" --cpu-shares=512 nginx
# === 限制验证 ===
docker stats <container>
docker inspect <container> | jq '.[0].HostConfig.Memory'
# Docker Compose 资源限制
services:
app:
image: myapp:latest
deploy:
resources:
limits:
cpus: '2'
memory: 2G
reservations:
cpus: '0.5'
memory: 512M
11.1.2 Docker 运维命令速查#
# === 清理 ===
docker system df # 磁盘使用
docker system prune -af --volumes # 清理所有未使用资源
docker builder prune -a -f # 清理构建缓存
# === 日志 ===
docker logs --tail 100 -f <container>
docker logs --since 10m -f <container>
# 限制日志大小 (/etc/docker/daemon.json)
{
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
# === 调试 ===
docker exec -it <container> sh
docker inspect <container> | jq .
docker stats --no-stream
docker cp <container>:/path/file ./local/path
# === 导出/导入 ===
docker export <container> -o container.tar
docker save <image> -o image.tar
docker load -i image.tar
11.1.3 Docker 生产环境 daemon.json#
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": { "max-size": "10m", "max-file": "3" },
"storage-driver": "overlay2",
"registry-mirrors": ["https://mirror.ccs.tencentyun.com"],
"max-concurrent-downloads": 10,
"max-concurrent-uploads": 5,
"live-restore": true,
"userland-proxy": false,
"default-ulimits": {
"nofile": { "Name": "nofile", "Hard": 65535, "Soft": 65535 }
},
"oom-score-adjust": -500
}
11.2 Docker 故障排查#
# 容器反复重启
docker logs --tail 50 <container>
docker inspect <container> --format '{{.State.OOMKilled}}'
# 检查退出码
docker inspect <container> --format '{{.State.ExitCode}}'
# 0: 正常退出, 137: SIGKILL(OOM/手动), 143: SIGTERM
# Docker 服务问题
journalctl -u docker -n 100
docker system df -v
# 清理构建缓存
docker builder prune --all --force --keep-storage 10GB
11.3 K8s 运维速查#
(详细内容参见 Kubernetes-使用手册.md)
# === 节点管理 ===
kubectl get nodes -o wide
kubectl describe node <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl cordon <node>
kubectl uncordon <node>
# === Pod 调试 ===
kubectl describe pod <pod>
kubectl logs -f <pod> --tail=100
kubectl logs -f <pod> --previous
kubectl exec -it <pod> -- sh
kubectl debug -it <pod> --image=busybox --target=<container>
# === 资源使用 ===
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
# === 事件 ===
kubectl get events -A --sort-by='.lastTimestamp'
kubectl get events -A --field-selector type=Warning
# === etcd 备份 ===
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
第十二章:信创系统运维#
12.1 信创操作系统概览#
| 系统 | 基础 | 包管理 | 内核版本 | 适用场景 |
| --- | --- | --- | --- | --- |
| 麒麟 V10 | openEuler/Debian | dpkg/rpm | 4.19/5.10 | 党政/国防 |
| 统信 UOS | Deepin/Debian | dpkg | 5.10 | 党政/企业桌面 |
| openEuler | 独立 (华为) | rpm (dnf) | 5.10/6.6 | 服务器/云计算 |
| Anolis OS | CentOS 兼容 | rpm (dnf) | 5.10 | 服务器替代 CentOS |
| TencentOS | CentOS 兼容 | rpm (yum) | 5.4 | 腾讯云 |
| openSUSE 龙架构 | openSUSE | rpm (zypper) | 6.x | 龙芯平台 |
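混合架构/多发行版环境下,通用脚本常需要先识别包管理器。下面是一个基于 /etc/os-release 的最小示意(映射关系为粗略假设,麒麟等系统需按具体 SP 版本核实):

```shell
# detect_pkg_mgr - 按 /etc/os-release 粗略判断包管理器 (示意)
detect_pkg_mgr() {
    local f="${1:-/etc/os-release}" id
    # 优先看 ID_LIKE (同名系统不同基座时更可靠), 其次看 ID
    id=$(. "$f" 2>/dev/null; echo "${ID_LIKE:-$ID}" | tr 'A-Z' 'a-z')
    case "$id" in
        *debian*|*ubuntu*|*uos*|*deepin*)                 echo apt ;;
        *openeuler*|*rhel*|*centos*|*fedora*|*anolis*)    echo dnf ;;
        *suse*)                                           echo zypper ;;
        *)                                                echo unknown ;;
    esac
}
# 用法: pm=$(detect_pkg_mgr); "$pm" install -y <package>
```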
12.2 麒麟 V10 运维#
# 版本查看
cat /etc/kylin-release
cat /proc/version
# 包管理 (SP1 基于 Debian, SP2/SP3 基于 openEuler)
# Debian 系列
apt-get update && apt-get install -y <package>
# openEuler 系列
dnf install -y <package>
# 安全策略 (默认启用安全加固)
getenforce # SELinux 状态
aa-status # AppArmor (Debian 系列)
# 国内源配置
# /etc/apt/sources.list (Debian 系列)
deb http://archive.kylinos.cn/kylin/KYLIN-ALL 10.1 main restricted universe multiverse
12.3 统信 UOS 运维#
# 版本信息
cat /etc/os-version
cat /etc/deepin-version
# 包管理 (基于 Debian)
apt-get update && apt-get install -y <package>
# 开发者模式 (安装未经签名的包)
# 设置 → 通用 → 开发者模式
# 与标准 Debian 的主要区别
# 1. 内置安全加固 (安全中心)
# 2. 默认使用 Deepin 桌面
# 3. 部分包名不同 (deepin-terminal 替代 gnome-terminal)
# 4. 应用商店仅包含适配的国产软件
12.4 openEuler 运维#
# 版本信息
cat /etc/openEuler-release
# 包管理 (dnf)
dnf makecache
dnf install -y <package>
dnf groupinstall -y "Development Tools"
# A-Tune (智能性能调优)
dnf install -y atune atune-engine
atune-adm list # 查看优化模板
atune-adm analyze # 系统分析
# iSulad (轻量容器引擎, Docker 替代)
dnf install -y iSulad
systemctl enable --now isulad
isula run -d nginx
# 内核特性 (默认启用 BBR)
sysctl net.ipv4.tcp_congestion_control
12.5 Anolis OS 运维 (CentOS 迁移)#
# === 从 CentOS 8 迁移到 Anolis OS ===
# 1. 备份
cp -r /etc/yum.repos.d /etc/yum.repos.d.bak
# 2. 安装迁移工具
wget https://mirrors.openanolis.cn/anolis/migration/anolis-migration.repo -O /etc/yum.repos.d/anolis-migration.repo
yum install -y anolis-migration
# 3. 执行迁移
anolis-migration --os-release 8
# 4. 重启并验证
reboot
cat /etc/anolis-release
uname -r
# Anolis OS 8.x 与 CentOS 8 保持二进制兼容 (ABI 兼容)
# yum/dnf 源已替换为 openanolis 源, 多数业务无需修改即可运行
12.6 信创系统通用运维注意事项#
# 1. 架构差异 (ARM64/LoongArch)
uname -m
# x86_64 / aarch64 / loongarch64
# 本机编译通常无需指定架构; 交叉编译时用 --host 指定目标平台
./configure --host=aarch64-linux-gnu
# 2. 包名差异
dnf search <keyword> || apt-cache search <keyword>
# 3. 安全加固 (默认更严格)
lsmod | grep -E "selinux|apparmor"
# 可能需要放宽的应用场景
semanage fcontext -a -t httpd_sys_rw_content_t "/data/app(/.*)?"
restorecon -Rv /data/app
# 4. 内核参数 (定制差异)
sysctl -a | grep -E "tcp|netfilter"
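编写需要跨信创发行版运行的脚本时, 常见做法是先探测架构与包管理器, 再按分支执行。下面是一个最小探测函数草案 (函数名 detect_platform 为示例命名, 所用命令均为各发行版通用工具):

```shell
# detect_platform: 输出 "架构 包管理器", 供后续脚本分支使用
detect_platform() {
    arch=$(uname -m)                 # x86_64 / aarch64 / loongarch64
    if command -v dnf >/dev/null 2>&1; then pkg=dnf         # openEuler/Anolis
    elif command -v yum >/dev/null 2>&1; then pkg=yum       # TencentOS
    elif command -v apt-get >/dev/null 2>&1; then pkg=apt-get  # UOS/麒麟(Debian 系)
    elif command -v zypper >/dev/null 2>&1; then pkg=zypper # openSUSE 龙架构
    else pkg=unknown
    fi
    echo "$arch $pkg"
}
detect_platform
```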
第十三章:应急响应#
13.1 应急响应流程#
发现 → 判断 → 止损 → 排查 → 恢复 → 复盘
(1min)(5min) (立即) (1h) (2h) (24h)
13.2 主机被入侵应急#
# 1. 立即隔离 (断网)
ip link set ens33 down   # 或 ifdown ens33 / iptables -P INPUT DROP
# 2. 保留现场关键信息
w # 当前登录用户
last -20 # 登录历史
lastb -20 # 失败登录
history # 当前 shell 历史
# 3. 检查异常进程
ps auxf
ps -eo pid,ppid,user,cmd --sort=-%cpu | head -20
ls -la /proc/*/exe 2>/dev/null | grep deleted # 已删除的可执行文件
# 4. 检查异常网络连接
ss -tanp
ss -tanp | grep ESTAB | awk '{print $5}' | sort | uniq -c | sort -rn
# 5. 检查异常文件
find / -type f -mtime -1 -ls 2>/dev/null # 近 24h 修改
find / -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null  # suid/sgid
find / -name ".*" -type f -size +1M 2>/dev/null # 大隐藏文件
# 6. 检查定时任务
crontab -l
ls -la /var/spool/cron/
cat /etc/crontab
# 7. 检查 SSH
grep -Ev "^#|^$" /etc/ssh/sshd_config
cat /root/.ssh/authorized_keys
find / -name "authorized_keys" 2>/dev/null
# 8. 检查用户变化
awk -F: '$3==0 {print $1}' /etc/passwd # UID 0 的用户 (正常应只有 root)
lastlog
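上述排查命令可汇总为一个取证脚本, 在断网后第一时间把现场信息固化到文件, 避免排查过程中被攻击者或后续操作覆盖 (草案, 输出目录为示例路径):

```shell
#!/bin/bash
# collect_evidence.sh - 入侵现场取证 (草案)
OUT="/root/evidence_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$OUT"
w                      > "$OUT/who.txt"
last -50               > "$OUT/last.txt"  2>&1
lastb -50              > "$OUT/lastb.txt" 2>&1
ps auxf                > "$OUT/ps.txt"
ss -tanp               > "$OUT/ss.txt"
crontab -l             > "$OUT/crontab_root.txt" 2>&1
cp -a /var/spool/cron  "$OUT/cron_spool" 2>/dev/null
find / -xdev -type f -mtime -1 -ls 2>/dev/null > "$OUT/recent_files.txt"
tar -czf "$OUT.tar.gz" "$OUT" && echo "取证包: $OUT.tar.gz"
```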
13.3 DDoS 攻击应急#
# 1. 确认攻击特征
ss -tan | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -20
ss -tan state syn-recv | wc -l
# 2. 源 IP 封禁
iptables -I INPUT -s <IP> -j DROP
# 3. 限制单 IP 并发
iptables -A INPUT -p tcp --dport 80 -m connlimit \
--connlimit-above 50 --connlimit-mask 32 -j DROP
# 4. SYN Cookie 加固
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
# 5. Nginx 限速
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
# limit_conn_zone $binary_remote_addr zone=addr:10m;
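注释中的两条指令展开后大致如下 (zone 名称、大小与速率均为示例值, zone 定义放在 http{} 段, 引用放在对应的 server/location 段):

```nginx
# http 段: 定义限速与并发统计 zone
limit_req_zone  $binary_remote_addr zone=one:10m  rate=10r/s;
limit_conn_zone $binary_remote_addr zone=addr:10m;

server {
    location / {
        limit_req  zone=one burst=20 nodelay;  # 允许 20 个突发请求
        limit_conn addr 10;                    # 单 IP 最多 10 个并发连接
    }
}
```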
13.4 数据库被删库应急#
# 1. 立即停止数据库 (不要 kill -9)
systemctl stop mysqld
# 2. 备份当前所有文件
tar -czf /backup/mysql_data_$(date +%Y%m%d_%H%M).tar.gz /var/lib/mysql/
# 3. 停止应用
systemctl stop app_service
# 4. 检查备份可用性
ls -lh /backup/mysql/
# 5. 恢复最近备份 + binlog (MySQL)
mysqlbinlog --start-datetime="2024-01-01 00:00:00" binlog.000010 > recover.sql
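完整的恢复思路是 "最近一次全量备份 + 重放删库前的 binlog"。以下为草案, 备份文件名、binlog 序号与时间点均为示例, 需按实际事故时间线调整 (--stop-datetime 取删库语句执行前一刻):

```shell
# 1. 恢复最近一次全量备份
gunzip < /backup/mysql/mysql_all_20240101.sql.gz | mysql
# 2. 重放 "备份时刻之后、删库之前" 的 binlog
mysqlbinlog --start-datetime="2024-01-01 01:00:00" \
            --stop-datetime="2024-01-01 14:29:59" \
            binlog.000010 binlog.000011 | mysql
# 3. 核对关键表数据无误后, 再启动应用
```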
13.5 全站 502/503 应急#
# 1. 检查后端服务
systemctl status <service>
ss -tlnp | grep <port>   # 旧系统可用 netstat -tlnp
# 2. 检查资源
free -h
df -h
top -bn1 | head -5
# 3. 查看错误日志
tail -100 /var/log/nginx/error.log
tail -100 /var/log/php-fpm/error.log 2>/dev/null
# 4. 数据库连接
mysql -e "SHOW PROCESSLIST;"
# 5. 快速恢复
systemctl restart <service>
systemctl reload nginx
# 6. 临时扩容
sed -i 's/worker_connections .*/worker_connections 10240;/' /etc/nginx/nginx.conf
nginx -s reload
13.6 事件复盘模板#
## 故障复盘报告
### 基本信息
- 故障时间: 2024-XX-XX XX:XX ~ XX:XX (持续 XX 分钟)
- 影响范围: XX 服务不可用 / XX 功能异常
- 影响用户: 约 XX 用户
### 故障时间线
| 时间 | 事件 |
|------|------|
| 14:30 | 监控告警触发 |
| 14:32 | 运维确认故障 |
| 14:35 | 定位原因 |
| 14:45 | 修复方案确认 |
| 14:50 | 修复完成,服务恢复 |
### 根因分析
- 直接原因:
- 根本原因:
- 5 Whys: ...
### 改进措施
| 序号 | 措施 | 责任人 | 截止日期 |
|------|------|--------|----------|
| 1 | | | |
第十四章:运维脚本工具集#
14.1 SSH 批量管理#
#!/bin/bash
# ssh_batch.sh - 批量 SSH 执行命令
# 用法: ssh_batch.sh "uptime"
HOSTS_FILE="/opt/scripts/hosts.txt"
SSH_USER="ops"
SSH_PORT="2222"
SSH_KEY="/home/ops/.ssh/id_rsa"
[ -z "$1" ] && { echo "用法: $0 <command>"; exit 1; }
while read -r host; do
[[ -z "$host" || "$host" =~ ^# ]] && continue
echo "===== $host ====="
ssh -p "$SSH_PORT" -i "$SSH_KEY" -o StrictHostKeyChecking=no \
-o ConnectTimeout=5 "$SSH_USER@$host" "$1" 2>&1
echo
done < "$HOSTS_FILE"
14.2 SSL 证书自动检查#
#!/bin/bash
# check_certs.sh - SSL 证书到期检查
# crontab: 0 8 * * * /opt/scripts/check_certs.sh
DOMAINS=(
"example.com:443"
"api.example.com:443"
)
ALERT_DAYS=30
WEBHOOK="https://hooks.slack.com/services/xxx"
for entry in "${DOMAINS[@]}"; do
domain="${entry%:*}"
port="${entry#*:}"
expiry=$(echo | openssl s_client -servername "$domain" \
-connect "$domain:$port" 2>/dev/null | \
openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
[ -z "$expiry" ] && { echo "ERROR: $domain 无法获取"; continue; }
expiry_ts=$(date -d "$expiry" +%s)
now_ts=$(date +%s)
remain_days=$(( ($expiry_ts - $now_ts) / 86400 ))
echo "$domain: 剩余 $remain_days 天"
if [ "$remain_days" -lt "$ALERT_DAYS" ]; then
message="[紧急] $domain 证书将在 $remain_days 天后过期!"
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"$message\"}" "$WEBHOOK"
fi
done
14.3 自动清理脚本#
#!/bin/bash
# cleanup.sh - 系统自动清理
# crontab: 0 3 * * * /opt/scripts/cleanup.sh
LOG_FILE="/var/log/cleanup.log"
RETENTION_DAYS=30
log() { echo "[$(date '+%F %T')] $*" | tee -a "$LOG_FILE"; }
log "=== 开始清理 ==="
# 清理旧日志
find /var/log -type f \( -name "*.log.*" -o -name "*.gz" \) -mtime +$RETENTION_DAYS -delete 2>/dev/null
# journald 清理
journalctl --vacuum-size=1G --vacuum-time=${RETENTION_DAYS}d 2>/dev/null
# 清理 /tmp
find /tmp -type f -mtime +7 -delete 2>/dev/null
# 清理包管理缓存
yum clean all 2>/dev/null; apt-get clean 2>/dev/null
# 清理旧内核 (CentOS)
command -v package-cleanup &>/dev/null && package-cleanup --oldkernels --count=2 -y 2>/dev/null
# 清理 core dump
find /var/lib/systemd/coredump -type f -mtime +7 -delete 2>/dev/null
# 磁盘空间警告
DISK_USAGE=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
[ "$DISK_USAGE" -gt 85 ] && log "警告: 根分区使用率 $DISK_USAGE%"
log "=== 清理完成 ==="
14.4 进程守护脚本#
#!/bin/bash
# process_guard.sh - 进程守护
# crontab: */1 * * * * /opt/scripts/process_guard.sh
PROCESSES=(
"nginx"
"mysqld"
"sshd"
)
for proc in "${PROCESSES[@]}"; do
if ! pgrep -x "$proc" > /dev/null; then
echo "[$(date)] $proc 未运行, 尝试启动..."
systemctl restart "$proc" 2>/dev/null || systemctl start "$proc" 2>/dev/null
sleep 3
if pgrep -x "$proc" > /dev/null; then
echo "[$(date)] $proc 启动成功"
else
echo "[$(date)] $proc 启动失败!!"
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"[$(hostname)] 进程 $proc 启动失败!\"}" \
"https://hooks.slack.com/services/xxx"
fi
fi
done
14.5 全量备份脚本 (整合版)#
#!/bin/bash
# full_backup.sh - 全量备份 (文件 + 数据库)
# crontab: 0 1 * * * /opt/scripts/full_backup.sh
BACKUP_BASE="/backup"
DATE=$(date +%Y%m%d)
LOG_FILE="$BACKUP_BASE/backup_$DATE.log"
RETENTION=7
BACKUP_PASS="your_backup_password"
log() { echo "[$(date '+%F %T')] $*" | tee -a "$LOG_FILE"; }
log "=== 全量备份开始 ==="
mkdir -p "$BACKUP_BASE/$DATE"
# 1. 文件备份 (restic)
export RESTIC_REPOSITORY="$BACKUP_BASE/restic"
export RESTIC_PASSWORD="$BACKUP_PASS"
log "执行 restic 备份..."
restic backup /data /etc /opt/scripts 2>&1 | tee -a "$LOG_FILE"
restic forget --keep-daily 7 --keep-weekly 4 --prune 2>&1 | tee -a "$LOG_FILE"
# 2. MySQL 备份
if command -v mysqldump &>/dev/null; then
log "执行 MySQL 备份..."
mysqldump --all-databases --single-transaction \
--routines --triggers --events \
--set-gtid-purged=OFF \
| gzip > "$BACKUP_BASE/$DATE/mysql_all.sql.gz"
log "MySQL 备份完成: $(ls -lh $BACKUP_BASE/$DATE/mysql_all.sql.gz | awk '{print $5}')"
fi
# 3. PostgreSQL 备份
if command -v pg_dumpall &>/dev/null; then
log "执行 PostgreSQL 备份..."
sudo -u postgres pg_dumpall | gzip > "$BACKUP_BASE/$DATE/postgres_all.sql.gz"
log "PG 备份完成: $(ls -lh $BACKUP_BASE/$DATE/postgres_all.sql.gz | awk '{print $5}')"
fi
# 4. 清理旧备份
log "清理 ${RETENTION} 天前的备份..."
find "$BACKUP_BASE" -maxdepth 1 -type d -name "20*" -mtime +$RETENTION -exec rm -rf {} + # 仅匹配日期目录, 避免误删 restic 仓库
# 5. 远程同步
if [ -n "$REMOTE_BACKUP_HOST" ]; then
log "同步到远程..."
rsync -avz --delete "$BACKUP_BASE/" "backup@$REMOTE_BACKUP_HOST:/backup/$(hostname)/"
fi
log "=== 全量备份完成 ==="
第十五章:运维最佳实践#
15.1 目录与命名规范#
目录规范#
/
├── opt/
│ ├── app/ # 应用目录
│ │ ├── bin/ # 可执行文件
│ │ ├── conf/ # 配置文件
│ │ └── lib/ # 库文件
│ └── scripts/ # 运维脚本
├── data/ # 应用数据 (独立分区)
│ ├── app/ # 应用数据
│ └── backup/ # 备份
├── var/log/
│ └── app/ # 应用日志
└── etc/
└── app/ # 应用配置
命名规范#
| 类型 | 规范 | 示例 |
|------|------|------|
| 主机名 | 环境-业务-序号 | prod-web-01, stg-db-02 |
| DNS | 服务.环境.域名 | api.prod.example.com |
| 端口 | 统一规划 | Web: 80xx, API: 81xx, DB: 3306/5432 |
| 用户 | app_<name> | app_web, app_worker |
| 备份文件 | type_YYYYMMDD | mysql_all_20240101.sql.gz |
| 脚本命名 | 动词_对象 | start_app.sh, backup_db.sh |
15.2 变更管理#
# 变更前
# 1. 备份当前配置
cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.$(date +%Y%m%d_%H%M)
# 2. 灰度验证路径
# 测试环境 → 预发布 → 生产一台 → 全量
# 3. 变更记录日志
# /opt/docs/changelog.md
# 变更后
# 4. 保留回滚方案
# 一键回滚: cp /etc/nginx/nginx.conf.OLD /etc/nginx/nginx.conf && nginx -s reload
# 5. 验证
curl -I https://example.com
ansible all -m shell -a "systemctl is-active nginx"
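"备份 → 校验 → 生效 → 失败即回滚" 可以固化为一个小包装脚本, 避免每次变更手工操作遗漏步骤 (以 nginx 配置为例, 草案):

```shell
#!/bin/bash
# safe_change.sh - 配置变更包装 (草案): 先备份, 校验通过才生效, 失败自动回滚
CONF="/etc/nginx/nginx.conf"
BAK="$CONF.$(date +%Y%m%d_%H%M)"
cp "$CONF" "$BAK"
${EDITOR:-vi} "$CONF"            # 或由自动化工具写入新配置
if nginx -t; then
    nginx -s reload
    echo "变更生效, 备份保留在: $BAK"
else
    cp "$BAK" "$CONF"
    echo "配置校验失败, 已回滚到 $BAK" >&2
    exit 1
fi
```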
15.3 监控告警最佳实践#
1. 告警分级
P0 (紧急): 核心服务不可用 → 电话 + 即时消息
P1 (严重): 核心功能降级 → 即时消息
P2 (警告): 预警指标 → 邮件 / 群消息
P3 (通知): 信息性 → 群消息 (静默)
2. 告警设计原则
- 每条告警必须可操作 (不能是"指标高了")
- 告警必须有 runbook (怎么处理)
- 避免告警疲劳 (同一问题聚合)
- 静默期 (维护窗口)
3. 值班制度
- 主值 + 备值
- 明确升级路径
- 记录每次告警的处理情况
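P0-P3 的分级可以直接映射为 Alertmanager 的路由树 (receiver 名称与 severity 标签取值为示例约定, 需与告警规则中打的标签一致):

```yaml
route:
  receiver: im              # 默认走即时消息
  routes:
    - match: { severity: P0 }
      receiver: phone-and-im    # 电话 + 即时消息
      repeat_interval: 5m
    - match: { severity: P1 }
      receiver: im
    - match: { severity: P2 }
      receiver: mail
    - match: { severity: P3 }
      receiver: im-silent       # 群消息 (静默)
```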
15.4 日常巡检清单#
#!/bin/bash
# daily_check.sh - 每日巡检
echo "===== 每日巡检: $(date) ====="
echo -e "\n--- 系统状态 ---"
uptime
free -h | grep -E "^Mem|^Swap"
echo -e "\n--- 磁盘 ---"
df -h | grep -vE "^tmpfs|^devtmpfs|^overlay"
echo -e "\n--- 最近错误日志 ---"
journalctl -p err --since "24 hours ago" --no-pager | tail -20
echo -e "\n--- 关键服务 ---"
for svc in nginx mysqld sshd postgresql docker; do
if systemctl is-active --quiet $svc 2>/dev/null; then
echo " ✓ $svc active"
elif systemctl is-enabled --quiet $svc 2>/dev/null; then
echo " ✗ $svc INACTIVE!"
fi
done
echo -e "\n--- 备份检查 ---"
ls -lh /backup/ | tail -5
echo -e "\n--- 连接状态 ---"
ss -s
echo -e "\n--- 最近登录 ---"
last -5
15.5 运维能力矩阵#
| 能力等级 | 技能要求 | 典型工具 |
|----------|----------|----------|
| 初级 | Linux 基础命令、服务启停、简单排错 | top, journalctl, systemctl |
| 中级 | 监控搭建、自动化部署、性能调优、安全加固 | Prometheus, Ansible, Nginx HA |
| 高级 | 架构设计、灾难恢复、全链路压测、信创适配 | K8s, Terraform, MySQL HA |
| 专家 | 多活架构、运维平台开发、SRE 体系、成本优化 | 自研运维平台, eBPF, 混沌工程 |
参考资源: 本手册与仓库中的以下手册配合使用效果更佳:
Linux-使用手册.md — Linux 基础命令与发行版对比
Docker-使用手册.md — Docker 容器化
Kubernetes-使用手册.md — K8s 编排
Nginx-使用手册.md — Nginx 使用
MySQL-使用手册.md — MySQL 数据库
PostgreSQL-使用手册.md — PostgreSQL 数据库