Linux 运维实战手册

面向运维工程师的实战参考手册,涵盖监控、日志、备份、自动化、安全加固、高可用、性能调优、故障排查、信创系统等核心运维领域。与 Linux-使用手册.md 互补,侧重运维场景与操作实践。


第一章:运维体系概述

1.1 运维核心职责

领域 职责 关键指标
监控 7×24 系统状态感知 MTTR, MTTD
变更管理 可控的系统变更流程 变更成功率
容量规划 资源趋势分析与扩容 资源利用率
故障处理 快速定位与恢复 SLA, RTO, RPO
安全合规 系统加固与审计 漏洞修复时效
自动化 减少人工操作 自动化覆盖率

1.2 运维 SLA 指标

可用性 = (总时间 - 故障时间) / 总时间 × 100%

99%     = 87.6 小时/年   (两个9)
99.9%   = 8.76 小时/年   (三个9)
99.99%  = 52.56 分钟/年  (四个9)
99.999% = 5.26 分钟/年   (五个9)

RTO (Recovery Time Objective): 业务恢复时间目标
RPO (Recovery Point Objective): 数据丢失时间目标
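上表的停机预算可以用一小段脚本自行换算验证 (示意脚本, 一年按 525600 分钟计):

```shell
#!/bin/bash
# 根据可用性百分比计算年度允许停机分钟数 (示意, 一年按 525600 分钟计)
availability_budget() {
    local pct=$1
    awk -v p="$pct" 'BEGIN { printf "%.2f\n", 525600 * (1 - p / 100) }'
}

availability_budget 99.9    # → 525.60 分钟 (≈ 8.76 小时)
availability_budget 99.99   # → 52.56 分钟
```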

1.3 运维工具链全景

┌─────────────────────────────────────────────────────────────────┐
│                        运维工具矩阵                              │
├──────────────┬──────────────────┬───────────────────────────────┤
│ 监控告警     │ Prometheus       │ Grafana, Zabbix, Nagios       │
│ 日志管理     │ ELK, Loki        │ Splunk, Graylog               │
│ 自动化       │ Ansible          │ SaltStack, Puppet, Chef       │
│ CI/CD        │ Jenkins, GitLab  │ ArgoCD, Tekton                │
│ 配置管理     │ Ansible, Terraform│ Pulumi                       │
│ 容器编排     │ K8s, K3s         │ Nomad, Docker Swarm           │
│ 备份恢复     │ restic, Borg     │ Bacula, Veeam                 │
│ 安全扫描     │ Trivy, ClamAV    │ OpenSCAP, Lynis               │
│ 网络诊断     │ tcpdump, nmap    │ Wireshark, mtr                │
│ 压力测试     │ wrk, ab, sysbench│ JMeter, Locust                │
│ 信创/国产    │ 麒麟, 统信UOS    │ 欧拉, Anolis OS               │
└──────────────┴──────────────────┴───────────────────────────────┘

第二章:监控体系建设

2.1 Prometheus 监控栈

2.1.1 架构概览

┌──────────┐   ┌──────────┐   ┌──────────┐
│ node_ex  │   │ mysql_ex │   │ nginx_ex │  ← Exporters
│ porter   │   │ porter   │   │ porter   │
└────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │
     ▼              ▼              ▼
┌─────────────────────────────────────────┐
│          Prometheus Server              │
│  (Pull metrics / 存储时序数据 / 告警判定) │
└──────────┬────────────────────┬─────────┘
           │                    │
           ▼                    ▼
   ┌──────────────┐    ┌──────────────┐
   │  Grafana     │    │ AlertManager │
   │  (可视化)    │    │  (告警管理)   │
   └──────────────┘    └──────┬───────┘
                              │
                              ▼
                      ┌──────────────┐
                      │ Webhook/邮件  │
                      │ 微信/钉钉/飞书│
                      └──────────────┘

2.1.2 Prometheus 安装配置

# === 下载安装 ===
cd /opt
wget https://github.com/prometheus/prometheus/releases/download/v2.52.0/prometheus-2.52.0.linux-amd64.tar.gz
tar xzf prometheus-2.52.0.linux-amd64.tar.gz
ln -s prometheus-2.52.0.linux-amd64 prometheus

# === 创建 systemd 服务 ===
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.enable-lifecycle \
  --web.external-url=http://prometheus.example.com
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

useradd -r -s /sbin/nologin prometheus
mkdir -p /data/prometheus
chown -R prometheus:prometheus /opt/prometheus /data/prometheus
systemctl daemon-reload && systemctl enable --now prometheus

2.1.3 prometheus.yml 核心配置

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    datacenter: 'bj-idc-01'
    env: 'production'

# 告警规则文件
rule_files:
  - 'rules/*.yml'

# 告警管理器
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# 采集目标
scrape_configs:
  # Prometheus 自身
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter (系统指标)
  - job_name: 'node'
    scrape_interval: 30s
    static_configs:
      - targets:
          - '192.168.1.11:9100'
          - '192.168.1.12:9100'
          - '192.168.1.13:9100'
        labels:
          env: 'production'
    # 基于文件的动态发现
    file_sd_configs:
      - files:
          - '/opt/prometheus/targets/node/*.json'
        refresh_interval: 5m

  # MySQL Exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.1.21:9104']
        labels:
          instance: 'mysql-master'

  # Redis Exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['192.168.1.31:9121']

  # Nginx Exporter (需 nginx-module-vts)
  - job_name: 'nginx'
    static_configs:
      - targets: ['192.168.1.41:9113']

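上面 file_sd_configs 引用的 JSON 目标文件格式大致如下 (目标地址与标签均为示意值), 写好后可以先用 python3 校验语法:

```shell
#!/bin/bash
# file_sd_configs 动态发现的目标文件示例 (地址与标签为示意值)
mkdir -p /tmp/prometheus-targets
cat > /tmp/prometheus-targets/node.json << 'EOF'
[
  {
    "targets": ["192.168.1.14:9100", "192.168.1.15:9100"],
    "labels": { "env": "production", "team": "ops" }
  }
]
EOF

# Prometheus 按 refresh_interval 周期性重读该文件, 增删目标无需重启
python3 -m json.tool /tmp/prometheus-targets/node.json > /dev/null && echo "JSON OK"
```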
2.1.4 node_exporter 部署

# 安装 node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.0/node_exporter-1.8.0.linux-amd64.tar.gz
tar xzf node_exporter-1.8.0.linux-amd64.tar.gz
mv node_exporter-1.8.0.linux-amd64/node_exporter /usr/local/bin/

cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat \
  --collector.filesystem.mount-points-exclude='^/(dev|proc|sys|run|var/lib/docker/.+|var/lib/kubelet/.+)' \
  --web.listen-address=:9100

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload && systemctl enable --now node_exporter

2.1.5 常用 Exporters 速查

Exporter 端口 用途
node_exporter 9100 系统 CPU/内存/磁盘/网络
mysqld_exporter 9104 MySQL/MariaDB
redis_exporter 9121 Redis
postgres_exporter 9187 PostgreSQL
nginx-prometheus-exporter 9113 Nginx
blackbox_exporter 9115 HTTP/TCP/ICMP 探测
process-exporter 9256 进程监控
kafka_exporter 9308 Kafka
elasticsearch_exporter 9114 Elasticsearch

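部署后可以用 bash 内建的 /dev/tcp 快速探测这些端口是否在监听 (示意脚本, 主机与端口列表按需调整):

```shell
#!/bin/bash
# 探测 exporter 端口是否可连接 (依赖 bash 的 /dev/tcp 与 timeout 命令)
check_port() {
    local host=$1 port=$2
    if timeout 2 bash -c "echo > /dev/tcp/$host/$port" 2>/dev/null; then
        echo "UP"
    else
        echo "DOWN"
    fi
}

for port in 9100 9104 9121; do
    echo "127.0.0.1:$port $(check_port 127.0.0.1 $port)"
done
```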
2.1.6 Grafana 部署与配置

# Ubuntu/Debian
apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | apt-key add -
add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
apt-get update && apt-get install -y grafana

# CentOS/RHEL 7/8
cat > /etc/yum.repos.d/grafana.repo << 'EOF'
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
yum install -y grafana

systemctl enable --now grafana-server

# 重置管理员密码
grafana-cli admin reset-admin-password newpassword

# 安装常用插件
grafana-cli plugins install grafana-piechart-panel
grafana-cli plugins install grafana-clock-panel
grafana-cli plugins install vonage-status-panel
systemctl restart grafana-server

2.1.7 重要 Grafana Dashboard ID

Dashboard ID 名称 适用场景
1860 Node Exporter Full 服务器全量指标
16098 Node Exporter / nodes 新版服务器监控
7362 MySQL Overview MySQL 监控
763 Redis Dashboard Redis 监控
9628 PostgreSQL Database PostgreSQL
12708 Nginx Overview Nginx 监控
11159 Docker Host & Container Docker 监控

2.1.8 告警规则示例

# /opt/prometheus/rules/node_alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:
      # 主机宕机
      - alert: NodeDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "主机 {{ $labels.instance }} 宕机"
          description: "主机 {{ $labels.instance }} 已超过 2 分钟不可达"

      # CPU 使用率过高
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU 使用率 > 90%: {{ $labels.instance }}"
          description: "当前值: {{ $value | humanize }}%"

      # 内存使用率
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "内存使用率 > 90%: {{ $labels.instance }}"

      # 磁盘使用率
      - alert: HighDiskUsage
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "磁盘使用率 > 85%: {{ $labels.instance }} /"

      # 磁盘预计填满
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "磁盘预计 4 小时内填满: {{ $labels.instance }}"

      # 系统负载过高
      - alert: HighSystemLoad
        expr: node_load15 / count without(cpu, mode)(node_cpu_seconds_total{mode="idle"}) > 1.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "负载过高 load15/cores > 1.5: {{ $labels.instance }}"

      # 磁盘 IO 饱和
      - alert: DiskIOSaturation
        expr: rate(node_disk_io_time_seconds_total{device=~"sd[a-z]+"}[5m]) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "磁盘 IO 饱和: {{ $labels.instance }} {{ $labels.device }}"

      # 网络错误
      - alert: NetworkErrors
        expr: rate(node_network_receive_errors_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "网络接口错误: {{ $labels.instance }} {{ $labels.device }}"

      # 内存即将耗尽
      - alert: OutOfMemorySoon
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "内存即将耗尽 (< 5%): {{ $labels.instance }}"

      # inode 使用率
      - alert: HighInodeUsage
        expr: (1 - node_filesystem_files_free{mountpoint="/"} / node_filesystem_files{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "inode 使用率 > 85%: {{ $labels.instance }}"

2.1.9 AlertManager 配置

# /opt/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  # 邮件配置
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'
  smtp_require_tls: true

# 告警路由
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    - match:
        severity: warning
      receiver: 'warning'

# 接收器
receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'

  - name: 'critical'
    email_configs:
      - to: '[email protected]'
    webhook_configs:
      # 钉钉
      - url: 'https://oapi.dingtalk.com/robot/send?access_token=xxx'
        send_resolved: true
    # 企业微信
    wechat_configs:
      - corp_id: 'wwxxx'
        to_party: '1'
        agent_id: '1000001'
        api_secret: 'xxx'
        send_resolved: true

  - name: 'warning'
    email_configs:
      - to: '[email protected]'

# 抑制规则 (避免告警风暴)
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']

2.2 PromQL 常用查询

# === CPU ===
# CPU 使用率 (%)
100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 各 CPU 模式占比
avg by(cpu,mode)(rate(node_cpu_seconds_total[5m])) * 100

# === 内存 ===
# 可用内存百分比
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# 内存使用率
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# === 磁盘 ===
# 磁盘使用率
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# 磁盘读写速率 MB/s
rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
rate(node_disk_written_bytes_total[5m]) / 1024 / 1024

# 磁盘 IOPS
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])

# disk_io_time (IO 繁忙度)
rate(node_disk_io_time_seconds_total[5m]) * 100

# === 网络 ===
# 网络流量 bytes/sec
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# 网络连接状态
node_netstat_Tcp_CurrEstab

# TCP 重传率
rate(node_netstat_Tcp_RetransSegs[5m]) / rate(node_netstat_Tcp_OutSegs[5m]) * 100

# === 进程 ===
# 打开文件描述符数量
process_open_fds

# === 系统 ===
# 开机时刻 (Unix 时间戳); 运行时长(秒) = time() - node_boot_time_seconds
time() - node_boot_time_seconds

# 预测磁盘空间
predict_linear(node_filesystem_avail_bytes[1h], 24*3600) < 0
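rate() 的语义是"计数器在时间窗口内的每秒平均增量", 可以用两次采样手工复现 (示意; 真实的 rate() 还会处理计数器重置与窗口外推):

```shell
#!/bin/bash
# 手工复现 rate() 语义: (末次采样值 - 首次采样值) / 时间跨度
rate_demo() {
    local v1=$1 v2=$2 dt=$3
    awk -v a="$v1" -v b="$v2" -v t="$dt" 'BEGIN { printf "%.1f\n", (b - a) / t }'
}

# 计数器 60 秒内从 1000 增长到 1600 → 每秒 10 次
rate_demo 1000 1600 60   # → 10.0
```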

2.3 Grafana 告警 (内置)

当不需要 AlertManager 时,Grafana 内置告警可直接使用:

# grafana.ini 配置
[smtp]
enabled = true
host = smtp.example.com:587
user = [email protected]
password = password
from_address = [email protected]

[alerting]
enabled = true
execute_alerts = true

告警通知渠道支持:Email, Slack, PagerDuty, Webhook, 钉钉(通过插件), 企业微信(通过插件)。

2.4 轻量监控方案

2.4.1 Netdata (单机实时监控)

# 一键安装
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# 仅本机访问
sed -i 's/bind to = \*/bind to = 127.0.0.1/g' /etc/netdata/netdata.conf
systemctl restart netdata

# 访问 http://localhost:19999
# 特点: 零配置、极低资源占用、1秒粒度、数千指标自动采集

2.4.2 自定义脚本监控 (最简方案)

#!/bin/bash
# /opt/scripts/monitor.sh
# crontab: */5 * * * * /opt/scripts/monitor.sh

HOSTNAME=$(hostname)
ALERT_WEBHOOK="https://hooks.slack.com/services/xxx"

# CPU
# 注: top 的 CPU 行格式随版本而异, 字段位置可能需调整
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
if (( $(echo "$CPU_USAGE > 90" | bc -l) )); then
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"[${HOSTNAME}] CPU 使用率: ${CPU_USAGE}%\"}" \
        $ALERT_WEBHOOK
fi

# 内存
MEM_AVAIL=$(free -m | awk 'NR==2{print $7}')
if [ "$MEM_AVAIL" -lt 512 ]; then
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"[${HOSTNAME}] 可用内存不足: ${MEM_AVAIL}MB\"}" \
        $ALERT_WEBHOOK
fi

# 磁盘
DISK_USAGE=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"[${HOSTNAME}] 磁盘使用率: ${DISK_USAGE}%\"}" \
        $ALERT_WEBHOOK
fi

# 关键进程检查
for proc in nginx mysqld sshd; do
    if ! pgrep -x "$proc" > /dev/null; then
        curl -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\"[${HOSTNAME}] 进程 $proc 未运行!\"}" \
            $ALERT_WEBHOOK
    fi
done

第三章:日志管理

3.1 rsyslog 配置

3.1.1 基础架构

应用程序 → syslog() → rsyslog → /var/log/messages
                              → /var/log/secure
                              → 远程 syslog 服务器
                              → 管道/程序

3.1.2 rsyslog.conf 配置

# /etc/rsyslog.conf

# === 模块加载 ===
module(load="imuxsock")     # 本地 socket
module(load="imklog")       # 内核日志
module(load="imtcp")        # TCP 接收
module(load="imudp")        # UDP 接收
module(load="impstats"
       interval="300"
       severity="7"
       log.file="/var/log/rsyslog-stats.log")  # 性能统计 (直接写文件)

# === 全局配置 ===
global(
    workDirectory="/var/lib/rsyslog"
    maxMessageSize="64k"
    defaultTemplate="RSYSLOG_TraditionalFileFormat"
    privDropToUser="syslog"
    privDropToGroup="syslog"
)

# === 日志格式模板 ===
template(name="RemoteLogs" type="string"
    string="%TIMESTAMP% %HOSTNAME% %syslogtag% %msg%\n")

template(name="JsonFormat" type="list") {
    constant(value="{")
    constant(value="\"timestamp\":\"")      property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")        property(name="hostname")
    constant(value="\",\"severity\":\"")    property(name="syslogseverity")
    constant(value="\",\"facility\":\"")    property(name="syslogfacility")
    constant(value="\",\"tag\":\"")         property(name="syslogtag" format="json")
    constant(value="\",\"message\":\"")     property(name="msg" format="json")
    constant(value="\"}\n")
}

# === 日志规则 ===
# 认证日志
auth,authpriv.*                     /var/log/secure

# 系统日志
*.info;mail.none;authpriv.none     /var/log/messages

# Cron 日志
cron.*                             /var/log/cron

# 内核日志
kern.*                             /var/log/kern.log

# 邮件日志
mail.*                             /var/log/maillog

# 仅丢弃 debug 级别 (注意: *.debug 会匹配 debug 及以上所有级别)
*.=debug                           stop

# 紧急日志发送给所有登录用户
*.emerg                            :omusrmsg:*

# === 转发到远程 ===
# TCP 转发
*.* @@192.168.1.100:514

# UDP 转发
# *.* @192.168.1.100:514

# 条件转发 (仅错误级别以上)
*.err @@192.168.1.100:514

# === 作为日志服务器接收 ===
input(type="imtcp" port="514" Ruleset="remote")
input(type="imudp" port="514" Ruleset="remote")

# RainerScript 风格模板 (legacy 的 $template 指令不能写在 ruleset 内)
template(name="RemotePath" type="string"
    string="/data/logs/%HOSTNAME%/%$YEAR%-%$MONTH%-%$DAY%.log")

ruleset(name="remote") {
    # 按主机名分文件
    action(type="omfile" dynaFile="RemotePath")
    # 也可输出为 JSON
    # action(type="omfile" dynaFile="RemotePath" template="JsonFormat")
}
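排查日志转发时常需要核对报文头里的 PRI 值, 其计算规则是 facility × 8 + severity (示意脚本; local7 的 facility 编号为 23, info 的 severity 为 6):

```shell
#!/bin/bash
# 计算 syslog PRI 值: facility * 8 + severity
syslog_pri() {
    local facility=$1 severity=$2
    echo $(( facility * 8 + severity ))
}

syslog_pri 23 6   # local7.info → 190, 对应报文头 <190>
syslog_pri 4 6    # auth.info   → 38
```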

3.1.3 应用日志配置示例

# Nginx rsyslog 配置
# /etc/rsyslog.d/nginx.conf
$ModLoad imfile
$InputFileName /var/log/nginx/access.log
$InputFileTag nginx-access:
$InputFileStateFile stat-nginx-access
$InputFileSeverity info
$InputFileFacility local7
$InputRunFileMonitor

$InputFileName /var/log/nginx/error.log
$InputFileTag nginx-error:
$InputFileStateFile stat-nginx-error
$InputFileSeverity error
$InputFileFacility local7
$InputRunFileMonitor

local7.* @@192.168.1.100:514

3.2 journald (systemd 日志)

# === 查看日志 ===
journalctl                           # 所有日志
journalctl -n 100                    # 最近 100 行
journalctl -f                        # tail -f 模式
journalctl -k                        # 内核日志
journalctl -u nginx                  # 指定服务
journalctl -u nginx --since today    # 今天的日志
journalctl -u nginx --since "2024-01-01" --until "2024-01-02"
journalctl -p err                     # 仅错误级别以上
journalctl -p emerg..err              # emerg 到 err
journalctl _PID=1234                  # 按 PID
journalctl _UID=0                     # 按 UID (root)
journalctl -o json-pretty             # JSON 输出
journalctl --disk-usage               # 日志占用空间
journalctl -u sshd | grep "Failed"   # 配合 grep

# === journald 配置 ===
# /etc/systemd/journald.conf
[Journal]
Storage=persistent        # 持久化到磁盘
Compress=yes             # 压缩
Seal=yes                 # 防篡改密封
SystemMaxUse=4G          # 最多使用 4G
SystemMaxFileSize=100M   # 单文件最大
MaxRetentionSec=2week    # 最多保留 2 周
RuntimeMaxUse=1G         # /run 下最大使用
ForwardToSyslog=no       # 是否转发到 syslog
ForwardToConsole=no

systemctl restart systemd-journald

3.3 logrotate 日志轮转

# /etc/logrotate.conf (全局)
weekly
rotate 12
create
dateext
compress
include /etc/logrotate.d

# /etc/logrotate.d/nginx (应用级)
/var/log/nginx/*.log {
    daily                       # 每天轮转
    missingok                   # 日志不存在不报错
    rotate 30                   # 保留 30 天
    compress                    # 压缩旧日志
    delaycompress               # 延迟一个周期压缩
    notifempty                  # 空文件不轮转
    create 640 nginx adm        # 创建新文件权限
    sharedscripts               # 轮转完后执行一次脚本
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 $(cat /var/run/nginx.pid)
    endscript
    dateext                     # 日期后缀
    dateformat -%Y%m%d
    maxsize 500M                # 超过 500M 强制轮转
}

# /etc/logrotate.d/syslog
/var/log/cron
/var/log/maillog
/var/log/messages
/var/log/secure
/var/log/spooler
{
    weekly
    rotate 12
    compress
    dateext
    missingok
    sharedscripts
    postrotate
        /bin/kill -HUP $(cat /var/run/syslogd.pid 2>/dev/null) 2>/dev/null || true
    endscript
}

# 手动执行轮转
logrotate -f /etc/logrotate.conf
logrotate -d /etc/logrotate.d/nginx   # 调试模式 (不实际轮转)

# Crontab 中每日执行
# 0 0 * * * /usr/sbin/logrotate /etc/logrotate.conf

3.4 ELK Stack (Elasticsearch + Logstash + Kibana)

3.4.1 架构

应用日志 → Filebeat → Logstash → Elasticsearch → Kibana
                                           ↕
                                    数据节点集群 (3+)

3.4.2 Filebeat 配置

# /etc/filebeat/filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/nginx/access.log
    fields:
      app: nginx
      type: access
    fields_under_root: true
    # 默认 access 日志为 combined 文本格式; 若已改为 JSON 输出可启用:
    # json.keys_under_root: true
    # json.add_error_key: true

  - type: log
    enabled: true
    paths:
      - /var/log/nginx/error.log
    fields:
      app: nginx
      type: error
    fields_under_root: true
    multiline.pattern: '^\d{4}/\d{2}/\d{2}'
    multiline.negate: true
    multiline.match: after

  - type: log
    enabled: true
    paths:
      - /var/log/messages
    fields:
      app: system
      type: syslog

# === 输出到 Logstash ===
output.logstash:
  hosts: ["192.168.1.101:5044"]
  loadbalance: true
  compression_level: 3

# === 或直接输出到 Elasticsearch ===
# output.elasticsearch:
#   hosts: ["192.168.1.101:9200", "192.168.1.102:9200"]
#   index: "filebeat-%{[agent.version]}-%{+yyyy.MM.dd}"

# === 处理器 ===
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - drop_fields:
      fields: ["agent.ephemeral_id", "agent.id"]

# === 日志本身 ===
logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat.log
  keepfiles: 7

3.4.3 Logstash 配置

# /etc/logstash/conf.d/pipeline.conf
input {
  beats {
    port => 5044
    client_inactivity_timeout => 3600
  }
}

filter {
  if [type] == "access" {
    grok {
      match => {
        "message" => '%{IPORHOST:client_ip} - %{DATA:remote_user} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{DATA:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:body_bytes_sent} "%{DATA:http_referer}" "%{DATA:http_user_agent}"'
      }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
      target => "@timestamp"
    }
    geoip {
      source => "client_ip"
    }
    useragent {
      source => "http_user_agent"
      target => "user_agent"
    }
  }

  if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://192.168.1.101:9200", "http://192.168.1.102:9200"]
    index => "logstash-%{[app]}-%{+YYYY.MM.dd}"
    manage_template => false
  }
  # ES 不可用时可启用死信队列, 但它在 logstash.yml 中配置, 不是 output 插件:
  # dead_letter_queue.enable: true
  # path.dead_letter_queue: "/data/logstash/dead_letter_queue"
  # dead_letter_queue.max_bytes: 1gb
}
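上面 grok 对 combined 格式的拆解效果, 可以用 awk 粗略模拟来理解各字段的位置 (仅示意, 不能替代 grok):

```shell
#!/bin/bash
# 用 awk 模拟 grok 对 Nginx combined 日志的部分字段提取 (示意)
line='192.168.1.50 - - [01/Jan/2024:12:00:00 +0800] "GET /index.html HTTP/1.1" 200 612 "-" "curl/8.0"'

parse_access() {
    # combined 格式按空格切分后: $1=client_ip, $6="METHOD, $7=uri, $9=status
    echo "$1" | awk '{ gsub(/"/, "", $6); printf "ip=%s method=%s uri=%s status=%s\n", $1, $6, $7, $9 }'
}

parse_access "$line"   # → ip=192.168.1.50 method=GET uri=/index.html status=200
```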

3.5 Grafana Loki (轻量日志方案)

Loki 是 Grafana 生态的日志方案,类似 Prometheus 但用于日志:

# promtail (日志采集器) 配置
# /etc/promtail/config.yml
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: ${HOSTNAME}        # 环境变量展开需启动参数 -config.expand-env=true
          __path__: /var/log/*.log

  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          host: ${HOSTNAME}
          __path__: /var/log/nginx/*.log
    pipeline_stages:
      - match:
          # promtail 会自动为每个文件附加 filename 标签
          selector: '{job="nginx", filename="/var/log/nginx/error.log"}'
          stages:
            - regex:
                expression: '^(?P<time>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\]'
            - labels:
                level:

# loki 配置 (单机)
# /etc/loki/loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 30m
  max_chunk_age: 1h
  chunk_target_size: 1536000
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/index
    cache_location: /data/loki/cache
  filesystem:
    directory: /data/loki/chunks

limits_config:
  enforce_metric_name: false
  max_entries_limit_per_query: 5000
  retention_period: 30d

compactor:
  working_directory: /data/loki/compactor
  shared_store: filesystem

第四章:备份与灾难恢复

4.1 备份策略 3-2-1 原则

3 份数据副本 (生产 + 2 备份)
2 种不同存储介质 (本地磁盘 + 磁带/云存储)
1 份异地备份 (不同数据中心/区域)

4.2 rsync 备份方案

# === 基础用法 ===
# 本地同步
rsync -avz --delete /data/ /backup/

# 远程同步 (SSH)
rsync -avz -e "ssh -p 22" /data/ [email protected]:/backup/

# 远程拉取
rsync -avz [email protected]:/data/ /backup/

# === 生产级备份脚本 ===
#!/bin/bash
# /opt/scripts/backup.sh

BACKUP_SRC="/data/app"
BACKUP_DST="/backup"
REMOTE_HOST="[email protected]"
REMOTE_PATH="/backup/$(hostname)"
LOG_FILE="/var/log/backup.log"
EXCLUDE_FILE="/opt/scripts/backup_exclude.txt"
LOCK_FILE="/var/run/backup.lock"

log() {
    echo "[$(date '+%F %T')] $*" | tee -a "$LOG_FILE"
}

# 防止并发执行
exec 200>"$LOCK_FILE"
flock -n 200 || { log "备份已在运行,退出"; exit 1; }

log "=== 开始备份 ==="

# 1. 本地每日快照 (保留 7 天)
rsync -avz \
    --delete \
    --exclude-from="$EXCLUDE_FILE" \
    --link-dest="$BACKUP_DST/latest" \
    "$BACKUP_SRC/" \
    "$BACKUP_DST/$(date +%Y%m%d)/"

# 2. 更新 latest 符号链接
rm -f "$BACKUP_DST/latest"
ln -s "$BACKUP_DST/$(date +%Y%m%d)" "$BACKUP_DST/latest"

# 3. 远程同步
rsync -avz --delete \
    -e "ssh -p 22 -i /root/.ssh/backup_key" \
    "$BACKUP_DST/" "$REMOTE_HOST:$REMOTE_PATH/"

# 4. 清理旧备份 (超过 30 天的远程备份)
ssh -i /root/.ssh/backup_key "$REMOTE_HOST" \
    "find $REMOTE_PATH -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +"

log "=== 备份完成 ==="
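--link-dest 增量快照的核心是"未变化的文件在各快照间共享同一硬链接", 这一点可以用 cp -al 直观验证 (示意, 不依赖 rsync 本身):

```shell
#!/bin/bash
# 验证硬链接快照原理: cp -al 复制目录树但文件共享 inode
demo=$(mktemp -d)
mkdir -p "$demo/day1"
echo "data" > "$demo/day1/file.txt"

cp -al "$demo/day1" "$demo/day2"      # 第二份"快照"几乎不占额外空间

links=$(stat -c '%h' "$demo/day1/file.txt")
echo "硬链接数: $links"                # → 2 (两个快照指向同一 inode)
rm -rf "$demo"
```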

4.3 restic (现代 Go 备份工具)

# === 安装 ===
wget https://github.com/restic/restic/releases/download/v0.16.4/restic_0.16.4_linux_amd64.bz2
bunzip2 restic_0.16.4_linux_amd64.bz2
mv restic_0.16.4_linux_amd64 /usr/local/bin/restic
chmod +x /usr/local/bin/restic

# === 初始化仓库 ===
export RESTIC_REPOSITORY=/backup/restic
export RESTIC_PASSWORD=your_strong_password
restic init

# === 远程仓库 ===
export RESTIC_REPOSITORY=sftp:backup@storage:/backup/restic
export RESTIC_PASSWORD=your_strong_password
restic init

# S3 兼容 (MinIO / AWS S3)
export AWS_ACCESS_KEY_ID=xxx
export AWS_SECRET_ACCESS_KEY=yyy
export RESTIC_REPOSITORY=s3:s3.amazonaws.com/bucket-name/restic
restic init

# === 备份 ===
restic backup /data /etc /var/log --exclude "*.tmp" --exclude "*.log"

# === 快照管理 ===
restic snapshots                          # 列出快照
restic diff <snapshot1> <snapshot2>       # 比较快照差异
restic stats                              # 仓库统计

# === 恢复 ===
restic restore latest --target /restore/path/    # 恢复最新
restic restore <snapshot_id> --target /restore/  # 恢复指定快照
restic restore latest --target /restore/ --include "/data/app/*"

# === 清理 ===
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --keep-yearly 2
restic prune     # 删除未引用的数据块
restic check     # 验证仓库完整性

# === 自动备份脚本 ===
#!/bin/bash
export RESTIC_REPOSITORY=/backup/restic
export RESTIC_PASSWORD_FILE=/root/.restic-password
BACKUP_SRC="/data /etc /var/log"

restic backup $BACKUP_SRC \
    --exclude "*.tmp" \
    --exclude "*.bak" \
    --tag "$(date +%Y%m%d)" \
    --host "$(hostname)"

restic forget \
    --keep-daily 7 \
    --keep-weekly 4 \
    --keep-monthly 12 \
    --keep-yearly 2 \
    --prune

restic check --read-data-subset=2%

4.4 数据库备份

MySQL

#!/bin/bash
# MySQL 全量 + binlog 备份

DB_USER="backup"
DB_PASS="password"
BACKUP_DIR="/backup/mysql"
RETENTION_DAYS=7

# 全量备份 (mysqldump)
mysqldump -u$DB_USER -p$DB_PASS --all-databases \
    --single-transaction \
    --routines --triggers --events \
    --master-data=2 \
    --set-gtid-purged=OFF \
    | gzip > "$BACKUP_DIR/full_$(date +%Y%m%d_%H%M).sql.gz"

# 或使用 xtrabackup (物理备份,大库推荐)
xtrabackup --backup \
    --user=$DB_USER --password=$DB_PASS \
    --target-dir="$BACKUP_DIR/xtra_$(date +%Y%m%d_%H%M)" \
    --compress --compress-threads=4

# 清理旧备份 (xtrabackup 产生目录, -delete 无法删除非空目录)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -mtime +$RETENTION_DAYS -exec rm -rf {} +
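备份脚本还应例行抽检产物完整性, gzip -t 能快速发现截断或损坏的压缩备份 (示意脚本, 用人为截断的文件模拟损坏):

```shell
#!/bin/bash
# 用 gzip -t 抽检压缩备份的完整性 (以人为截断的文件模拟损坏)
tmp=$(mktemp -d)
echo "CREATE TABLE t (id INT);" | gzip > "$tmp/good.sql.gz"
head -c 10 "$tmp/good.sql.gz" > "$tmp/broken.sql.gz"   # 截断模拟传输中断

verify() {
    if gzip -t "$1" 2>/dev/null; then echo "OK"; else echo "CORRUPT"; fi
}

verify "$tmp/good.sql.gz"     # → OK
verify "$tmp/broken.sql.gz"   # → CORRUPT
rm -rf "$tmp"
```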

PostgreSQL

#!/bin/bash
# PostgreSQL 备份

BACKUP_DIR="/backup/postgres"
export PGPASSWORD="password"

# 逻辑备份
pg_dumpall -U postgres -h localhost | gzip > "$BACKUP_DIR/pg_all_$(date +%Y%m%d).sql.gz"

# 单库自定义格式 (-Fc, 恢复时 pg_restore -j 可并行)
pg_dump -U postgres -h localhost -Fc mydb > "$BACKUP_DIR/mydb_$(date +%Y%m%d).dump"
# 并行导出 (-j) 仅适用于目录格式 -Fd:
# pg_dump -U postgres -h localhost -Fd -j 4 -f "$BACKUP_DIR/mydb_$(date +%Y%m%d).d" mydb

# WAL 归档配置 (postgresql.conf)
# wal_level = replica
# archive_mode = on
# archive_command = 'test ! -f /backup/pg_wal/%f && cp %p /backup/pg_wal/%f'

4.5 系统级备份

# === 分区镜像备份 (dd) ===
dd if=/dev/sda1 of=/backup/sda1_$(date +%Y%m%d).img bs=4M status=progress

# 压缩
dd if=/dev/sda1 bs=4M | gzip > /backup/sda1.img.gz

# 恢复
gunzip -c /backup/sda1.img.gz | dd of=/dev/sda1 bs=4M status=progress

# === tar 系统备份 ===
tar -cvpzf /backup/system_$(date +%Y%m%d).tar.gz \
    --exclude=/proc \
    --exclude=/tmp \
    --exclude=/sys \
    --exclude=/dev \
    --exclude=/run \
    --exclude=/mnt \
    --exclude=/media \
    --exclude=/backup \
    --exclude=/lost+found \
    /

# 恢复
tar -xvpzf /backup/system_20240101.tar.gz -C /
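正式操作分区之前, 可以先在普通文件上演练一遍 dd + gzip 的镜像/恢复往返, 用校验确认无损 (示意脚本, 不触碰真实分区):

```shell
#!/bin/bash
# 在普通文件上演练 dd + gzip 镜像与恢复 (不触碰真实分区)
tmp=$(mktemp -d)
dd if=/dev/urandom of="$tmp/disk.img" bs=1K count=64 2>/dev/null

# 备份: 读出并压缩
dd if="$tmp/disk.img" bs=1K 2>/dev/null | gzip > "$tmp/disk.img.gz"

# 恢复: 解压并写回
gunzip -c "$tmp/disk.img.gz" | dd of="$tmp/restored.img" bs=1K 2>/dev/null

# 逐字节比较, 一致说明往返无损
cmp -s "$tmp/disk.img" "$tmp/restored.img" && result="一致" || result="不一致"
echo "恢复校验: $result"
rm -rf "$tmp"
```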

4.6 灾难恢复演练

#!/bin/bash
# DR 演练检查清单脚本

echo "=== 灾难恢复检查清单 ==="
echo "日期: $(date)"

# 1. 验证备份完整性
echo -e "\n[1/6] 验证备份完整性..."
BACKUP_DIR="/backup"
LATEST=$(ls -t "$BACKUP_DIR" | head -1)
if [ -n "$LATEST" ]; then
    echo "  ✓ 最新备份: $LATEST"
else
    echo "  ✗ 未找到备份!"
fi

# 2. 验证备份可恢复性 (抽检)
echo -e "\n[2/6] 验证备份可恢复性..."
if restic check --read-data-subset=1%; then
    echo "  ✓ restic 仓库完整性验证通过"
fi

# 3. 验证数据库备份
echo -e "\n[3/6] 验证数据库备份..."
LATEST_SQL=$(ls -t /backup/mysql/*.sql.gz 2>/dev/null | head -1)
if gzip -t "$LATEST_SQL" 2>/dev/null; then
    echo "  ✓ MySQL 备份文件完整性验证通过"
fi

# 4. 验证远程同步
echo -e "\n[4/6] 验证远程备份..."
if rsync -azn --delete /backup/ [email protected]:/backup/; then
    echo "  ✓ 远程连接正常"
fi

# 5. 验证恢复文档
echo -e "\n[5/6] 验证恢复文档..."
if [ -f /opt/docs/recovery_procedure.md ]; then
    echo "  ✓ 恢复文档存在"
else
    echo "  ✗ 缺少恢复文档!"
fi

# 6. 验证恢复时间
echo -e "\n[6/6] 恢复时间估算..."
echo "  上次全量恢复耗时: ~30分钟 (记录于 2024-01-01)"
echo "  RTO 目标: 2小时"
echo "  RPO 目标: < 1小时 (binlog 实时同步)"

echo -e "\n=== 检查完成 ==="

第五章:自动化运维

5.1 Ansible 基础

5.1.1 核心概念

┌─────────────┐
│  控制节点    │  (Ansible 安装在此)
│  playbook   │  无需 agent,通过 SSH 管理
└──────┬──────┘
       │ SSH
  ┌────┼────┐
  ▼    ▼    ▼
┌───┐┌───┐┌───┐
│ N1││ N2││ N3│  被管理节点 (只需 Python)
└───┘└───┘└───┘

5.1.2 安装与配置

# 安装
yum install -y ansible        # CentOS/RHEL
apt-get install -y ansible    # Ubuntu/Debian
pip install ansible            # pip (最新版)

# 验证
ansible --version

# === ansible.cfg 配置 ===
# /etc/ansible/ansible.cfg 或 ./ansible.cfg
[defaults]
inventory      = ./hosts
host_key_checking = False
remote_user    = root
private_key_file = /root/.ssh/id_rsa
forks          = 20
timeout        = 30
log_path       = /var/log/ansible.log
gathering      = smart
fact_caching   = jsonfile
fact_caching_connection = /tmp/ansible_cache
fact_caching_timeout = 3600
retry_files_enabled = False
callback_whitelist = timer, profile_tasks
stdout_callback = yaml

[privilege_escalation]
become         = True
become_method  = sudo
become_user    = root

[ssh_connection]
pipelining     = True
control_path   = /tmp/ansible-%%h-%%p-%%r

5.1.3 Inventory 主机清单

# hosts (静态清单)
[webservers]
web01 ansible_host=192.168.1.11
web02 ansible_host=192.168.1.12
web03 ansible_host=192.168.1.13

[dbservers]
db01 ansible_host=192.168.1.21 ansible_user=dbadmin
db02 ansible_host=192.168.1.22

[appservers]
app[01:05].example.com             # 范围: app01 ~ app05

[production:children]              # 分组嵌套
webservers
dbservers

[production:vars]                  # 组变量
ansible_user=root
ntp_server=ntp.prod.example.com

[all:vars]                         # 全局变量
ansible_port=22

# hosts.yml (YAML 清单)
all:
  hosts:
    bastion:
      ansible_host: 1.2.3.4
  children:
    production:
      hosts:
        web[01:03].example.com:
      vars:
        env: production
    staging:
      hosts:
        web-stg.example.com:
      vars:
        env: staging

5.1.4 常用 Ad-Hoc 命令

# 基本语法: ansible <pattern> -m <module> -a "<arguments>"

# === 信息收集 ===
ansible all -m ping                                       # 存活检测
ansible all -m setup                                      # 收集 facts
ansible all -m setup -a "filter=ansible_memory_mb"        # 过滤 facts
ansible all -m shell -a "hostname; uptime"                # 执行 shell

# === 文件操作 ===
ansible all -m copy -a "src=/tmp/file dest=/tmp/file"     # 拷贝文件
ansible all -m fetch -a "src=/etc/hosts dest=/tmp/"       # 拉取文件
ansible all -m file -a "path=/data state=directory mode=0755"  # 创建目录
ansible all -m replace -a "path=/etc/nginx/nginx.conf regexp='worker_processes.*' replace='worker_processes 8;'"  # 文件内容替换

# === 包管理 ===
ansible all -m yum -a "name=nginx state=latest"           # 安装 (RHEL)
ansible all -m apt -a "name=nginx state=latest update_cache=yes"  # 安装 (Debian)

# === 服务管理 ===
ansible all -m systemd -a "name=nginx state=restarted"    # 重启服务
ansible all -m systemd -a "name=nginx enabled=yes"        # 开机启动

# === 用户管理 ===
ansible all -m user -a "name=app password={{ 'mypass' | password_hash('sha512') }} groups=wheel state=present"

# === 内核参数 ===
ansible all -m sysctl -a "name=net.ipv4.tcp_tw_reuse value=1 sysctl_set=yes reload=yes"

# === 计划任务 ===
ansible all -m cron -a "name='log cleanup' hour=2 job='/opt/scripts/cleanup.sh'"

# === 防火墙 ===
ansible all -m firewalld -a "port=80/tcp permanent=yes state=enabled immediate=yes"

5.1.5 Playbook 编写

# deploy_webapp.yml
---
- name: 部署 Web 应用
  hosts: webservers
  become: yes
  vars:
    app_name: myapp
    app_port: 8080
    app_version: "1.2.3"
    nginx_worker_processes: "{{ ansible_processor_vcpus }}"

  vars_files:
    - vars/secrets.yml   # 加密的敏感变量

  pre_tasks:
    - name: 更新 yum 缓存
      yum:
        update_cache: yes
        name: '*'
        state: latest
      when: ansible_os_family == "RedHat"
      tags: [update]

    - name: 检查磁盘空间
      shell: df -h /data | awk 'NR==2{print $5}' | sed 's/%//'
      register: disk_usage
      failed_when: disk_usage.stdout|int > 90

  tasks:
    - name: 安装基础包
      package:
        name: "{{ item }}"
        state: present
      loop:
        - nginx
        - supervisor
        - python3
      tags: [packages]

    - name: 创建应用用户
      user:
        name: "{{ app_name }}"
        system: yes
        shell: /sbin/nologin
        create_home: no
      tags: [user]

    - name: 创建目录结构
      file:
        path: "{{ item }}"
        state: directory
        owner: "{{ app_name }}"
        mode: '0755'
      loop:
        - /opt/{{ app_name }}
        - /opt/{{ app_name }}/config
        - /data/{{ app_name }}
        - /var/log/{{ app_name }}
      tags: [dirs]

    - name: 部署应用文件
      copy:
        src: files/{{ app_name }}-{{ app_version }}.jar
        dest: /opt/{{ app_name }}/{{ app_name }}.jar
        owner: "{{ app_name }}"
        mode: '0644'
      notify: restart app          # 触发 handler
      tags: [deploy]

    - name: 配置 Nginx 反向代理
      template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/conf.d/{{ app_name }}.conf
        validate: nginx -t -c %s
      notify: reload nginx
      tags: [nginx]

    - name: 配置 systemd 服务
      template:
        src: templates/app.service.j2
        dest: /etc/systemd/system/{{ app_name }}.service
      notify: restart app
      tags: [service]

    - name: 启动服务
      systemd:
        name: "{{ app_name }}"
        state: started
        enabled: yes
      tags: [service]

  handlers:
    - name: reload nginx
      systemd:
        name: nginx
        state: reloaded

    - name: restart app
      systemd:
        name: "{{ app_name }}"
        state: restarted

  post_tasks:
    - name: 健康检查
      uri:
        url: "http://localhost:{{ app_port }}/health"
        status_code: 200
      retries: 10
      delay: 3
      until: result.status == 200
      register: result
      tags: [verify]

    - name: 发送部署通知
      slack:
        token: "{{ slack_token }}"
        msg: "{{ app_name }} v{{ app_version }} 部署完成 [{{ ansible_hostname }}]"
      tags: [notify]
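pre_tasks 中的磁盘检查管道 (df → awk → sed → 阈值判定) 可以先在本地验证。下面是等价判定逻辑的草图,用内置样例代替真实的 df 输出:

```shell
# 模拟 df -h /data 第二行的输出, 提取使用率并按 90% 阈值判定
df_line="/dev/sda1  100G  92G  8G  92% /data"
usage=$(echo "$df_line" | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt 90 ]; then
    verdict="FAIL"    # 对应 playbook 中 failed_when 触发
else
    verdict="OK"
fi
echo "$verdict: 当前使用率 ${usage}%"
```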

5.1.6 Jinja2 模板示例

{# templates/nginx.conf.j2 #}
upstream {{ app_name }}_backend {
{% for host in groups['appservers'] %}
    server {{ hostvars[host]['ansible_host'] }}:{{ app_port }} weight=1 max_fails=3 fail_timeout=30s;
{% endfor %}
    keepalive 32;
}

server {
    listen 80;
    server_name {{ app_name }}.example.com;

    access_log /var/log/nginx/{{ app_name }}_access.log;
    error_log  /var/log/nginx/{{ app_name }}_error.log;

    location / {
        proxy_pass http://{{ app_name }}_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # 基于环境控制超时
        {% if env == 'production' %}
        proxy_read_timeout 60s;
        {% else %}
        proxy_read_timeout 300s;
        {% endif %}
    }

    # 仅生产环境启用 SSL
    {% if env == 'production' %}
    listen 443 ssl;
    ssl_certificate     /etc/ssl/certs/{{ app_name }}.crt;
    ssl_certificate_key /etc/ssl/private/{{ app_name }}.key;
    if ($scheme != "https") {
        return 301 https://$host$request_uri;
    }
    {% endif %}
}

5.1.7 Ansible Vault (加密敏感数据)

# 创建加密文件
ansible-vault create vars/secrets.yml

# 编辑加密文件
ansible-vault edit vars/secrets.yml

# 加密已有文件
ansible-vault encrypt vars/secrets.yml

# 使用密码文件 (生产环境)
echo "my_vault_password" > ~/.vault_pass
ansible-vault encrypt vars/secrets.yml --vault-password-file ~/.vault_pass

# 运行 playbook 时解密
ansible-playbook deploy.yml --vault-password-file ~/.vault_pass
ansible-playbook deploy.yml --ask-vault-pass   # 交互输入

# 多环境密码 (vault-id)
ansible-vault encrypt --vault-id prod@prompt vars/prod/secrets.yml
ansible-playbook deploy.yml --vault-id prod@prompt
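vault 密码文件本身是高危文件,务必收紧权限、仅属主可读写。下面用临时文件演示建议的权限设置 (路径为示例):

```shell
# 以临时文件演示 vault 密码文件的权限收紧
pass_file=$(mktemp)
echo "my_vault_password" > "$pass_file"
chmod 600 "$pass_file"
perm=$(stat -c '%a' "$pass_file")
echo "$perm"    # 600
rm -f "$pass_file"
```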

5.1.8 Ansible Roles

roles/
├── nginx/
│   ├── tasks/
│   │   ├── main.yml       # 主任务入口
│   │   ├── install.yml    # 安装
│   │   └── configure.yml  # 配置
│   ├── handlers/
│   │   └── main.yml       # Handler
│   ├── templates/
│   │   └── nginx.conf.j2
│   ├── files/
│   │   └── index.html
│   ├── vars/
│   │   └── main.yml       # 高优先级变量 (通常不被覆盖)
│   ├── defaults/
│   │   └── main.yml       # 低优先级变量 (可覆盖)
│   ├── meta/
│   │   └── main.yml       # 依赖声明
│   └── tests/
│       └── test.yml

# playbook 中使用 roles
---
- hosts: webservers
  roles:
    - role: nginx
      nginx_port: 8080          # 覆盖默认变量
      tags: [nginx]
    - role: app
      tags: [app]

5.2 其他自动化工具对比

| 特性 | Ansible | SaltStack | Puppet | Chef |
|------|---------|-----------|--------|------|
| 架构 | 无 agent (SSH) | Agent + Master | Agent + Master | Agent + Master |
| 配置语言 | YAML | YAML + Python | 自定义 DSL | Ruby DSL |
| 学习曲线 | 低 | 中 | 高 | 高 |
| 实时性 | 推送模型 | 事件驱动 (快) | 拉取 (30min) | 拉取 (30min) |
| 社区 | 最大 | 中型 | 大型 | 中型 |
| 适用场景 | 通用/中小规模 | 大规模/实时 | 大规模合规 | 大规模/复杂 |

第六章:安全加固

6.1 系统基础安全

# === 1. SSH 加固 ===
# /etc/ssh/sshd_config
Port 2222                              # 修改默认端口
Protocol 2                             # 仅 SSHv2
PermitRootLogin no                     # 禁止 root 登录
PasswordAuthentication no              # 禁用密码认证
PubkeyAuthentication yes               # 仅密钥认证
MaxAuthTries 3                         # 最大尝试次数
ClientAliveInterval 300                # 客户端心跳
ClientAliveCountMax 2                  # 心跳无响应的最大次数
AllowUsers ops@192.168.1.*             # 限制用户和来源 IP (示例)
X11Forwarding no                       # 禁止 X11 转发
MaxSessions 5                          # 单连接最大会话数
LoginGraceTime 30                      # 认证超时
MaxStartups 10:30:60                   # 未认证连接限制

systemctl restart sshd

# === 2. 密码策略 ===
# /etc/login.defs
PASS_MAX_DAYS   90     # 密码 90 天过期
PASS_MIN_DAYS   7      # 修改后 7 天内不可再改
PASS_MIN_LEN    12     # 最小 12 位 (启用 pam_pwquality 时以其 minlen 为准)
PASS_WARN_AGE   14     # 过期前 14 天警告

# /etc/security/pwquality.conf
minlen = 12
dcredit = -1            # 至少 1 个数字
ucredit = -1            # 至少 1 个大写
lcredit = -1            # 至少 1 个小写
ocredit = -1            # 至少 1 个特殊字符
minclass = 4            # 至少 4 种字符类别
maxrepeat = 3           # 最多连续重复 3 次
maxclassrepeat = 3      # 同类字符最多连续 3 个
difok = 5               # 新密码与旧密码至少不同 5 个字符
enforce_for_root        # root 也适用
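minlen 与 minclass 的判定口径可以用下面的 shell 草图直观理解。仅为示意,并非 pam_pwquality 的真实实现,也未覆盖 maxrepeat/difok 等其余规则:

```shell
# 按 minlen=12, minclass=4 的口径粗略判定口令强度
check_pw() {
    local pw=$1 classes=0
    if [[ $pw =~ [0-9] ]]; then classes=$((classes+1)); fi         # 数字
    if [[ $pw =~ [a-z] ]]; then classes=$((classes+1)); fi         # 小写
    if [[ $pw =~ [A-Z] ]]; then classes=$((classes+1)); fi         # 大写
    if [[ $pw =~ [^a-zA-Z0-9] ]]; then classes=$((classes+1)); fi  # 特殊字符
    if [ ${#pw} -ge 12 ] && [ "$classes" -ge 4 ]; then
        echo PASS
    else
        echo FAIL
    fi
}
check_pw 'Weak123'         # FAIL: 长度不足
check_pw 'Str0ng!Passwd'   # PASS: 长度 13, 含 4 类字符
```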

# === 3. 账户锁定 ===
# /etc/pam.d/sshd
# 连续 5 次失败后锁定 600 秒 (RHEL 7 及更早; RHEL 8+ 已改用 pam_faillock)
auth required pam_tally2.so deny=5 unlock_time=600 onerr=fail audit
# RHEL 8+ 等价写法:
# auth required pam_faillock.so preauth deny=5 unlock_time=600 audit

# === 4. 会话超时 ===
echo "TMOUT=600" >> /etc/profile
echo "readonly TMOUT" >> /etc/profile
echo "export TMOUT" >> /etc/profile

# === 5. 历史命令限制 ===
echo 'HISTSIZE=500' >> /etc/profile
echo 'HISTFILESIZE=500' >> /etc/profile
echo "readonly HISTSIZE HISTFILESIZE" >> /etc/profile
echo 'export HISTTIMEFORMAT="%F %T "' >> /etc/profile

# === 6. 限制 su/sudo ===
# 仅 wheel 组可 su
# /etc/pam.d/su
auth required pam_wheel.so use_uid

# sudo 日志审计
# /etc/sudoers
Defaults logfile=/var/log/sudo.log
Defaults log_input,log_output     # 记录输入输出 (需 sudo 1.9+)
Defaults requiretty               # 必须有 tty

6.2 防火墙管理

firewalld (RHEL/CentOS 7+)

# === 基础操作 ===
systemctl start firewalld
systemctl enable firewalld
firewall-cmd --state

# 查看
firewall-cmd --list-all                    # 当前区域详情
firewall-cmd --get-default-zone            # 默认区域
firewall-cmd --get-active-zones            # 活动区域
firewall-cmd --list-services              # 已允许服务
firewall-cmd --list-ports                 # 已允许端口

# 规则管理
firewall-cmd --add-port=8080/tcp --permanent    # 永久开放端口
firewall-cmd --add-service=http --permanent      # 开放服务
firewall-cmd --add-rich-rule='rule family="ipv4" source address="192.168.1.0/24" port port="22" protocol="tcp" accept' --permanent  # 仅允许特定 IP 段访问 SSH
firewall-cmd --remove-port=8080/tcp --permanent  # 删除

# 重载
firewall-cmd --reload

# 区域切换
firewall-cmd --set-default-zone=dmz
firewall-cmd --change-interface=ens33 --zone=trusted --permanent

# === 生产级规则示例 ===
# 默认拒绝并只开放必要端口
firewall-cmd --set-default-zone=drop

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/8" port port="22" protocol="tcp" accept'
firewall-cmd --permanent --add-service=http
firewall-cmd --permanent --add-service=https
firewall-cmd --permanent --add-port=9100/tcp    # node_exporter
firewall-cmd --reload

# === 端口转发 (NAT) ===
firewall-cmd --permanent --add-masquerade
firewall-cmd --permanent --add-forward-port=port=80:proto=tcp:toport=8080:toaddr=192.168.1.100
firewall-cmd --reload

iptables (传统/通用)

# === 默认策略 ===
iptables -P INPUT DROP
iptables -P FORWARD DROP
iptables -P OUTPUT ACCEPT

# === 允许回环 ===
iptables -A INPUT -i lo -j ACCEPT

# === 允许已建立连接 ===
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# === 允许 SSH ===
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT

# === 允许 Web ===
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT

# === 允许 ICMP (ping) ===
iptables -A INPUT -p icmp --icmp-type echo-request -j ACCEPT

# === 防 DDoS ===
# 限制 SYN 包速率
iptables -A INPUT -p tcp --syn -m limit --limit 10/s --limit-burst 20 -j ACCEPT
iptables -A INPUT -p tcp --syn -j DROP

# 限制单个 IP 并发连接
iptables -A INPUT -p tcp --dport 80 -m connlimit --connlimit-above 50 -j DROP

# === 端口转发 ===
iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-destination 192.168.1.100:8080
iptables -t nat -A POSTROUTING -j MASQUERADE

# === 保存规则 ===
iptables-save > /etc/sysconfig/iptables       # CentOS 6
iptables-save > /etc/iptables/rules.v4        # Debian/Ubuntu
netfilter-persistent save                      # iptables-persistent

6.3 SELinux 管理

# === 状态查看 ===
getenforce                          # 查看模式
sestatus -v                         # 详细状态

# === 模式切换 ===
setenforce 0                        # 临时切换为 Permissive
setenforce 1                        # 启用 Enforcing

# 永久配置
# /etc/selinux/config
# SELINUX=enforcing|permissive|disabled

# === 上下文管理 ===
ls -Z /var/www/html/                # 查看文件上下文
ps -Z                               # 查看进程上下文

chcon -R -t httpd_sys_content_t /var/www/html/    # 修改上下文
restorecon -Rv /var/www/html/                      # 恢复默认上下文

semanage fcontext -a -t httpd_sys_content_t "/web(/.*)?"
restorecon -Rv /web

# === 布尔值管理 ===
getsebool -a                        # 列出所有布尔值
setsebool -P httpd_can_network_connect on   # 允许 Apache 网络连接
setsebool -P httpd_enable_homedirs on       # 允许用户目录

# === 端口管理 ===
semanage port -l                          # 列出所有端口
semanage port -a -t http_port_t -p tcp 8080  # 添加端口到类型

# === 审计日志排错 ===
ausearch -m avc -ts recent           # 查看最近的 AVC 拒绝
sealert -a /var/log/audit/audit.log  # 分析审计日志
audit2allow -a -M mypol              # 从审计日志生成策略模块
semodule -i mypol.pp                 # 安装自定义模块

# === SELinux 排错流程 ===
# 1. 查看审计日志
ausearch -m avc -ts today | grep denied
# 2. 分析并提供建议
audit2why < /var/log/audit/audit.log
# 3. 临时切换 permissive 排查
setenforce 0
# 4. 测试应用
# 5. 查看生成的 AVC
ausearch -m avc -ts recent
# 6. 创建自定义策略
grep denied /var/log/audit/audit.log | audit2allow -M custom_policy
semodule -i custom_policy.pp
# 7. 恢复 enforcing
setenforce 1

6.4 fail2ban (防暴力破解)

# 安装
yum install -y fail2ban      # CentOS
apt-get install -y fail2ban  # Ubuntu

# /etc/fail2ban/jail.local
[DEFAULT]
ignoreip = 127.0.0.1/8 10.0.0.0/8 192.168.0.0/16
bantime  = 3600                    # 封禁时间 (秒)
findtime = 600                     # 统计窗口 (秒)
maxretry = 5                       # 最大失败次数
destemail = ops@example.com
action = %(action_mw)s             # 封禁 + whois + 邮件
banaction = iptables-multiport

[sshd]
enabled  = true
port     = ssh,2222
logpath  = %(sshd_log)s
maxretry = 3

[nginx-http-auth]
enabled  = true
port     = http,https
logpath  = /var/log/nginx/error.log
maxretry = 5

[nginx-botsearch]
enabled  = true
port     = http,https
logpath  = /var/log/nginx/access.log
maxretry = 3
findtime = 300

[mysqld-auth]
enabled  = true
port     = 3306
logpath  = /var/log/mysql/error.log
maxretry = 5

# 管理命令
fail2ban-client status                       # 查看状态
fail2ban-client status sshd                  # 查看 sshd jail
fail2ban-client set sshd unbanip 1.2.3.4     # 手动解封
fail2ban-client set sshd banip 1.2.3.4       # 手动封禁
fail2ban-client reload                       # 重载配置

# 查看封禁日志
grep "Ban" /var/log/fail2ban.log
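排查封禁情况时,常需统计哪些 IP 被反复封禁。下面用内置样例日志演示统计管道,日志格式为简化示例:

```shell
# 统计 fail2ban 日志中各 IP 的封禁次数, 取封禁最多的 IP
log=$(mktemp)
cat > "$log" <<'EOF'
2024-01-01 10:00:00 fail2ban.actions: NOTICE [sshd] Ban 1.2.3.4
2024-01-01 10:05:00 fail2ban.actions: NOTICE [sshd] Ban 5.6.7.8
2024-01-01 10:06:00 fail2ban.actions: NOTICE [sshd] Ban 1.2.3.4
EOF
top_ip=$(grep ' Ban ' "$log" | awk '{print $NF}' | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')
echo "$top_ip"    # 1.2.3.4
rm -f "$log"
```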

6.5 系统审计 (auditd)

# 安装
yum install -y audit             # CentOS
apt-get install -y auditd        # Ubuntu

systemctl enable --now auditd

# === 审计规则 ===
# /etc/audit/rules.d/audit.rules

# 监控关键文件
-w /etc/passwd -p wa -k identity_changes
-w /etc/shadow -p wa -k identity_changes
-w /etc/group -p wa -k identity_changes
-w /etc/sudoers -p wa -k sudo_changes
-w /etc/ssh/sshd_config -p wa -k sshd_config
-w /etc/crontab -p wa -k cron_changes

# 监控关键命令执行
-a always,exit -F path=/usr/bin/su -F perm=x -k su_exec
-a always,exit -F path=/usr/bin/sudo -F perm=x -k sudo_exec

# 监控系统调用
-a always,exit -F arch=b64 -S execve -k command_exec

# 监控网络配置修改
-a always,exit -F path=/sbin/ifconfig -F perm=x -k net_config

# 监控时间修改
-a always,exit -F arch=b64 -S adjtimex -S settimeofday -k time_change
-a always,exit -F arch=b64 -S clock_settime -k time_change

# === 审计查询 ===
ausearch -k identity_changes            # 按 key 查询
ausearch -f /etc/passwd                 # 按文件查询
ausearch -p 1234                        # 按 PID 查询
ausearch -ua root                       # 按用户查询
ausearch -ts today                      # 今天的审计记录
ausearch -m USER_LOGIN                  # 登录事件

# 生成报告
aureport -l                             # 登录报告
aureport -k                             # key 汇总
aureport -f                             # 文件审计报告
aureport --summary                      # 摘要

# 搜索失败事件
ausearch -m USER_LOGIN --success no

6.6 安全扫描

# === Lynis (系统安全审计) ===
# 安装
git clone https://github.com/CISOfy/lynis
cd lynis && ./lynis audit system

# 快速审计
lynis audit system --quick

# === ClamAV (病毒扫描) ===
# 安装
yum install -y clamav clamav-update    # CentOS
apt-get install -y clamav              # Ubuntu

freshclam                     # 更新病毒库
clamscan -r /data             # 递归扫描
clamscan -r --remove /tmp     # 扫描并删除

# 定期扫描 crontab
# 0 2 * * 0 clamscan -r /data --log=/var/log/clamav/scan.log

# === AIDE (文件完整性检查) ===
yum install -y aide
aide --init                   # 初始化数据库
cp /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz
aide --check                  # 检查变更
aide --update                 # 更新基线数据库

# === OpenSCAP (合规检查) ===
yum install -y openscap-scanner scap-security-guide

# 检查系统合规性 (CIS 基线)
oscap xccdf eval \
    --profile xccdf_org.ssgproject.content_profile_cis \
    --results scan-results.xml \
    --report scan-report.html \
    /usr/share/xml/scap/ssg/content/ssg-rhel8-ds.xml

第七章:性能调优

7.1 内核参数调优

# === 核心内核参数 (/etc/sysctl.d/99-tuning.conf) ===

# ===== 网络调优 =====
# TCP 连接复用 (快速回收 TIME_WAIT)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 0          # NAT 环境有害; 内核 4.12+ 已移除该参数

# TIME_WAIT 与端口范围
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_max_tw_buckets = 10000
net.ipv4.ip_local_port_range = 1024 65000

# TCP 缓冲区 (高吞吐场景)
net.core.rmem_max = 134217728        # 128MB
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_mem = 50576 64768 98152

# TCP Fast Open
net.ipv4.tcp_fastopen = 3            # 客户端 + 服务端

# BBR 拥塞控制 (4.9+)
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr

# 连接队列
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 8192
net.core.netdev_max_backlog = 10000

# Keepalive
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3

# SYN Flood 防护
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_syn_retries = 3

# ===== 文件系统与 IO =====
# VM 参数
vm.swappiness = 1                    # 尽量不用 swap (SSD 推荐 1)
vm.dirty_ratio = 30
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 500
vm.vfs_cache_pressure = 50          # 保留更多 inode/dentry 缓存
vm.min_free_kbytes = 131072          # 128MB 最小空闲内存

# 文件描述符
fs.file-max = 655350
fs.nr_open = 1048576
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 1024
fs.aio-max-nr = 1048576

# ===== 内核调度 =====
kernel.pid_max = 4194303
kernel.threads-max = 256000
kernel.msgmax = 65536
kernel.msgmnb = 65536

# 应用配置
sysctl -p /etc/sysctl.d/99-tuning.conf
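上面缓冲区类参数的单位是字节,核对时可以做个简单换算;应用后也应回读确认实际生效值:

```shell
# 134217728 字节换算为 MB, 对应 rmem_max/wmem_max 注释中的 128MB
mb=$((134217728 / 1024 / 1024))
echo "$mb MB"    # 128 MB

# 回读当前生效值 (示例, 需在目标机器上执行)
# sysctl -n net.core.rmem_max
```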

7.2 资源限制

# /etc/security/limits.conf
# <domain> <type> <item> <value>
*       soft    nofile          65535
*       hard    nofile          65535
*       soft    nproc           65535
*       hard    nproc           65535
root    soft    nofile          65535
root    hard    nofile          65535
nginx   soft    nofile          100000
nginx   hard    nofile          100000
mysql   soft    nofile          100000
mysql   hard    nofile          100000

# systemd 服务资源限制
# /etc/systemd/system/myservice.service.d/limits.conf
[Service]
LimitNOFILE=65535
LimitNPROC=65535
LimitCORE=infinity
MemoryMax=2G            # cgroup v2; 旧版 systemd 用 MemoryLimit
CPUQuota=200%
TasksMax=2048

7.3 CPU 性能优化

# === CPU 调度策略 ===
# 查看当前策略
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 设为 performance (服务器推荐)
# CentOS/RHEL (cpupower 工具)
cpupower frequency-set -g performance

# 或直接
for CPU in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $CPU
done

# === CPU 亲和性 (IRQ 绑定) ===
# 查看中断
cat /proc/interrupts

# 将网卡中断绑定到特定 CPU
echo 2 > /proc/irq/89/smp_affinity    # 绑定到 CPU1

# 使用 irqbalance (自动平衡,一般开启即可)
systemctl enable --now irqbalance

# NUMA 感知
numactl --hardware                            # 查看 NUMA 拓扑
numactl --cpunodebind=0 --membind=0 nginx     # 绑定到 NUMA node 0

# === CPU 隔离 (实时/低延迟场景) ===
# /etc/default/grub
# GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"
# 然后: grub2-mkconfig -o /boot/grub2/grub.cfg

7.4 内存调优

# === 查看内存状况 ===
free -h
cat /proc/meminfo
vmstat 1 10

# === 查看进程内存详情 ===
# PSS (比例分摊共享内存)
smem -r -s pss

# 大页内存 (HugePages)
# 适合大内存数据库
echo "vm.nr_hugepages = 1024" >> /etc/sysctl.d/99-hugepages.conf
sysctl -p /etc/sysctl.d/99-hugepages.conf

# 查看大页使用
cat /proc/meminfo | grep Huge

# transparent hugepage (数据库通常建议关闭)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# 注意: 以上写法重启后失效, 需通过 tuned profile 或 rc.local 持久化

# === OOM Killer 控制 ===
# 保护关键进程
echo -1000 > /proc/$(pgrep sshd)/oom_score_adj   # 永不 kill (范围 -1000 ~ 1000)

# /etc/systemd/system/mysqld.service.d/oom.conf
[Service]
OOMScoreAdjust=-800

# === 内存泄漏排查 ===
# 监控进程内存增长
while true; do
    ps -eo pid,ppid,cmd,%mem,%cpu,rss --sort=-rss | head -20
    sleep 10
done

7.5 磁盘 IO 调优

# === IO 调度器 ===
# 查看当前调度器
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none

# 设置 (SSD 推荐 none / mq-deadline)
echo none > /sys/block/sda/queue/scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler

# 永久设置 (grub; elevator= 仅适用旧的单队列内核, 5.0+ 多队列内核改用 udev 规则)
# GRUB_CMDLINE_LINUX="elevator=noop"

# === 磁盘队列深度 ===
cat /sys/block/sda/queue/nr_requests
echo 1024 > /sys/block/sda/queue/nr_requests

# === 预读大小 ===
blockdev --getra /dev/sda
blockdev --setra 8192 /dev/sda   # 设为 4MB (8192 个扇区)
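预读值以 512 字节扇区为单位,注释中的 4MB 换算关系如下:

```shell
# 8192 个 512 字节扇区换算为 MB
mb=$((8192 * 512 / 1024 / 1024))
echo "$mb MB"    # 4 MB
```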

# === 文件系统挂载选项 ===
# SSD 优化
# /etc/fstab
UUID=xxx  /data  ext4  defaults,noatime,nodiratime,discard  0  0

# noatime     : 不记录访问时间
# nodiratime  : 不记录目录访问时间
# discard     : 启用 TRIM (或使用 fstrim)
# nobarrier   : 关闭写屏障 (有电池的 RAID 卡)

# === fstrim (SSD TRIM) ===
fstrim -v /data               # 手动 TRIM
systemctl enable fstrim.timer # 启动定时 TRIM

# === IO 性能测试 ===
fio --name=randwrite --ioengine=libaio --iodepth=32 --rw=randwrite \
    --bs=4k --size=2G --numjobs=4 --runtime=60 --group_reporting \
    --filename=/data/test --direct=1

fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread \
    --bs=4k --size=2G --numjobs=4 --runtime=60 --group_reporting \
    --filename=/data/test --direct=1

# === iotop (实时 IO 监控) ===
iotop -o           # 仅显示有 IO 的进程
iotop -oP          # 进程级别
iotop -b -n 3      # 批处理模式,3 次

7.6 性能分析工具速查

| 工具 | 用途 | 典型命令 |
|------|------|----------|
| top/htop | 进程监控 | htop -u mysql |
| vmstat | 内存/IO/CPU | vmstat 1 10 |
| iostat | 磁盘 IO | iostat -xz 1 |
| sar | 系统活动报告 | sar -n DEV 1 |
| mpstat | CPU 统计 | mpstat -P ALL 1 |
| pidstat | 进程性能 | pidstat -d 1 |
| perf | 性能采样 | perf top -g |
| strace | 系统调用追踪 | strace -c -p PID |
| ltrace | 库调用追踪 | ltrace -p PID |
| bpftrace | 动态追踪 | bpftrace -e 'kprobe:vfs_read { @[comm]=count(); }' |
| dstat | 综合系统资源 | dstat -tcmdns |
| nethogs | 进程网络流量 | nethogs ens33 |
| iperf3 | 网络带宽测试 | iperf3 -s / iperf3 -c host |

第八章:高可用与负载均衡

8.1 Keepalived (VRRP 高可用)

8.1.1 原理

VIP: 192.168.1.100 (虚拟 IP)

┌─────────────────┐     ┌─────────────────┐
│    Master       │     │    Backup       │
│  192.168.1.11   │────▶│  192.168.1.12   │
│  priority=100   │ VRRP│  priority=90    │
└─────────────────┘     └─────────────────┘
         │                       │
         └───────────┬───────────┘
                     │
              ┌──────┴──────┐
              │  后端服务   │
              │192.168.1.20 │
              └─────────────┘

8.1.2 Keepalived 配置

yum install -y keepalived    # CentOS
apt-get install -y keepalived  # Ubuntu

# /etc/keepalived/keepalived.conf (Master)
global_defs {
    router_id web_lb_01
    # 通知脚本
    notification_email {
        ops@example.com
    }
    notification_email_from keepalived@example.com
    smtp_server smtp.example.com
    smtp_connect_timeout 30
}

# 健康检查脚本
vrrp_script chk_nginx {
    script "/usr/bin/killall -0 nginx"    # 检查 nginx 进程
    interval 2
    weight -20
    fall 3          # 连续 3 次失败触发切换
    rise 2          # 连续 2 次成功恢复
}

vrrp_instance VI_1 {
    state MASTER
    interface ens33
    virtual_router_id 51
    priority 100
    advert_int 1
    nopreempt          # 不抢占 (故障恢复后不自动切回; 两端均配 state BACKUP 才生效)

    authentication {
        auth_type PASS
        auth_pass your_password
    }

    virtual_ipaddress {
        192.168.1.100/24 dev ens33
    }

    track_script {
        chk_nginx        # 关联健康检查
    }

    # 状态切换通知
    notify_master "/opt/scripts/notify.sh master"
    notify_backup "/opt/scripts/notify.sh backup"
    notify_fault  "/opt/scripts/notify.sh fault"
}

# /etc/keepalived/keepalived.conf (Backup)
global_defs {
    router_id web_lb_02
}

vrrp_script chk_nginx {
    script "/usr/bin/killall -0 nginx"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens33
    virtual_router_id 51
    priority 90
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass your_password
    }

    virtual_ipaddress {
        192.168.1.100/24 dev ens33
    }

    track_script {
        chk_nginx
    }
}

# 允许非本地 IP 绑定 (使得 Backup 也能绑定 VIP)
# echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.d/99-keepalived.conf
# sysctl -p /etc/sysctl.d/99-keepalived.conf

8.2 HAProxy 负载均衡

# 安装
yum install -y haproxy    # CentOS
apt-get install -y haproxy  # Ubuntu

# === 完整配置 ===
# /etc/haproxy/haproxy.cfg

global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    user haproxy
    group haproxy
    daemon
    maxconn 50000
    spread-checks 5
    stats socket /var/run/haproxy.sock mode 600 level admin
    stats timeout 2m
    tune.ssl.default-dh-param 2048

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    option  redispatch
    retries 3
    timeout connect 5s
    timeout client  50s
    timeout server  50s
    timeout http-request 10s
    timeout http-keep-alive 10s
    timeout check 5s
    maxconn 5000

# === 前端 ===
frontend web_frontend
    bind *:80
    bind *:443 ssl crt /etc/haproxy/certs/combined.pem alpn h2,http/1.1
    # HTTP 重定向到 HTTPS
    redirect scheme https if !{ ssl_fc }

    # ACL
    acl is_api path_beg /api
    acl is_admin path_beg /admin
    acl is_static path_end .jpg .png .css .js .woff2
    acl blocked_ua hdr_sub(User-Agent) -i curl wget

    # 按路径路由
    use_backend api_backend if is_api
    use_backend admin_backend if is_admin
    use_backend static_backend if is_static
    default_backend web_backend

    # 拒绝特定 User-Agent
    http-request deny if blocked_ua

    # 限速 (每 IP 每秒 100 请求)
    stick-table type ip size 1m expire 10s store http_req_rate(10s)
    http-request track-sc0 src
    http-request deny if { sc_http_req_rate(0) gt 100 }

# === 后端 ===
backend web_backend
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    default-server inter 3s rise 2 fall 3 maxconn 1000

    server web01 192.168.1.11:8080 check weight 100
    server web02 192.168.1.12:8080 check weight 100
    server web03 192.168.1.13:8080 check weight 100 backup  # 备用节点

    # Cookie 会话保持
    cookie SERVERID insert indirect nocache

    # 长连接
    option http-keep-alive

backend api_backend
    balance leastconn
    option httpchk GET /api/health
    http-check expect status 200
    default-server inter 2s rise 2 fall 2

    server api01 192.168.1.11:8081 check
    server api02 192.168.1.12:8081 check

backend static_backend
    balance uri
    option httpchk HEAD /health

    server static01 192.168.1.11:8082 check
    server static02 192.168.1.12:8082 check

# === TCP 模式 (MySQL 代理) ===
listen mysql_proxy
    bind *:3307
    mode tcp
    balance leastconn
    option mysql-check user haproxy_check

    server db01 192.168.1.21:3306 check inter 3s
    server db02 192.168.1.22:3306 check inter 3s backup

# === 统计页面 ===
listen stats
    bind *:9000
    mode http
    stats enable
    stats uri /stats
    stats realm HAProxy\ Statistics
    stats auth admin:your_password
    stats refresh 10s
    stats admin if TRUE

负载均衡算法对比

| 算法 | 适用场景 | 说明 |
|------|----------|------|
| roundrobin | 通用 Web | 轮询,权重越大分配越多 |
| leastconn | 长连接 (DB/WebSocket) | 最少连接优先 |
| source | 需要会话保持 | 源 IP 哈希 |
| uri | 静态文件/Cache | URI 哈希 (配合缓存) |
| url_param | 带参数路由 | URL 参数哈希 |
| hdr | HTTP 头路由 | 基于 HTTP Header |
| first | 最小连接组 | 第一台可用 |

8.3 LVS (Linux Virtual Server)

# === DR 模式 (Direct Routing, 性能最高) ===
# Director 配置 (192.168.1.10)
ipvsadm -A -t 192.168.1.100:80 -s wrr
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.11 -g -w 100
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.12 -g -w 100
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.13 -g -w 80

ipvsadm -Ln    # 查看规则
ipvsadm -Sn > /etc/sysconfig/ipvsadm   # 导出规则 (供重启后恢复)

# Real Server 配置 (每台)
ifconfig lo:0 192.168.1.100 netmask 255.255.255.255 up
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce

# === 调度算法 ===
# rr     : 轮询
# wrr    : 加权轮询
# lc     : 最少连接
# wlc    : 加权最少连接 (默认)
# lblc   : 基于局部性最少连接
# dh     : 目标哈希
# sh     : 源地址哈希

8.4 高可用方案对比

| 方案 | 层次 | 性能 | 复杂度 | 适用场景 |
|------|------|------|--------|----------|
| Keepalived + Nginx | L3/L7 | 高 | 低 | Web 服务 |
| Keepalived + HAProxy | L3/L7 | 高 | 中 | 通用 4/7 层 |
| LVS + Keepalived | L4 | 极高 | 高 | 大规模流量入口 |
| Nginx + Nginx | L7 | 中 | 低 | 中小 Web |
| 云 LB (SLB/ELB) | L4/L7 | 极高 | 低 | 云环境 |
| DNS 轮询 | DNS | 低 | 极低 | 简单分发 |

第九章:网络诊断与排错

9.1 网络诊断方法论

应用层 → 检查服务状态、端口监听、应用日志
传输层 → 检查端口连通性、防火墙规则、连接状态
网络层 → 检查路由、IP 配置、ICMP 可达性
链路层 → 检查 ARP、网卡状态、交换机端口
物理层 → 检查网线、光模块、网卡灯

排错顺序 (自底向上)

  1. 物理链路 (ethtool, ip link)
  2. IP 配置 (ip addr, ip route)
  3. 网关可达性 (ping 网关, traceroute)
  4. DNS 解析 (dig, nslookup)
  5. 端口连通性 (telnet, nc, nmap)
  6. 服务状态 (ss, netstat, 应用日志)

9.2 tcpdump 抓包分析

# === 基础抓包 ===
tcpdump -i any -nn                           # 所有接口,不解析主机名和端口
tcpdump -i ens33 -nn host 192.168.1.100      # 过滤主机
tcpdump -i ens33 -nn port 80                 # 过滤端口
tcpdump -i ens33 -nn src 192.168.1.100       # 源地址
tcpdump -i ens33 -nn dst port 443            # 目标端口
tcpdump -i ens33 -nn tcp                     # 仅 TCP

# === 组合过滤 ===
tcpdump -i ens33 -nn \
  '(host 192.168.1.100 and port 80) or (host 192.168.1.200 and port 443)'

# === 实战场景 ===
# 抓取 HTTP 请求
tcpdump -i ens33 -A -s 0 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)<<2)) - ((tcp[12]&0xf0)>>2)) != 0)'

# 抓取特定 HTTP 方法
tcpdump -i ens33 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'   # GET
tcpdump -i ens33 -s 0 -A 'tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x504f5354'   # POST

# 抓取 DNS 查询
tcpdump -i ens33 -nn port 53

# 抓取特定标志位
tcpdump -i ens33 -nn 'tcp[tcpflags] & (tcp-syn|tcp-fin) != 0'  # SYN 或 FIN
tcpdump -i ens33 -nn 'tcp[tcpflags] & tcp-rst != 0'             # RST 包

# 保存到文件
tcpdump -i ens33 -w /tmp/capture.pcap -s 0 host 192.168.1.100
tcpdump -r /tmp/capture.pcap -nn          # 读取 pcap 文件

# 限制抓包数量
tcpdump -i ens33 -nn -c 100               # 抓 100 个包后停止

TCP 状态分析

# 三次握手问题分析
tcpdump -i ens33 -nn 'tcp[tcpflags] & tcp-syn != 0 and tcp[tcpflags] & tcp-ack == 0'

# 大量 RST → 端口未监听 / 防火墙 REJECT
tcpdump -i ens33 -nn 'tcp[tcpflags] & tcp-rst != 0'

# 重传统计 (tcpdump 过滤器不便直接识别重传, 改看协议栈计数器)
netstat -s | grep -i retrans

9.3 网络故障排查工具

# === 连通性测试 ===
ping -c 4 -i 0.2 192.168.1.1              # 快速 ping
ping -M do -s 1472 192.168.1.1             # 测试 MTU (禁止分片)
mtr -r -c 10 192.168.1.1                   # 路由追踪 + 统计

# === 路由诊断 ===
ip route get 8.8.8.8                       # 查看到目标的实际路由
traceroute -n 8.8.8.8                      # 路由追踪
tracepath 8.8.8.8                          # MTU 发现 + 路由追踪

# === DNS 诊断 ===
dig +short example.com                     # 简洁输出
dig example.com ANY                        # 所有记录
dig @8.8.8.8 example.com                  # 指定 DNS 服务器
dig -x 8.8.8.8                             # 反向解析
nslookup example.com                       # 交互式查询

# === 端口检测 ===
nc -zv 192.168.1.100 80                   # TCP 端口扫描
nc -zuv 192.168.1.100 53                  # UDP 端口扫描
timeout 3 bash -c '</dev/tcp/192.168.1.100/80 && echo OPEN || echo CLOSED'

# === 扫描工具 ===
nmap -sS 192.168.1.0/24                   # SYN 半连接扫描
nmap -sT -p 1-65535 192.168.1.100         # 全端口扫描
nmap -sV -p 80,443 192.168.1.100          # 服务版本探测
nmap -A 192.168.1.100                      # 综合扫描 (OS + 服务 + 脚本)

# === 连接状态分析 ===
ss -s                                       # 连接统计摘要
ss -tapn                                    # 所有 TCP 连接
ss -tlnp                                    # 监听端口
ss -tan state time-wait                     # TIME_WAIT 状态
ss -tan state established                    # 已建立连接

# 统计各状态连接数
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn

# 统计每个 IP 的连接数
ss -tan | awk 'NR>1{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -20
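上面的统计管道可以先用一段样例数据验证效果 (以内置文本代替 ss -tan 的真实输出):

```shell
# 用样例数据演示按状态统计连接数的管道, 取数量最多的状态
sample='State   Recv-Q  Send-Q
ESTAB   0       0
ESTAB   0       0
TIME-WAIT 0     0
LISTEN  0       0'
top_state=$(echo "$sample" | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn | head -1 | awk '{print $1,$2}')
echo "$top_state"    # 2 ESTAB
```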

9.4 带宽与延迟分析

# === iperf3 带宽测试 ===
iperf3 -s                                    # 服务端
iperf3 -c server_ip -t 30 -P 4              # 4 并发, 30 秒
iperf3 -c server_ip -R                        # 反向 (下载)
iperf3 -c server_ip -u -b 100M               # UDP 100Mbps

# === 网卡统计 ===
ethtool -S ens33                              # 网卡详细统计
ethtool ens33                                 # 网卡设置
ethtool -g ens33                              # Ring buffer 大小

# === 实时流量 ===
iftop -i ens33                                # 实时带宽
nload ens33                                   # 实时流量图

# === HTTP 延迟分析 ===
curl -w "time_namelookup: %{time_namelookup}\ntime_connect: %{time_connect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" -o /dev/null -s https://example.com

9.5 常见网络问题速查

| 现象 | 可能原因 | 检查命令 |
|------|----------|----------|
| ping 通但端口不通 | 防火墙/服务未启动 | ss -tlnp, firewall-cmd --list-ports |
| 间歇性丢包 | 网卡/交换机/带宽饱和 | ethtool -S \| grep drop, netstat -s |
| TCP 连接大量 TIME_WAIT | 短连接过多 | ss -tan state time-wait \| wc -l |
| DNS 解析慢 | DNS 服务器问题 | dig +stats |
| SSH 连接慢 | DNS 反向解析 | /etc/ssh/sshd_config: UseDNS no |
| 大文件传输慢 | MTU/TCP 窗口 | tracepath, 调整 tcp_rmem/wmem |
| 大量 SYN_RECV | SYN Flood / backlog 不够 | ss -tan state syn-recv, tcp_max_syn_backlog |
| curl 卡住不动 | 防火墙 DROP (无 RST) | tcpdump 确认是否收到 SYN-ACK |

第十章:故障排查实战

10.1 CPU 飙升排查

# 1. 确认高 CPU 进程
top -bn1 -o %CPU | head -20

# 2. 查看进程中的高 CPU 线程
top -H -p <PID>

# 3. 线程 ID 转十六进制 (用于 Java 线程 dump)
printf "%x\n" <TID>

# 4. 查看系统调用
strace -c -p <PID>                  # 统计系统调用耗时
strace -p <PID> -T                  # 显示每个调用耗时

# 5. perf 分析
perf top -g -p <PID>                # 实时采样
perf record -g -p <PID> -- sleep 30 # 记录 30 秒
perf report                          # 查看报告

# 6. Java 应用
jstack <PID>                        # 线程 dump
jstack <PID> | grep -A 20 "0x$(printf "%x" <TID>)"
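jstack 定位高 CPU 线程需要把十进制 TID 转成 nid 字段的十六进制形式, 可以封装成小函数 (tid2hex 为示例命名):

```shell
# tid2hex: 十进制线程 ID → jstack 输出中 nid 使用的十六进制形式
tid2hex() { printf '0x%x\n' "$1"; }

# 示例: 在线程 dump 中定位该线程
# jstack <PID> | grep -A 20 "nid=$(tid2hex <TID>)"
```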

10.2 内存问题排查

# 1. 概览
free -h
cat /proc/meminfo

# 2. 进程内存排序
ps aux --sort=-%mem | head -20
ps -eo pid,ppid,cmd,%mem,%cpu,rss --sort=-rss | head -20

# 3. 进程内存详情
awk '/^Pss:/ {sum += $2} END {print sum/1024 " MB"}' /proc/<PID>/smaps   # Pss: 按共享比例分摊的实际占用
cat /proc/<PID>/status | grep -E "Vm|Threads"

# 4. 检查是否有内存泄漏
for i in {1..10}; do
    cat /proc/<PID>/status | grep VmRSS
    sleep 5
done
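上面的采样循环可以配一个简单的增长判断: 比较首末两次 VmRSS, 超过阈值即提示疑似泄漏 (rss_growth 为示例函数, 50 MB 阈值仅作演示, 实际需结合业务负载判断):

```shell
# rss_growth: 比较两次采样的 VmRSS (单位 kB), 输出增量并判断是否疑似泄漏
# 参数: 起始值 结束值 [阈值kB, 默认 51200 即 50MB]
rss_growth() {
    local start="$1" end="$2" limit="${3:-51200}"
    local delta=$(( end - start ))
    if [ "$delta" -gt "$limit" ]; then
        echo "RSS 增长 ${delta} kB, 疑似泄漏"
        return 1
    fi
    echo "RSS 增长 ${delta} kB, 正常范围"
}

# 用法: 采样循环前后各取一次 VmRSS 数值列, 传入比较
# rss_growth "$first_rss" "$last_rss"
```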

# 5. slab 内存 (内核)
slabtop -s c

# 6. 查看 OOM 历史
dmesg | grep -i "out of memory"
grep -i "killed process" /var/log/messages
journalctl -k | grep -i oom

# 7. OOM Killer 保护关键进程
for pid in $(pgrep sshd); do echo -1000 > /proc/$pid/oom_score_adj; done   # pgrep 可能返回多个 PID

10.3 磁盘空间问题

# === 磁盘满排查流程 ===

# 1. 确认磁盘使用
df -h

# 2. 哪个目录占用大
du -sh /* 2>/dev/null | sort -rh | head -20
du -sh /var/* 2>/dev/null | sort -rh | head -10

# 3. 大文件查找
find / -type f -size +500M -exec ls -lh {} \; 2>/dev/null
find / -type f -size +1G 2>/dev/null

# 4. 已删除但未释放的文件 (进程仍持有)
lsof | grep deleted | awk '{print $1,$2,$7}' | sort -u
lsof +L1 | grep deleted
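找到"已删除但未释放"的文件后, 可以不重启进程, 直接通过 /proc/<PID>/fd/<FD> 截断该文件回收空间 (PID 和 FD 取自上面 lsof 的输出)。下面用当前 shell 自证这一机制 (仅适用 Linux + bash):

```shell
#!/bin/bash
# 演示: 进程仍持有已删除文件时, 截断 /proc/<PID>/fd/<FD> 即可在线释放空间
reclaim_demo() {
    local tmp
    tmp=$(mktemp)
    exec 3>"$tmp"                       # 以 fd 3 打开文件
    head -c 1048576 /dev/zero >&3       # 写入 1 MB
    rm -f "$tmp"                        # 删除文件名, 空间仍被 fd 3 占用
    : > "/proc/$BASHPID/fd/3"           # 截断已删除文件, 空间立即归还
    stat -Lc %s "/proc/$BASHPID/fd/3"   # 打印截断后的大小
    exec 3>&-                           # 关闭 fd
}
reclaim_demo
```

实战中对应的操作就是 `: > /proc/<PID>/fd/<FD>`, 比重启业务进程代价小得多。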

# 5. inode 耗尽检查
df -i
find / -xdev -type f | cut -d/ -f2 | sort | uniq -c | sort -rn | head -20

# 6. 快速清理
journalctl --vacuum-size=500M
find /var/log -type f -name "*.log" -mtime +30 -delete
yum clean all || apt-get clean
find /tmp -type f -mtime +7 -delete
docker system prune -af 2>/dev/null

10.4 服务无法启动排查

# === 排查流程 ===

# 1. 查看服务日志
journalctl -u <service> -n 100 --no-pager
systemctl status <service> -l

# 2. 查看系统日志
tail -n 200 /var/log/messages
dmesg | tail -50

# 3. 检查端口冲突
ss -tlnp | grep <PORT>

# 4. 检查文件权限
ls -la /path/to/app/
ls -laZ /path/to/app/            # SELinux 上下文

# 5. 检查依赖
ldd /path/to/binary              # 库依赖

# 6. 手动启动排查
sudo -u <user> /path/to/binary   # 看报错信息

# 7. SELinux 排查
ausearch -m avc -ts recent
grep denied /var/log/audit/audit.log | tail -20
setenforce 0            # 临时禁用测试
# 测试后恢复
setenforce 1

# 8. 资源限制检查
cat /proc/<PID>/limits
ulimit -a

10.5 网站访问慢排查

# 1. DNS 解析
dig +stats example.com

# 2. TCP 连接
time nc -zv example.com 443

# 3. HTTP 全链路耗时
curl -w "time_namelookup: %{time_namelookup}\ntime_connect: %{time_connect}\ntime_appconnect: %{time_appconnect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" -o /dev/null -s https://example.com

# 4. SSL 握手
echo | openssl s_client -connect example.com:443 -servername example.com 2>&1 | grep -E "Verify|time|session"

# 5. 后端耗时分析
tail -1000 /var/log/nginx/access.log | awk '{print $NF}' | sort -rn | head -20
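只看最慢的 20 条不够直观, 可以对耗时列算分位数 (最近秩 nearest-rank 法; pct_latency 为示例函数, 假设 log_format 末尾字段是 $request_time):

```shell
# pct_latency: 对标准输入的一列耗时数值计算 P50/P95/P99
pct_latency() {
    sort -n | awk '
        function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
        { a[NR] = $1 }
        END {
            if (NR == 0) exit 1
            printf "P50=%s P95=%s P99=%s\n", a[ceil(NR*0.50)], a[ceil(NR*0.95)], a[ceil(NR*0.99)]
        }'
}

# 用法: awk '{print $NF}' /var/log/nginx/access.log | pct_latency
```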

# 6. 数据库慢查询
mysql -e "SHOW FULL PROCESSLIST;"
psql -c "SELECT pid, now() - query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;"

# 7. 系统资源瓶颈
top -bn1 | head -5
iostat -xz 1 5
sar -n DEV 1 5

10.6 应急诊断脚本

#!/bin/bash
# emergency_diag.sh - 应急诊断,收集关键信息
# 用法: bash emergency_diag.sh > diag_$(date +%Y%m%d_%H%M).txt

echo "========== 诊断开始: $(date) =========="
echo "主机: $(hostname)"
echo

echo "=== 系统负载 ==="
uptime
echo

echo "=== CPU TOP 10 ==="
ps aux --sort=-%cpu | head -11
echo

echo "=== 内存 TOP 10 ==="
ps aux --sort=-%mem | head -11
echo

echo "=== 内存概览 ==="
free -h
echo

echo "=== 磁盘使用 ==="
df -h
echo

echo "=== inode 使用 ==="
df -i
echo

echo "=== IO 统计 ==="
iostat -xz 1 3
echo

echo "=== 网络监听 ==="
ss -tlnp
echo

echo "=== 连接统计 ==="
ss -s
echo

echo "=== TIME_WAIT 数量 ==="
ss -tan state time-wait | wc -l
echo

echo "=== 各状态连接数 ==="
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn
echo

echo "=== 最近系统日志 (error) ==="
journalctl -p err -n 50 --no-pager
echo

echo "=== 内核日志 (最近) ==="
dmesg | tail -30
echo

echo "=== OOM 记录 ==="
dmesg | grep -i "out of memory"
echo

echo "===== 诊断结束: $(date) ====="

第十一章:容器化运维

11.1 Docker 运维要点

11.1.1 Docker 资源限制

# === 内存限制 ===
docker run -d --memory="512m" --memory-swap="1g" nginx

# === CPU 限制 ===
docker run -d --cpus="1.5" --cpu-shares=512 nginx

# === 限制验证 ===
docker stats <container>
docker inspect <container> | jq '.[0].HostConfig.Memory'
# Docker Compose 资源限制
services:
  app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

11.1.2 Docker 运维命令速查

# === 清理 ===
docker system df                              # 磁盘使用
docker system prune -af --volumes             # 清理所有未使用资源
docker builder prune -a -f                    # 清理构建缓存

# === 日志 ===
docker logs --tail 100 -f <container>
docker logs --since 10m -f <container>

# 限制日志大小 (/etc/docker/daemon.json)
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
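daemon.json 写错会导致 dockerd 无法启动, 修改后建议先做语法校验再重启 (check_daemon_json 为示例函数, 依赖 python3; 装有 jq 时也可用 jq . 校验):

```shell
# check_daemon_json: 校验 JSON 文件语法, 通过才允许重启 dockerd
check_daemon_json() {
    if python3 -m json.tool "$1" > /dev/null 2>&1; then
        echo "OK"
    else
        echo "INVALID"
        return 1
    fi
}

# 用法: check_daemon_json /etc/docker/daemon.json && systemctl restart docker
```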

# === 调试 ===
docker exec -it <container> sh
docker inspect <container> | jq .
docker stats --no-stream
docker cp <container>:/path/file ./local/path

# === 导出/导入 ===
docker export <container> -o container.tar
docker save <image> -o image.tar
docker load -i image.tar

11.1.3 Docker 生产环境 daemon.json

{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" },
  "storage-driver": "overlay2",
  "registry-mirrors": ["https://mirror.ccs.tencentyun.com"],
  "max-concurrent-downloads": 10,
  "max-concurrent-uploads": 5,
  "live-restore": true,
  "userland-proxy": false,
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Hard": 65535, "Soft": 65535 }
  },
  "oom-score-adjust": -500
}

11.2 Docker 故障排查

# 容器反复重启
docker logs --tail 50 <container>
docker inspect <container> --format '{{.State.OOMKilled}}'

# 检查退出码
docker inspect <container> --format '{{.State.ExitCode}}'
# 0: 正常退出, 137: SIGKILL(OOM/手动), 143: SIGTERM
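退出码的解读可以固化成速查函数 (exit_code_hint 为示例命名; 大于 128 的退出码按 "128 + 信号编号" 解读):

```shell
# exit_code_hint: 解读容器退出码
exit_code_hint() {
    case "$1" in
        0)   echo "正常退出" ;;
        125) echo "docker run 本身失败" ;;
        126) echo "命令不可执行" ;;
        127) echo "命令不存在" ;;
        137) echo "SIGKILL (OOM 或 docker kill)" ;;
        143) echo "SIGTERM (docker stop)" ;;
        *)   [ "$1" -gt 128 ] 2>/dev/null \
                 && echo "被信号 $(( $1 - 128 )) 终止" \
                 || echo "应用自定义退出码" ;;
    esac
}

# 用法: exit_code_hint "$(docker inspect <container> --format '{{.State.ExitCode}}')"
```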

# Docker 服务问题
journalctl -u docker -n 100
docker system df -v

# 清理构建缓存
docker builder prune --all --force --keep-storage 10GB

11.3 K8s 运维速查

(详细内容参见 Kubernetes-使用手册.md)

# === 节点管理 ===
kubectl get nodes -o wide
kubectl describe node <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl cordon <node>
kubectl uncordon <node>

# === Pod 调试 ===
kubectl describe pod <pod>
kubectl logs -f <pod> --tail=100
kubectl logs -f <pod> --previous
kubectl exec -it <pod> -- sh
kubectl debug -it <pod> --image=busybox --target=<container>

# === 资源使用 ===
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory

# === 事件 ===
kubectl get events -A --sort-by='.lastTimestamp'
kubectl get events -A --field-selector type=Warning

# === etcd 备份 ===
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

第十二章:信创系统运维

12.1 信创操作系统概览

| 系统 | 基础 | 包管理 | 内核版本 | 适用场景 |
|------|------|--------|----------|----------|
| 麒麟 V10 | openEuler/Debian | dpkg/rpm | 4.19/5.10 | 党政/国防 |
| 统信 UOS | Deepin/Debian | dpkg | 5.10 | 党政/企业桌面 |
| openEuler | 独立(华为) | rpm(dnf) | 5.10/6.6 | 服务器/云计算 |
| Anolis OS | CentOS 兼容 | rpm(dnf) | 5.10 | 服务器替代 CentOS |
| TencentOS | CentOS 兼容 | rpm(yum) | 5.4 | 腾讯云 |
| openSUSE 龙架构 | openSUSE | rpm(zypper) | 6.x | 龙芯平台 |

12.2 麒麟 V10 运维

# 版本查看
cat /etc/kylin-release
cat /proc/version

# 包管理 (SP1 基于 Debian, SP2/SP3 基于 openEuler)
# Debian 系列
apt-get update && apt-get install -y <package>

# openEuler 系列
dnf install -y <package>

# 安全策略 (默认启用安全加固)
getenforce           # SELinux 状态
aa-status            # AppArmor (Debian 系列)

# 国内源配置
# /etc/apt/sources.list (Debian 系列)
deb http://archive.kylinos.cn/kylin/KYLIN-ALL 10.1 main restricted universe multiverse

12.3 统信 UOS 运维

# 版本信息
cat /etc/os-version
cat /etc/deepin-version

# 包管理 (基于 Debian)
apt-get update && apt-get install -y <package>

# 开发者模式 (安装未经签名的包)
# 设置 → 通用 → 开发者模式

# 与标准 Debian 的主要区别
# 1. 内置安全加固 (安全中心)
# 2. 默认使用 Deepin 桌面
# 3. 部分包名不同 (deepin-terminal 替代 gnome-terminal)
# 4. 应用商店仅包含适配的国产软件

12.4 openEuler 运维

# 版本信息
cat /etc/openEuler-release

# 包管理 (dnf)
dnf makecache
dnf install -y <package>
dnf groupinstall -y "Development Tools"

# A-Tune (智能性能调优)
dnf install -y atune atune-engine
atune-adm list       # 查看优化模板
atune-adm analyze    # 系统分析

# iSulad (轻量容器引擎, Docker 替代)
dnf install -y iSulad
systemctl enable --now isulad
isula run -d nginx

# 内核特性 (默认启用 BBR)
sysctl net.ipv4.tcp_congestion_control

12.5 Anolis OS 运维 (CentOS 迁移)

# === 从 CentOS 8 迁移到 Anolis OS ===

# 1. 备份
cp -r /etc/yum.repos.d /etc/yum.repos.d.bak

# 2. 安装迁移工具
wget https://mirrors.openanolis.cn/anolis/migration/anolis-migration.repo -O /etc/yum.repos.d/anolis-migration.repo
yum install -y anolis-migration

# 3. 执行迁移
anolis-migration --os-release 8

# 4. 重启并验证
reboot
cat /etc/anolis-release
uname -r

# Anolis OS 8.x 保持与 CentOS 8 完全兼容
# yum/dnf 源已替换为 openanolis 源,业务无需修改即可运行

12.6 信创系统通用运维注意事项

# 1. 架构差异 (ARM64/LoongArch)
uname -m
# x86_64 / aarch64 / loongarch64

# 编译软件时指定架构
./configure --host=aarch64-linux-gnu    # 交叉编译时用 --host 指定目标架构

# 2. 包名差异
dnf search <keyword> || apt-cache search <keyword>
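跨发行版脚本可按 /etc/os-release 的 ID/ID_LIKE 推断包管理器 (pkg_mgr_for 为示例函数, 映射并不穷举; 麒麟 V10 不同 SP 基座不同, 需依赖 ID_LIKE 区分):

```shell
# pkg_mgr_for: 按 /etc/os-release 的 ID (参数1) / ID_LIKE (参数2) 推断包管理器
pkg_mgr_for() {
    local id="$1" like="$2"
    case "$id $like" in
        *debian*|*uos*|*deepin*|*ubuntu*)                           echo "apt-get" ;;
        *openEuler*|*openeuler*|*anolis*|*rhel*|*centos*|*fedora*)  echo "dnf" ;;
        *suse*)                                                     echo "zypper" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# 用法: . /etc/os-release; mgr=$(pkg_mgr_for "$ID" "${ID_LIKE:-}")
```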

# 3. 安全加固 (默认更严格)
lsmod | grep -E "selinux|apparmor"
# 可能需要放宽的应用场景
semanage fcontext -a -t httpd_sys_rw_content_t "/data/app(/.*)?"
restorecon -Rv /data/app

# 4. 内核参数 (定制差异)
sysctl -a | grep -E "tcp|netfilter"

第十三章:应急响应

13.1 应急响应流程

发现 → 判断 → 止损 → 排查 → 恢复 → 复盘
(1min)(5min)  (立即)  (1h)   (2h)   (24h)

13.2 主机被入侵应急

# 1. 立即隔离 (断网)
ifdown ens33  # 或 iptables -P INPUT DROP (注意: 会同时断开自己的 SSH, 需先确认有控制台/带外通道)

# 2. 保留现场关键信息
w                   # 当前登录用户
last -20            # 登录历史
lastb -20            # 失败登录
history             # 当前 shell 历史

# 3. 检查异常进程
ps auxf
ps -eo pid,ppid,user,cmd --sort=-%cpu | head -20
ls -la /proc/*/exe 2>/dev/null | grep deleted   # 已删除的可执行文件

# 4. 检查异常网络连接
ss -tanp
ss -tanp | grep ESTAB | awk '{print $5}' | sort | uniq -c | sort -rn

# 5. 检查异常文件
find / -type f -mtime -1 -ls 2>/dev/null          # 近 24h 修改
find / -type f \( -perm -4000 -o -perm -2000 \) 2>/dev/null  # suid/sgid
find / -name ".*" -type f -size +1M 2>/dev/null    # 大隐藏文件

# 6. 检查定时任务
crontab -l
ls -la /var/spool/cron/
cat /etc/crontab

# 7. 检查 SSH
grep -vE "^#|^$" /etc/ssh/sshd_config
cat /root/.ssh/authorized_keys
find / -name "authorized_keys" 2>/dev/null

# 8. 检查用户变化
grep -E ":0:" /etc/passwd   # UID 0 的用户
lastlog

13.3 DDoS 攻击应急

# 1. 确认攻击特征
ss -tan | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -20
ss -tan state syn-recv | wc -l

# 2. 源 IP 封禁
iptables -I INPUT -s <IP> -j DROP

# 3. 限制单 IP 并发
iptables -A INPUT -p tcp --dport 80 -m connlimit \
    --connlimit-above 50 --connlimit-mask 32 -j DROP

# 4. SYN Cookie 加固
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
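sysctl -w 仅当前生效, 重启即失效; 确认有效后应固化到 /etc/sysctl.d (以下为示例片段, somaxconn 为常见的配套调整, 数值按实际压测结果取):

```conf
# /etc/sysctl.d/99-anti-synflood.conf
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 8192
net.core.somaxconn = 8192
```

写入后执行 sysctl --system 加载。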

# 5. Nginx 限速
# limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
# limit_conn_zone $binary_remote_addr zone=addr:10m;

13.4 数据库被删库应急

# 1. 立即停止数据库 (不要 kill -9)
systemctl stop mysqld

# 2. 备份当前所有文件
tar -czf /backup/mysql_data_$(date +%Y%m%d_%H%M).tar.gz /var/lib/mysql/

# 3. 停止应用
systemctl stop app_service

# 4. 检查备份可用性
ls -lh /backup/mysql/

# 5. 恢复最近备份 + binlog (MySQL)
mysqlbinlog --start-datetime="2024-01-01 00:00:00" binlog.000010 > recover.sql

13.5 全站 502/503 应急

# 1. 检查后端服务
systemctl status <service>
netstat -tlnp | grep <port>

# 2. 检查资源
free -h
df -h
top -bn1 | head -5

# 3. 查看错误日志
tail -100 /var/log/nginx/error.log
tail -100 /var/log/php-fpm/error.log 2>/dev/null

# 4. 数据库连接
mysql -e "SHOW PROCESSLIST;"

# 5. 快速恢复
systemctl restart <service>
systemctl reload nginx

# 6. 临时扩容
sed -i 's/worker_connections .*/worker_connections 10240;/' /etc/nginx/nginx.conf
nginx -s reload

13.6 事件复盘模板

## 故障复盘报告

### 基本信息
- 故障时间: 2024-XX-XX XX:XX ~ XX:XX (持续 XX 分钟)
- 影响范围: XX 服务不可用 / XX 功能异常
- 影响用户: 约 XX 用户

### 故障时间线
| 时间 | 事件 |
|------|------|
| 14:30 | 监控告警触发 |
| 14:32 | 运维确认故障 |
| 14:35 | 定位原因 |
| 14:45 | 修复方案确认 |
| 14:50 | 修复完成,服务恢复 |

### 根因分析
- 直接原因: 
- 根本原因: 
- 5 Whys: ...

### 改进措施
| 序号 | 措施 | 责任人 | 截止日期 |
|------|------|--------|----------|
| 1 | | | |

第十四章:运维脚本工具集

14.1 SSH 批量管理

#!/bin/bash
# ssh_batch.sh - 批量 SSH 执行命令
# 用法: ssh_batch.sh "uptime"

HOSTS_FILE="/opt/scripts/hosts.txt"
SSH_USER="ops"
SSH_PORT="2222"
SSH_KEY="/home/ops/.ssh/id_rsa"

[ -z "$1" ] && { echo "用法: $0 <command>"; exit 1; }

while read -r host; do
    [[ -z "$host" || "$host" =~ ^# ]] && continue
    echo "===== $host ====="
    ssh -p "$SSH_PORT" -i "$SSH_KEY" -o StrictHostKeyChecking=no \
        -o ConnectTimeout=5 "$SSH_USER@$host" "$1" 2>&1
    echo
done < "$HOSTS_FILE"
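主机多时串行循环太慢, 核心的并发模式是 xargs -P (run_parallel 为示例函数; 注意 {} 会被拼进 sh -c, 仅适用于可信的主机清单):

```shell
# run_parallel: 对标准输入的每一行并发执行命令模板, {} 会被替换为该行内容
# 参数 1: 并发数; 其余参数: 命令模板
run_parallel() {
    local jobs="$1"; shift
    xargs -P "$jobs" -I{} sh -c "$*"
}

# 用法 (与上面脚本的变量配合):
# grep -vE '^#|^$' "$HOSTS_FILE" | run_parallel 10 \
#     'echo "===== {} ====="; ssh -o ConnectTimeout=5 ops@{} uptime'
```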

14.2 SSL 证书自动检查

#!/bin/bash
# check_certs.sh - SSL 证书到期检查
# crontab: 0 8 * * * /opt/scripts/check_certs.sh

DOMAINS=(
    "example.com:443"
    "api.example.com:443"
)
ALERT_DAYS=30
WEBHOOK="https://hooks.slack.com/services/xxx"

for entry in "${DOMAINS[@]}"; do
    domain="${entry%:*}"
    port="${entry#*:}"

    expiry=$(echo | openssl s_client -servername "$domain" \
        -connect "$domain:$port" 2>/dev/null | \
        openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)

    [ -z "$expiry" ] && { echo "ERROR: $domain 无法获取"; continue; }

    expiry_ts=$(date -d "$expiry" +%s)
    now_ts=$(date +%s)
    remain_days=$(( ($expiry_ts - $now_ts) / 86400 ))

    echo "$domain: 剩余 $remain_days 天"

    if [ "$remain_days" -lt "$ALERT_DAYS" ]; then
        message="[紧急] $domain 证书将在 $remain_days 天后过期!"
        curl -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\"$message\"}" "$WEBHOOK"
    fi
done

14.3 自动清理脚本

#!/bin/bash
# cleanup.sh - 系统自动清理
# crontab: 0 3 * * * /opt/scripts/cleanup.sh

LOG_FILE="/var/log/cleanup.log"
RETENTION_DAYS=30

log() { echo "[$(date '+%F %T')] $*" | tee -a "$LOG_FILE"; }

log "=== 开始清理 ==="

# 清理旧日志
find /var/log -type f \( -name "*.log.*" -o -name "*.gz" \) -mtime +$RETENTION_DAYS -delete 2>/dev/null

# journald 清理
journalctl --vacuum-size=1G --vacuum-time=${RETENTION_DAYS}d 2>/dev/null

# 清理 /tmp
find /tmp -type f -mtime +7 -delete 2>/dev/null

# 清理包管理缓存
yum clean all 2>/dev/null; apt-get clean 2>/dev/null

# 清理旧内核 (CentOS)
command -v package-cleanup &>/dev/null && package-cleanup --oldkernels --count=2 -y 2>/dev/null

# 清理 core dump
find /var/lib/systemd/coredump -type f -mtime +7 -delete 2>/dev/null

# 磁盘空间警告
DISK_USAGE=$(df / | awk 'NR==2{print $5}' | sed 's/%//')
[ "$DISK_USAGE" -gt 85 ] && log "警告: 根分区使用率 $DISK_USAGE%"

log "=== 清理完成 ==="

14.4 进程守护脚本

#!/bin/bash
# process_guard.sh - 进程守护
# crontab: */1 * * * * /opt/scripts/process_guard.sh

PROCESSES=(
    "nginx"
    "mysqld"
    "sshd"
)

for proc in "${PROCESSES[@]}"; do
    if ! pgrep -x "$proc" > /dev/null; then
        echo "[$(date)] $proc 未运行, 尝试启动..."
        systemctl restart "$proc" 2>/dev/null || systemctl start "$proc" 2>/dev/null

        sleep 3
        if pgrep -x "$proc" > /dev/null; then
            echo "[$(date)] $proc 启动成功"
        else
            echo "[$(date)] $proc 启动失败!!"
            curl -X POST -H 'Content-type: application/json' \
                --data "{\"text\":\"[$(hostname)] 进程 $proc 启动失败!\"}" \
                "https://hooks.slack.com/services/xxx"
        fi
    fi
done

14.5 全量备份脚本 (整合版)

#!/bin/bash
# full_backup.sh - 全量备份 (文件 + 数据库)
# crontab: 0 1 * * * /opt/scripts/full_backup.sh

BACKUP_BASE="/backup"
DATE=$(date +%Y%m%d)
LOG_FILE="$BACKUP_BASE/backup_$DATE.log"
RETENTION=7
BACKUP_PASS="your_backup_password"

log() { echo "[$(date '+%F %T')] $*" | tee -a "$LOG_FILE"; }

log "=== 全量备份开始 ==="
mkdir -p "$BACKUP_BASE/$DATE"

# 1. 文件备份 (restic)
export RESTIC_REPOSITORY="$BACKUP_BASE/restic"
export RESTIC_PASSWORD="$BACKUP_PASS"
log "执行 restic 备份..."
restic backup /data /etc /opt/scripts 2>&1 | tee -a "$LOG_FILE"
restic forget --keep-daily 7 --keep-weekly 4 --prune 2>&1 | tee -a "$LOG_FILE"

# 2. MySQL 备份
if command -v mysqldump &>/dev/null; then
    log "执行 MySQL 备份..."
    mysqldump --all-databases --single-transaction \
        --routines --triggers --events \
        --set-gtid-purged=OFF \
        | gzip > "$BACKUP_BASE/$DATE/mysql_all.sql.gz"
    log "MySQL 备份完成: $(ls -lh $BACKUP_BASE/$DATE/mysql_all.sql.gz | awk '{print $5}')"
fi

# 3. PostgreSQL 备份
if command -v pg_dumpall &>/dev/null; then
    log "执行 PostgreSQL 备份..."
    sudo -u postgres pg_dumpall | gzip > "$BACKUP_BASE/$DATE/postgres_all.sql.gz"
    log "PG 备份完成: $(ls -lh $BACKUP_BASE/$DATE/postgres_all.sql.gz | awk '{print $5}')"
fi

# 4. 清理旧备份
log "清理 ${RETENTION} 天前的备份..."
find "$BACKUP_BASE" -mindepth 1 -maxdepth 1 -type d -mtime +$RETENTION -exec rm -rf {} +

# 5. 远程同步
if [ -n "$REMOTE_BACKUP_HOST" ]; then
    log "同步到远程..."
    rsync -avz --delete "$BACKUP_BASE/" "backup@$REMOTE_BACKUP_HOST:/backup/$(hostname)/"
fi

log "=== 全量备份完成 ==="

第十五章:运维最佳实践

15.1 目录与命名规范

目录规范

/
├── opt/
│   ├── app/                 # 应用目录
│   │   ├── bin/             # 可执行文件
│   │   ├── conf/            # 配置文件
│   │   └── lib/             # 库文件
│   └── scripts/             # 运维脚本
├── data/                    # 应用数据 (独立分区)
│   ├── app/                 # 应用数据
│   └── backup/              # 备份
├── var/log/
│   └── app/                 # 应用日志
└── etc/
    └── app/                 # 应用配置

命名规范

| 类型 | 规范 | 示例 |
|------|------|------|
| 主机名 | 环境-业务-序号 | prod-web-01, stg-db-02 |
| DNS | 服务.环境.域名 | api.prod.example.com |
| 端口 | 统一规划 | Web: 80xx, API: 81xx, DB: 3306/5432 |
| 用户 | app_<name> | app_web, app_worker |
| 备份文件 | type_YYYYMMDD | mysql_all_20240101.sql.gz |
| 脚本命名 | 动词_对象 | start_app.sh, backup_db.sh |
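主机名规范可以做成校验函数放进主机初始化脚本 (check_hostname 为示例, 正则按团队实际规范调整):

```shell
# check_hostname: 校验 "环境-业务-序号" 命名 (小写字母/数字, 两位序号)
check_hostname() {
    echo "$1" | grep -Eq '^(prod|stg|dev|test)-[a-z0-9]+-[0-9]{2}$'
}

# 用法: check_hostname "$(hostname)" || echo "主机名不符合规范"
```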

15.2 变更管理

# 变更前
# 1. 备份当前配置
cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.$(date +%Y%m%d_%H%M)

# 2. 灰度验证路径
# 测试环境 → 预发布 → 生产一台 → 全量

# 3. 变更记录日志
# /opt/docs/changelog.md

# 变更后
# 4. 保留回滚方案
# 一键回滚: cp /etc/nginx/nginx.conf.OLD /etc/nginx/nginx.conf && nginx -s reload

# 5. 验证
curl -I https://example.com
ansible all -m shell -a "systemctl is-active nginx"

15.3 监控告警最佳实践

1. 告警分级
   P0 (紧急): 核心服务不可用 → 电话 + 即时消息
   P1 (严重): 核心功能降级 → 即时消息
   P2 (警告): 预警指标 → 邮件 / 群消息
   P3 (通知): 信息性 → 群消息 (静默)

2. 告警设计原则
   - 每条告警必须可操作 (不能是"指标高了")
   - 告警必须有 runbook (怎么处理)
   - 避免告警疲劳 (同一问题聚合)
   - 静默期 (维护窗口)

3. 值班制度
   - 主值 + 备值
   - 明确升级路径
   - 记录每次告警的处理情况

15.4 日常巡检清单

#!/bin/bash
# daily_check.sh - 每日巡检

echo "===== 每日巡检: $(date) ====="

echo -e "\n--- 系统状态 ---"
uptime
free -h | grep -E "^Mem|^Swap"

echo -e "\n--- 磁盘 ---"
df -h | grep -vE "^tmpfs|^devtmpfs|^overlay"

echo -e "\n--- 最近错误日志 ---"
journalctl -p err --since "24 hours ago" --no-pager | tail -20

echo -e "\n--- 关键服务 ---"
for svc in nginx mysqld sshd postgresql docker; do
    if systemctl is-active --quiet $svc 2>/dev/null; then
        echo "  ✓ $svc active"
    elif systemctl is-enabled --quiet $svc 2>/dev/null; then
        echo "  ✗ $svc INACTIVE!"
    fi
done

echo -e "\n--- 备份检查 ---"
ls -lh /backup/ | tail -5

echo -e "\n--- 连接状态 ---"
ss -s

echo -e "\n--- 最近登录 ---"
last -5

15.5 运维能力矩阵

| 能力等级 | 技能要求 | 典型工具 |
|----------|----------|----------|
| 初级 | Linux 基础命令、服务启停、简单排错 | top, journalctl, systemctl |
| 中级 | 监控搭建、自动化部署、性能调优、安全加固 | Prometheus, Ansible, Nginx HA |
| 高级 | 架构设计、灾难恢复、全链路压测、信创适配 | K8s, Terraform, MySQL HA |
| 专家 | 多活架构、运维平台开发、SRE 体系、成本优化 | 自研运维平台, eBPF, 混沌工程 |

参考资源: 本手册与仓库中的以下手册配合使用效果更佳:

  • Linux-使用手册.md — Linux 基础命令与发行版对比
  • Docker-使用手册.md — Docker 容器化
  • Kubernetes-使用手册.md — K8s 编排
  • Nginx-使用手册.md — Nginx 使用
  • MySQL-使用手册.md — MySQL 数据库
  • PostgreSQL-使用手册.md — PostgreSQL 数据库