
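Prometheus + Grafana Monitoring Stack (Enterprise Rollout with Ansible)

Set in the same fictional enterprise as "Ansible in Enterprise Practice", this article builds a general-purpose metrics monitoring and alerting stack from Prometheus, Alertmanager, and Grafana. Ansible automates the installation and configuration, covering exporter deployment, scrape configuration and rules, dashboards, and alert delivery.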


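1. Goals and Scope

  • Goal: collect host metrics, service availability, and business liveness probes in one place, then visualize and alert on them centrally
  • Components:
      • Node Exporter (host metrics)
      • Blackbox Exporter (HTTP/TCP/ICMP probing)
      • Prometheus Server (scraping plus recording and alerting rules)
      • Alertmanager (routing and notification)
      • Grafana (visualization and dashboards)
  • Principle: deploy in bulk with Ansible roles that are idempotent, self-describing, and easy to roll back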

2. Ansible Role Layout (Example)

roles/
├─ exporters_node/
│  └─ tasks/main.yml  # install Node Exporter
├─ exporters_blackbox/
│  └─ tasks/main.yml  # install Blackbox Exporter
├─ prometheus_server/
│  ├─ tasks/main.yml
│  └─ templates/{prometheus.yml.j2, rules.yml.j2}
├─ alertmanager/
│  ├─ tasks/main.yml
│  └─ templates/alertmanager.yml.j2
└─ grafana/
   └─ tasks/main.yml
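
The role layout above still needs a playbook that maps roles to hosts. A minimal sketch follows; the file name `site-monitoring.yml` and the inventory group `monitoring` are assumptions for illustration and do not come from the original scenario. The Blackbox Exporter is placed on the monitoring server because the Prometheus configuration in section 4 reaches it at 127.0.0.1:9115.

```yaml
# site-monitoring.yml (hypothetical file name)
- name: Deploy Node Exporter to every managed host
  hosts: all
  become: true
  roles:
    - exporters_node

- name: Deploy the monitoring server components
  hosts: monitoring        # assumed inventory group
  become: true
  roles:
    - exporters_blackbox
    - prometheus_server
    - alertmanager
    - grafana
```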

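3. Exporter Deployment (Snippets)

Node Exporter (roles/exporters_node/tasks/main.yml):

- name: Download and unpack node_exporter
  ansible.builtin.unarchive:
    src: https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
    dest: /opt/
    remote_src: true
    # Skip the download when the binary is already unpacked (keeps the task idempotent)
    creates: /opt/node_exporter-1.8.1.linux-amd64/node_exporter
- name: Install the systemd unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/node_exporter.service
    content: |
      [Unit]
      Description=Node Exporter
      [Service]
      ExecStart=/opt/node_exporter-1.8.1.linux-amd64/node_exporter
      [Install]
      WantedBy=multi-user.target
- name: Start and enable node_exporter
  ansible.builtin.systemd:
    name: node_exporter
    state: started
    enabled: true
    daemon_reload: true  # make systemd pick up the newly written unit file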
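
With `state: started`, a later change to the unit file (for example after a version bump) does not restart a service that is already running. One common way to handle this is a handler; the sketch below assumes a `handlers/main.yml` file in the role, which the layout in section 2 does not list.

```yaml
# roles/exporters_node/tasks/main.yml (excerpt): notify on unit changes
- name: Install the systemd unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/node_exporter.service
    content: |
      [Unit]
      Description=Node Exporter
      [Service]
      ExecStart=/opt/node_exporter-1.8.1.linux-amd64/node_exporter
      [Install]
      WantedBy=multi-user.target
  notify: Restart node_exporter

# roles/exporters_node/handlers/main.yml (assumed file, not in the original layout)
- name: Restart node_exporter
  ansible.builtin.systemd:
    name: node_exporter
    state: restarted
    daemon_reload: true
```

The handler runs once at the end of the play, and only when the unit file actually changed, so repeated runs stay idempotent.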

Blackbox Exporter (roles/exporters_blackbox/tasks/main.yml):

- name: Download and unpack blackbox_exporter
  ansible.builtin.unarchive:
    src: https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
    dest: /opt/
    remote_src: true
    # Skip the download when the binary is already unpacked
    creates: /opt/blackbox_exporter-0.25.0.linux-amd64/blackbox_exporter
- name: Install the systemd unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/blackbox_exporter.service
    content: |
      [Unit]
      Description=Blackbox Exporter
      [Service]
      ExecStart=/opt/blackbox_exporter-0.25.0.linux-amd64/blackbox_exporter --config.file=/opt/blackbox.yml
      [Install]
      WantedBy=multi-user.target
- name: Configure the probe modules
  ansible.builtin.copy:
    dest: /opt/blackbox.yml
    content: |
      modules:
        http_2xx:
          prober: http
          timeout: 5s
- name: Start and enable blackbox_exporter
  ansible.builtin.systemd:
    name: blackbox_exporter
    state: started
    enabled: true
    daemon_reload: true  # make systemd pick up the newly written unit file
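
Section 1 lists HTTP, TCP, and ICMP probing, while the snippet above defines only the HTTP module. A possible extension of `/opt/blackbox.yml` is sketched below; the module names `tcp_connect` and `icmp` are chosen for illustration, while `tcp` and `icmp` are prober types provided by the Blackbox Exporter.

```yaml
modules:
  http_2xx:
    prober: http
    timeout: 5s
  tcp_connect:        # illustrative module name
    prober: tcp
    timeout: 5s
  icmp:               # illustrative module name
    prober: icmp
    timeout: 5s
```

ICMP probing usually needs elevated privileges (for example the CAP_NET_RAW capability), so the service may need extra permissions before this module works.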

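4. Prometheus Server and Rules

roles/prometheus_server/templates/prometheus.yml.j2:

global:
  scrape_interval: 15s
# Load the rule file deployed by this role; without this entry the rules are never evaluated
rule_files:
  - /etc/prometheus/rules.yml
scrape_configs:
  - job_name: 'nodes'
    static_configs:
      # Node Exporter listens on port 9100 by default; a target without a port would not reach it
      - targets:
{% for host in groups['all'] %}
          - "{{ hostvars[host].ansible_host | default(host) }}:9100"
{% endfor %}
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://www.example.com','http://10.10.10.11']
    relabel_configs:
      # Pass the original target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the Blackbox Exporter itself instead of the probed target
      - target_label: __address__
        replacement: 127.0.0.1:9115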
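rules.yml.j2 (recording and alerting snippet):

groups:
- name: host.rules
  rules:
  # CPU utilization as a 0-1 fraction: 1 minus the average idle share across all cores
  - record: node:cpu_util:avg5m
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
  - alert: HostHighCPU
    expr: node:cpu_util:avg5m > 0.8
    for: 10m
    labels: {severity: warning}
    annotations:
      # raw/endraw keeps Ansible's Jinja2 from consuming the Prometheus template braces
      summary: "{% raw %}High CPU usage on {{ $labels.instance }}{% endraw %}"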
roles/prometheus_server/tasks/main.yml:

- name: Create the configuration directory
  ansible.builtin.file: { path: /etc/prometheus, state: directory }
- name: Deploy the main configuration
  ansible.builtin.template: { src: prometheus.yml.j2, dest: /etc/prometheus/prometheus.yml }
- name: Deploy the rules
  ansible.builtin.template: { src: rules.yml.j2, dest: /etc/prometheus/rules.yml }
# Binary installation omitted: download and unpack as for the exporters
- name: Install the systemd unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/prometheus.service
    content: |
      [Unit]
      Description=Prometheus
      [Service]
      ExecStart=/opt/prometheus/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
      [Install]
      WantedBy=multi-user.target
- name: Start and enable Prometheus
  ansible.builtin.systemd: { name: prometheus, state: started, enabled: true, daemon_reload: true }
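
A broken template would otherwise only surface when Prometheus fails to load it. As a sketch, the `validate` option of the template module can run promtool against the rendered file before it replaces the live one, and a handler can trigger the hot reload. The promtool path assumes the Prometheus tarball was unpacked to /opt/prometheus (the same directory as the binary in the unit file), and the handlers file is an assumption, not part of the original layout.

```yaml
# roles/prometheus_server/tasks/main.yml (excerpt)
- name: Deploy the rules (validated before replacing the live file)
  ansible.builtin.template:
    src: rules.yml.j2
    dest: /etc/prometheus/rules.yml
    validate: /opt/prometheus/promtool check rules %s    # assumed promtool location
  notify: Reload Prometheus

- name: Deploy the main configuration (validated before replacing the live file)
  ansible.builtin.template:
    src: prometheus.yml.j2
    dest: /etc/prometheus/prometheus.yml
    validate: /opt/prometheus/promtool check config %s   # assumed promtool location
  notify: Reload Prometheus

# roles/prometheus_server/handlers/main.yml (assumed file)
- name: Reload Prometheus
  ansible.builtin.uri:
    url: http://localhost:9090/-/reload   # available because of --web.enable-lifecycle
    method: POST
```

If validation fails, the task fails and the existing file is left in place, so a running Prometheus keeps its last known-good configuration.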

5. Alertmanager and Notifications

roles/alertmanager/templates/alertmanager.yml.j2:

route:
  receiver: default
receivers:
- name: default
  email_configs:
  - to: ops@example.com
    from: monitor@example.com
    smarthost: smtp.example.com:25
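
The configuration above sends every alert to one receiver. Since the rules attach a `severity` label, routing can branch on it. The sketch below is one possible extension: the `oncall` receiver and its webhook URL are placeholders invented for illustration, and the timing values are examples rather than recommendations.

```yaml
route:
  receiver: default
  group_by: ['alertname', 'instance']   # batch related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall                  # hypothetical receiver name
receivers:
- name: default
  email_configs:
  - to: ops@example.com
    from: monitor@example.com
    smarthost: smtp.example.com:25
- name: oncall
  webhook_configs:
  - url: http://127.0.0.1:5001/alert    # placeholder endpoint
```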

roles/alertmanager/tasks/main.yml:

# Installation and configuration omitted: same pattern as Prometheus
- name: Start and enable Alertmanager
  ansible.builtin.systemd: { name: alertmanager, state: started, enabled: true, daemon_reload: true }

6. Grafana Deployment

Quick option (Docker):

docker run -d --name=grafana -p 3000:3000 grafana/grafana:10.4.1

Or install the .deb package:

# Debian/Ubuntu example
sudo apt-get install -y adduser libfontconfig1
wget https://dl.grafana.com/oss/release/grafana_10.4.1_amd64.deb
sudo dpkg -i grafana_10.4.1_amd64.deb
sudo systemctl enable --now grafana-server

Data source configuration (through the Grafana UI or provisioning):

apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  url: http://prometheus:9090
  access: proxy
  isDefault: true
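
To keep the data source under Ansible control, the provisioning file can be written by the grafana role. The sketch below assumes the package-based install, where Grafana reads provisioning files from /etc/grafana/provisioning/datasources/. The handler and its file are assumptions, and the `prometheus` hostname is taken from the snippet above.

```yaml
# roles/grafana/tasks/main.yml (excerpt)
- name: Provision the Prometheus data source
  ansible.builtin.copy:
    dest: /etc/grafana/provisioning/datasources/prometheus.yml
    content: |
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus:9090
        access: proxy
        isDefault: true
  notify: Restart grafana

# roles/grafana/handlers/main.yml (assumed file)
- name: Restart grafana
  ansible.builtin.systemd:
    name: grafana-server
    state: restarted
```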

Import commonly used dashboards, such as Node Exporter Full (ID: 1860).


7. Verification and Operations

# Hot-reload the Prometheus configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Check the rule file
promtool check rules /etc/prometheus/rules.yml
# Inspect exporter metrics and probe results
curl -s localhost:9100/metrics | head
curl -s "localhost:9115/probe?module=http_2xx&target=http://10.10.10.11" | head

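8. Staying Consistent with Ansible

  • Use the same inventory and groups so that business hosts become scrape targets automatically
  • Manage configuration through roles, with support for multiple environments (prod/dev) and idempotent runs
  • Combine change windows with --limit/--tags for staged rollout and rollback
  • Make monitoring one of the acceptance steps in the release pipeline (probing and alert suppression can be automated)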
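
The acceptance step in the last bullet can be sketched as a small play that polls each component's health endpoint after a deployment. The `monitoring` group is an assumed inventory group, and the ports are the components' defaults.

```yaml
- name: Post-deployment acceptance checks (sketch)
  hosts: monitoring            # assumed inventory group for the monitoring server
  gather_facts: false
  tasks:
    - name: Wait until each component reports healthy
      ansible.builtin.uri:
        url: "{{ item }}"
        status_code: 200
      register: health
      until: health is succeeded
      retries: 5
      delay: 3
      loop:
        - http://localhost:9090/-/ready      # Prometheus
        - http://localhost:9093/-/ready      # Alertmanager
        - http://localhost:3000/api/health   # Grafana
```

A failed check fails the play, which lets a release pipeline stop or roll back before the change window closes.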