[Nightingale Monitoring Series 6] Nightingale Production Deployment, Part 4

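1 blackbox-exporter

1.1 Preface

blackbox_exporter is one of the exporters officially maintained by the Prometheus project. It collects monitoring data by probing endpoints over HTTP, DNS, TCP, and ICMP. blackbox_exporter can be used for the following probes:

  • HTTP probe: set request headers and check the HTTP status, HTTP response headers, and HTTP body content
  • TCP probe: check the port status of business components, and define and check application-layer protocols
  • ICMP probe: host liveness checks
  • POST probe: API connectivity
  • SSL certificate expiry probe

GitHub: github.com/prometheus/…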

1.2 Download

This article installs with Helm. For a binary installation, see github.com/prometheus/…

# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# helm fetch prometheus-community/prometheus-blackbox-exporter --version 7.7.0
# tar xf prometheus-blackbox-exporter-7.7.0.tgz

1.3 Configuration

# vim values.yaml
hostNetwork: true  # use hostNetwork so the pod reaches targets through the host's network
image:
  repository: prom/blackbox-exporter
  tag: v0.23.0
  pullPolicy: IfNotPresent

securityContext:
  #runAsUser: 1000
  runAsUser: 0
  #runAsGroup: 1000
  runAsGroup: 0
  #readOnlyRootFilesystem: true
  readOnlyRootFilesystem: false
  #runAsNonRoot: true
  runAsNonRoot: false
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]

resources: {}
  # limits:
  #   memory: 300Mi
  # requests:
  #   memory: 50Mi

service:
  #type: ClusterIP
  type: NodePort
  port: 9115
  nodePort: 30015
containerPort: 9115

replicas: 1

config:
  modules:
    http_2xx:
      prober: http
      timeout: 15s # raised from 5s to 15s to avoid abnormal response values caused by request timeouts
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        follow_redirects: true
        preferred_ip_protocol: "ip4"
    icmp:            # the icmp, tcp_connect, and http_post_2xx modules are added here for ping, port, and HTTP POST probes respectively
      prober: icmp
      timeout: 5s
    tcp_connect:
      prober: tcp
      timeout: 5s
    http_post_2xx:
      prober: http
      timeout: 15s # raised from 5s to 15s to avoid abnormal response values caused by request timeouts
      http:
        method: POST

hostAliases: []


# vim templates/service.yaml
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      nodePort: {{ .Values.service.nodePort }}  # nodePort added so the blackbox web page can be opened in a browser later

1.4 Installation

# helm upgrade -i prometheus-blackbox-exporter -n monitoring .

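1.5 Cross-network access

If the company runs several k8s clusters on different network segments that cannot reach one another, blackbox cannot probe targets in the other clusters. There are two ways to solve this:

  • Option 1: connect the clusters' networks directly. This involves network-level work that is not covered here.
  • Option 2: install openvpn-client in the current cluster and connect it to the other clusters, so that the networks can reach each other.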

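1.6 Configure Prometheus to use blackbox-exporter

In the Prometheus configmap of the main n9e deployment, configure the following two parts:

  • 1. Define the job_name entries for the http, tcp, ping, and ssl probes
# vim nightingle/templates/prometheus/configmap.yaml
      # http_get monitoring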
      - job_name: "blackbox_http_get_status"
        scrape_interval: 15s
        metrics_path: /probe
        params:
          module: [http_2xx]
        file_sd_configs:
        - refresh_interval: 1m
          files:
          - "http_get_url.yml"  # the HTTP URLs to probe are read from this yml file
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115  # must be the service address and port of blackbox-exporter, otherwise the request is not forwarded to blackbox

      # SSL certificate expiry monitoring
      - job_name: "blackbox_ssl_expiry_status"
        scrape_interval: 15s
        metrics_path: /probe
        params:
          module: [http_2xx]
        file_sd_configs:
        - refresh_interval: 1m
          files:
          - "ssl_expiry_url.yml"

        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115

      # port monitoring
      - job_name: "blackbox_port_status"
        scrape_interval: 15s
        metrics_path: /probe
        params:
          module: [tcp_connect]
        file_sd_configs:
        - refresh_interval: 1m
          files:
          -  "tcp.yml"
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115

      # ping monitoring
      - job_name: "blackbox_ping_status"
        scrape_interval: 15s
        metrics_path: /probe
        params:
          module: [icmp]
        file_sd_configs:
        - refresh_interval: 1m
          files:
          -  "ping.yml"
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115
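  • 2. Define the contents of the yml files for the http, tcp, ping, and ssl probes
# vim nightingle/templates/prometheus/configmap.yaml
data:
  http_get_url.yml: |  # file name, referenced by files in the job defined above
    ################### business services #################
    - targets:
      - 'http://www.example.com'  # URL for the HTTP GET probe
      labels:   # custom labels attached to the series returned by PromQL queries; define them to fit your own needs and scenarios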
        externalip: '1.1.1.1'
        internalip: '172.20.0.1'
        service: "example"
    ################### ops services #################
    - targets:
      - 'http://jenkins.example.com/login?from=%2F'
      labels:
        externalip: '1.1.1.2'
        internalip: '172.20.0.2'
        service: "jenkins"
    ################### monitoring #################
    - targets:
      - 'http://n9e.example.com.cn'
      labels:
        externalip: '1.1.1.3'
        internalip: '172.20.0.3'
        service: "n9e"
        hostname: "k8s-ops-1"

  ssl_expiry_url.yml: |
    ###################ssl-expiry#################
    - targets:
      - 'https://www.example.com'
      labels:
        externalip: '1.1.1.1'
        internalip: '172.20.0.1'
        service: "example"

  ping.yml: |
    ################### China Mobile Cloud #################
    - targets:
      - '172.20.0.3'
      labels:
        internalip: '172.20.0.3'
        externalip: '1.1.1.3'
        hostname: 'k8s-ops-1'


  tcp.yml: |
    ################### monitoring service ports #################
    - targets:
      - '172.20.0.3:30007'
      labels:
        internalip: '172.20.0.3'
        hostname: 'k8s-ops-1'
        service: "n9e"

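After the configuration is done, upgrade the n9e release again, then restart the nightingale-prometheus-0 pod so that it reloads prometheus.yaml from the configmap. Then check the targets page in Prometheus (see the figure below).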

image.png

1.7 Verify in blackbox

Open IP:30015 in a browser. The page looks like this:

image.png
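
The same check can also be run from a terminal. The sketch below builds the /probe URL that Prometheus ends up requesting once the relabel_configs in 1.6 have copied the target into the target parameter. The helper name probe_url and the NODE_IP placeholder are made up for this illustration; port 30015 and the module names come from the configuration above.

```shell
#!/bin/sh
# probe_url EXPORTER MODULE TARGET
# Prints the blackbox /probe URL for the given exporter address, module and target.
# The target is not URL-encoded here, so this only suits simple targets.
probe_url() {
  exporter="$1"
  module="$2"
  target="$3"
  printf 'http://%s/probe?module=%s&target=%s\n' "$exporter" "$module" "$target"
}

# Example calls, left commented out because they need network access to the exporter.
# NODE_IP is a placeholder for the address of any node in the cluster.
# curl -s "$(probe_url "NODE_IP:30015" http_2xx "http://www.example.com")" | grep '^probe_success'
#
# For targets that contain "?" or "&", let curl encode the parameters instead:
# curl -sG "http://NODE_IP:30015/probe" \
#   --data-urlencode "module=http_2xx" \
#   --data-urlencode "target=http://jenkins.example.com/login?from=%2F"
```

A probe_success value of 1 in the response means the probe passed, which is the same metric the alert rules in 1.9 evaluate.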

1.8 Dashboards

1.9 Alerts

  • http-get probe
Name: http-get probe
PromQL: avg_over_time(probe_success{job="blackbox_http_get_status"}[5m]) < 0.5
alert_type=probe
level: 2
  • ssl-expiry probe
Name: ssl-expiry probe
Note: the SSL certificate is about to expire
PromQL: (probe_ssl_earliest_cert_expiry{job="blackbox_ssl_expiry_status"}-time())/3600/24 < 15
alert_type=probe
level: 2
  • ping probe
Name: ping probe
PromQL: probe_success{job="blackbox_ping_status"} == 0
alert_type=probe
level: 2
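
The ssl-expiry expression is plain arithmetic: probe_ssl_earliest_cert_expiry is a Unix timestamp, so subtracting time() gives the remaining seconds, and dividing by 3600 and then 24 turns that into days. A minimal shell sketch of the same calculation follows. The helper name days_left is hypothetical, and the sketch uses integer division, whereas the alert rule works with floating-point values.

```shell
#!/bin/sh
# days_left EXPIRY_EPOCH [NOW_EPOCH]
# Prints the whole number of days between NOW_EPOCH (default: current time)
# and EXPIRY_EPOCH, mirroring (expiry - time()) / 3600 / 24 from the alert rule.
days_left() {
  expiry="$1"
  now="${2:-$(date +%s)}"
  echo $(( (expiry - now) / 86400 ))
}

# Example, left commented out because it needs network access and GNU date:
# enddate=$(openssl s_client -connect www.example.com:443 -servername www.example.com </dev/null 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# days_left "$(date -d "$enddate" +%s)"
```

With the threshold used above, the alert corresponds to days_left printing a value below 15.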

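2 MySQL data backup

n9e stores its alert rules, dashboard templates, user information, and other data in MySQL (see the figure below), so the MySQL data must be backed up. (If the database is hosted by a cloud provider, backups are usually already in place, which is more reassuring.)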

image.png

2.1 Install MySQL server

# mv /etc/apt/sources.list /etc/apt/sources.list.bak
# vim /etc/apt/sources.list
deb http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse

# apt-get update

# apt-get install mysql-server
# systemctl status mysql.service

2.2 MySQL backup script

#!/bin/bash
# Uses "function" and "[[ ]]", which are bash features, so run it with bash rather than a POSIX sh.
backupDir="/data/backup/n9e-data"
backupLog="${backupDir}/backup.log"
backupScriptDir="/usr/local/monitoring/"

user="root"
password="123456"
host="172.20.0.2"
port="30006"

lock="--single-transaction"

# Make sure the backup directory exists before anything is written to it
mkdir -p "${backupDir}"

function backup_sql(){
  timeNow=$(date +%Y%m%d-%H%M%S)
  dbname=$1
  backupName="${dbname}-${timeNow}.sql"
  # --column-statistics=0 avoids the error raised when a newer mysqldump dumps an older MySQL server
  /usr/bin/mysqldump -h"${host}" -P"${port}" -u"${user}" -p"${password}" $lock --default-character-set=utf8 --flush-logs --column-statistics=0 -R "$dbname" > "${backupDir}/${backupName}"
  if [[ $? -eq 0 ]];then
    cd "$backupDir"
    # Compress the dump, record its checksum and size, then delete the uncompressed file
    tar zcf "$backupName.tar.gz" "$backupName"
    md5sum "$backupName.tar.gz" > "$backupName.tar.gz.md5"
    size=$(du -sh "$backupName.tar.gz" | awk '{print $1}')
    rm -f "${backupDir}/${backupName}"
    echo "Info: [$(date +%Y%m%d-%H%M%S)] backup [$dbname]($size) successful !" >> "${backupLog}"
  else
    cd "$backupDir"
    #rm -rf $backupName
    echo "Error: [$(date +%Y%m%d-%H%M%S)] backup [$dbname] fail !"  >> "${backupLog}"
  fi
}

echo -e "\n===================" >> "${backupLog}"
echo -e "$(date '+%Y-%m-%d %H:%M:%S')" >> "${backupLog}"
echo -e "===================" >> "${backupLog}"
backup_sql "n9e_v5"
backup_sql "ibex"
echo "Info: [$(date +%Y%m%d-%H%M%S)] backup finished. log path: ${backupLog}, file path: ${backupDir}"

2.3 Scheduled task

Scheduled task:
# crontab -l
0 */12 * * * /bin/bash /usr/local/monitoring/n9e-database-backup.sh


Manual test run:
# /bin/bash /usr/local/monitoring/n9e-database-backup.sh
Info: [20230727-164245] backup finished. log path: /data/backup/n9e-data/backup.log, file path: /data/backup/n9e-data

# cat /data/backup/n9e-data/backup.log
===================
2023-07-27 16:41:28
===================
Info: [20230727-164135] backup [n9e_v5](2.4M) successful !
Info: [20230727-164142] backup [ibex](20K) successful !

===================
2023-07-27 16:42:32
===================
Info: [20230727-164239] backup [n9e_v5](2.4M) successful !
Info: [20230727-164245] backup [ibex](20K) successful !
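
A backup is only useful if it can be restored, so it is worth checking an archive against the .md5 file the script writes next to it. The following is a minimal sketch of that check; the function name verify_backup is hypothetical, and the restore command in the comments is an untested outline rather than part of the original procedure.

```shell
#!/bin/sh
# verify_backup ARCHIVE
# Checks ARCHIVE against ARCHIVE.md5. The backup script records a bare file name
# in the .md5 file, so the check has to run from the archive's own directory.
# Returns 0 when the checksum matches and non-zero otherwise.
verify_backup() {
  archive="$1"
  ( cd "$(dirname "$archive")" && md5sum -c "$(basename "$archive").md5" )
}

# Restore outline, left commented out because it overwrites the target database.
# Replace the YYYYmmdd-HHMMSS placeholder with the timestamp of a real archive.
# archive=/data/backup/n9e-data/n9e_v5-YYYYmmdd-HHMMSS.sql.tar.gz
# verify_backup "$archive" \
#   && tar -xzf "$archive" -O | mysql -h172.20.0.2 -P30006 -uroot -p n9e_v5
```

Running the check before every restore catches archives that were truncated or corrupted on disk.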

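That's all for this chapter, and it largely wraps up the main n9e content. Later I will also share my experience with n9e, my understanding of it, optimization ideas, JSON templates, and more~

Thank you very much for following along. If you have any questions, please leave a comment and I will reply as soon as I can. Thanks again, mwah~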
