[Nightingale Monitoring Series 6] Nightingale Production Deployment, Part 4
1 blackbox-exporter
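1.1 Introduction
blackbox_exporter is one of the exporters officially provided by Prometheus. It collects monitoring data by probing targets over HTTP, DNS, TCP, and ICMP, and supports the following probes:
- HTTP probe: set request headers and check the HTTP status code, response headers, and body content
- TCP probe: check the port status of service components, and define and check application-layer protocols
- ICMP probe: host liveness checks
- POST probe: API connectivity
- SSL certificate expiry probe
GitHub: github.com/prometheus/…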
1.2 Download
This guide installs with Helm. If you need a binary install, get it from github.com/prometheus/…
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# helm fetch prometheus-community/prometheus-blackbox-exporter --version 7.7.0
# tar xf prometheus-blackbox-exporter-7.7.0.tgz
1.3 Configuration
# vim values.yaml
hostNetwork: true  # use hostNetwork so the pod reaches targets through the host's network
image:
  repository: prom/blackbox-exporter
  tag: v0.23.0
  pullPolicy: IfNotPresent
securityContext:
  #runAsUser: 1000
  runAsUser: 0
  #runAsGroup: 1000
  runAsGroup: 0
  #readOnlyRootFilesystem: true
  readOnlyRootFilesystem: false
  #runAsNonRoot: true
  runAsNonRoot: false
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
resources: {}
  # limits:
  #   memory: 300Mi
  # requests:
  #   memory: 50Mi
service:
  #type: ClusterIP
  type: NodePort
  port: 9115
  nodePort: 30015
containerPort: 9115
replicas: 1
config:
  modules:
    http_2xx:
      prober: http
      timeout: 15s  # raised from 5s to 15s to avoid abnormal results caused by request timeouts
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        follow_redirects: true
        preferred_ip_protocol: "ip4"
    icmp:  # icmp, tcp_connect, and http_post_2xx are newly added modules, used for ping, port, and HTTP POST probes respectively
      prober: icmp
      timeout: 5s
    tcp_connect:
      prober: tcp
      timeout: 5s
    http_post_2xx:
      prober: http
      timeout: 15s  # raised from 5s to 15s to avoid abnormal results caused by request timeouts
      http:
        method: POST
hostAliases: []
# vim templates/service.yaml
spec:
  type: {{ .Values.service.type }}
  ports:
    - port: {{ .Values.service.port }}
      nodePort: {{ .Values.service.nodePort }}  # add nodePort so the blackbox web page can be opened in a browser later
1.4 Installation
# helm upgrade -i prometheus-blackbox-exporter -n monitoring .
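1.5 Cross-subnet access
If your company runs several k8s clusters on different subnets that cannot reach each other, blackbox cannot probe across them. There are two solutions:
- Option 1: connect the clusters' networks to each other. This involves network-level work that is not covered here.
- Option 2: install openvpn-client in the current cluster and connect it to the other clusters, so that the networks can reach each other.
1.6 Connecting Prometheus to blackbox-exporter
Add the following two parts to the Prometheus ConfigMap of the primary n9e:
- 1. Define the job_name entries for the http, tcp, ping, and ssl probes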
# vim nightingle/templates/prometheus/configmap.yaml
# HTTP GET monitoring
- job_name: "blackbox_http_get_status"
  scrape_interval: 15s
  metrics_path: /probe
  params:
    module: [http_2xx]
  file_sd_configs:
    - refresh_interval: 1m
      files:
        - "http_get_url.yml"  # the HTTP URLs to probe are read from this yml file
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: prometheus-blackbox-exporter:9115  # must be the blackbox-exporter service address and port, otherwise requests are not forwarded to blackbox
# SSL certificate expiry monitoring
- job_name: "blackbox_ssl_expiry_status"
  scrape_interval: 15s
  metrics_path: /probe
  params:
    module: [http_2xx]
  file_sd_configs:
    - refresh_interval: 1m
      files:
        - "ssl_expiry_url.yml"
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: prometheus-blackbox-exporter:9115
# Port monitoring
- job_name: "blackbox_port_status"
  scrape_interval: 15s
  metrics_path: /probe
  params:
    module: [tcp_connect]
  file_sd_configs:
    - refresh_interval: 1m
      files:
        - "tcp.yml"
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: prometheus-blackbox-exporter:9115
# Ping monitoring
- job_name: "blackbox_ping_status"
  scrape_interval: 15s
  metrics_path: /probe
  params:
    module: [icmp]
  file_sd_configs:
    - refresh_interval: 1m
      files:
        - "ping.yml"
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: prometheus-blackbox-exporter:9115
- 2. Provide the yml target files for the http, tcp, ping, and ssl probes
# vim nightingle/templates/prometheus/configmap.yaml
data:
  http_get_url.yml: |  # file name, referenced by "files" in the job_name definitions above
    ################### Business services #################
    - targets:
        - 'http://www.example.com'  # URL for the HTTP GET probe
      labels:  # custom labels attached to the PromQL query results; define them to fit your needs and scenario
        externalip: '1.1.1.1'
        internalip: '172.20.0.1'
        service: "example"
    ################### Ops services #################
    - targets:
        - 'http://jenkins.example.com/login?from=%2F'
      labels:
        externalip: '1.1.1.2'
        internalip: '172.20.0.2'
        service: "jenkins"
    ################### Monitoring #################
    - targets:
        - 'http://n9e.example.com.cn'
      labels:
        externalip: '1.1.1.3'
        internalip: '172.20.0.3'
        service: "n9e"
        hostname: "k8s-ops-1"
  ssl_expiry_url.yml: |
    ################### ssl-expiry #################
    - targets:
        - 'https://www.example.com'
      labels:
        externalip: '1.1.1.1'
        internalip: '172.20.0.1'
        service: "example"
  ping.yml: |
    ################### China Mobile Cloud #################
    - targets:
        - '172.20.0.3'
      labels:
        internalip: '172.20.0.3'
        externalip: '1.1.1.3'
        hostname: 'k8s-ops-1'
  tcp.yml: |
    ################### Monitoring service ports #################
    - targets:
        - '172.20.0.3:30007'
      labels:
        internalip: '172.20.0.2'
        hostname: 'k8s-ops-1'
        service: "n9e"
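Once the configuration is done, update the n9e release again, then restart the nightingale-prometheus-0 pod so that it reloads prometheus.yaml from the ConfigMap. After that, check the result on the Prometheus targets page (as shown in the figure below).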
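To make the three relabel steps concrete, here is a small shell sketch (not Prometheus code) that walks one file_sd target through them and prints the URL Prometheus should end up requesting. The function names urlencode and probe_url are made up for illustration, the percent-encoding handles ASCII only, and the service address is the one used in the replacement line above.

```shell
#!/bin/sh
# Percent-encode a string, keeping only letters, digits, and - _ . ~ (ASCII input only).
urlencode() {
    s=$1
    out=""
    while [ -n "$s" ]; do
        rest=${s#?}        # the string without its first character
        c=${s%"$rest"}     # the first character
        case $c in
            [A-Za-z0-9._~-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

# Mimic the relabel_configs for one target read from a file_sd file.
probe_url() {
    address=$1                                    # __address__ as listed under "targets"
    module=$2                                     # params.module of the job
    param_target=$address                         # step 1: __address__ -> __param_target
    instance=$param_target                        # step 2: __param_target -> instance label
    address="prometheus-blackbox-exporter:9115"   # step 3: __address__ now points at the exporter
    printf 'instance=%s\n' "$instance"
    printf 'url=http://%s/probe?module=%s&target=%s\n' \
        "$address" "$module" "$(urlencode "$param_target")"
}

probe_url 'http://www.example.com' http_2xx
```

The point of the sketch is that the probed address never receives the scrape directly: it travels as the target query parameter, while the request itself goes to the exporter service.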
1.7 Verifying in blackbox
Open IP:30015 in a browser. The result looks like this:
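Besides the browser, a single probe can also be checked from the command line through the same /probe endpoint that the scrape jobs use. The helper below is only a sketch: probe_success_value is a hypothetical name, and IP stands for the node address as in the line above.

```shell
#!/bin/sh
# Read the text returned by /probe on stdin and print the value of probe_success.
# Exits non-zero if the metric is not present in the input.
probe_success_value() {
    awk '$1 == "probe_success" { print $2; found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage against a running exporter (replace IP with a node address):
#   curl -s "http://IP:30015/probe?module=http_2xx&target=http://www.example.com" | probe_success_value
```

A printed 1 means the probe succeeded and 0 means it failed, which is the same value the alert rules later evaluate.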
1.8 Dashboards
- Grafana dashboard ID: 9965
- Official dashboard page: grafana.com/grafana/das…
1.9 Alerting
- http-get probe
  Name: http-get probe
  PromQL: avg_over_time(probe_success{job="blackbox_http_get_status"}[5m]) < 0.5
  alert_type=probe
  level: 2
- ssl-expiry probe
  Name: ssl-expiry probe
  Note: the SSL certificate is about to expire
  PromQL: (probe_ssl_earliest_cert_expiry{job="blackbox_ssl_expiry_status"}-time())/3600/24 < 15
  alert_type=probe
  level: 2
- ping probe
  Name: ping probe
  PromQL: probe_success{job="blackbox_ping_status"} == 0
  alert_type=probe
  level: 2
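The ssl-expiry expression is plain arithmetic on Unix timestamps: the certificate expiry time minus the current time, converted from seconds to days. A minimal sketch of the same calculation follows; days_left is a hypothetical helper and the two timestamps are made-up sample values chosen to be exactly 14 days apart.

```shell
#!/bin/sh
# Days until expiry, computed as (expiry - now) / 3600 / 24 like the alert expression.
days_left() {
    expiry=$1   # stands in for probe_ssl_earliest_cert_expiry (Unix seconds)
    now=$2      # stands in for time() (Unix seconds)
    awk -v e="$expiry" -v n="$now" 'BEGIN { printf "%.2f\n", (e - n) / 3600 / 24 }'
}

# Sample values: the certificate expires 14 days after "now", so the rule (< 15) would fire.
days_left 1701209600 1700000000   # prints 14.00
```

The http-get rule reads similarly: avg_over_time over 5m is the fraction of successful probes in that window, so it fires once fewer than half of them succeeded.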
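2 MySQL data backup
n9e stores its alert rules, dashboard templates, user information, and other data in MySQL (as shown in the figure below), so the MySQL data must be backed up. (If the database is hosted by a cloud provider, backups are usually already in place, which is reassuring.)
2.1 Installing MySQL server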
# mv /etc/apt/sources.list /etc/apt/sources.list.bak
# vim /etc/apt/sources.list
deb http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse
# apt-get update
# apt-get install mysql-server
# systemctl status mysql.service
2.2 MySQL backup script
#!/bin/bash
# Dump each n9e database with mysqldump, then compress the dump and record its md5 checksum.
# The script uses bash syntax (function, [[ ]], echo -e), so it must be run with bash.
backupDir="/data/backup/n9e-data"
backupLog="${backupDir}/backup.log"
backupScriptDir="/usr/local/monitoring/"
user="root"
password="123456"
host="172.20.0.2"
port="30006"
lock="--single-transaction"
mkdir -p "${backupDir}"  # the dumps and the log are written here, so it must exist
function backup_sql(){
    timeNow=$(date +%Y%m%d-%H%M%S)
    dbname=$1
    backupName="${dbname}-${timeNow}.sql"
    # --column-statistics=0 avoids the error raised when a newer mysqldump backs up an older MySQL server
    /usr/bin/mysqldump -h"${host}" -P"${port}" -u"${user}" -p"${password}" $lock --default-character-set=utf8 --flush-logs --column-statistics=0 -R "$dbname" > "${backupDir}/${backupName}"
    if [[ $? == 0 ]];then
        cd "$backupDir" || return 1
        tar zcf "$backupName.tar.gz" "$backupName"
        md5sum "$backupName.tar.gz" > "$backupName.tar.gz.md5"
        size=$(du -sh "$backupName.tar.gz" | awk '{print $1}')
        rm -f "${backupDir}/${backupName}"
        echo "Info: [$(date +%Y%m%d-%H%M%S)] backup [$dbname]($size) successful !" >> "${backupLog}"
    else
        cd "$backupDir" || return 1
        #rm -rf $backupName
        echo "Error: [$(date +%Y%m%d-%H%M%S)] backup [$dbname] fail !" >> "${backupLog}"
    fi
}
echo -e "\n===================" >> "${backupLog}"
echo -e "$(date '+%Y-%m-%d %H:%M:%S')" >> "${backupLog}"
echo -e "===================" >> "${backupLog}"
backup_sql "n9e_v5"
backup_sql "ibex"
echo "Info: [$(date +%Y%m%d-%H%M%S)] backup finished. log path: ${backupLog}, file path: ${backupDir}"
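Since the script writes an .md5 file next to every archive, a backup can be checked before it is ever needed for a restore. The following is a sketch only: verify_backup is a hypothetical helper, and it assumes the archive and its .md5 file sit in the same directory, as the script above leaves them.

```shell
#!/bin/sh
# Verify one archive produced by the backup script: checksum first, then archive readability.
verify_backup() {
    archive=$1                       # full path to a <db>-<timestamp>.sql.tar.gz file
    dir=$(dirname "$archive")
    name=$(basename "$archive")
    # The .md5 file stores a relative file name, so md5sum -c has to run inside the backup directory.
    ( cd "$dir" && md5sum -c "$name.md5" >/dev/null 2>&1 ) || { echo "checksum mismatch: $name"; return 1; }
    # Listing the archive catches truncated or corrupted tarballs.
    tar tzf "$archive" >/dev/null 2>&1 || { echo "unreadable archive: $name"; return 1; }
    echo "ok: $name"
}

# Usage (the file name is a placeholder):
#   verify_backup /data/backup/n9e-data/<db>-<timestamp>.sql.tar.gz
```

A verified archive can then be unpacked with tar xzf and the contained .sql file loaded back with the mysql client.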
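2.3 Scheduled job
Cron entry (the script relies on bash-only syntax such as function and [[ ]], and on Ubuntu /bin/sh is dash, so it is run with bash):
# crontab -l
0 */12 * * * /bin/bash /usr/local/monitoring/n9e-database-backup.sh
Manual test run:
# /bin/bash /usr/local/monitoring/n9e-database-backup.sh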
Info: [20230727-164245] backup finished. log path: /data/backup/n9e-data/backup.log, file path: /data/backup/n9e-data
# cat /data/backup/n9e-data/backup.log
===================
2023-07-27 16:41:28
===================
Info: [20230727-164135] backup [n9e_v5](2.4M) successful !
Info: [20230727-164142] backup [ibex](20K) successful !
===================
2023-07-27 16:42:32
===================
Info: [20230727-164239] backup [n9e_v5](2.4M) successful !
Info: [20230727-164245] backup [ibex](20K) successful !
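That wraps up this chapter, and with it most of the core n9e content. Later I will also share my experience using n9e, along with my understanding of it, optimization ideas, JSON templates, and more~
Thanks a lot for following along. If you have any questions, leave a comment and I will reply as soon as I can. Thanks again!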