Prometheus
- Install prometheus-operator
```shell
wget https://github.com/prometheus-operator/prometheus-operator/releases/download/v0.64.0/bundle.yaml
kubectl create -f bundle.yaml
```
- Create a sample application
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-app
        image: fabxc/instrumented_app
        ports:
        - name: web
          containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: example-app
  name: example-app
spec:
  ports:
  - name: 8080-8080
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: example-app
  type: NodePort
```
- Create the Service and Pod monitoring objects
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  labels:
    team: frontend
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: 8080-8080
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-app
  labels:
    team: frontend
spec:
  selector:
    matchLabels:
      app: example-app
  podMetricsEndpoints:
  - port: web
```
- Deploy Prometheus and monitor the sample application's Service and Pods
If RBAC authorization is enabled on the Kubernetes cluster, you must create the RBAC rules first and have a ServiceAccount ready for Prometheus.
Next, create the ServiceAccount together with the required ClusterRole and ClusterRoleBinding:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
```
- Create a Prometheus object that selects the ServiceMonitor and PodMonitor with the specified labels, then expose Prometheus
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  replicas: 3
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  podMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false  # set to true to enable the admin API
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus
```
Alertmanager
For alerting, the Prometheus Operator introduces two resource objects:
- The Alertmanager resource, which lets users declaratively describe an Alertmanager cluster.
- The AlertmanagerConfig resource, which lets users declaratively describe an Alertmanager configuration.
First prepare the Alertmanager configuration by creating an AlertmanagerConfig resource. Then deploy a three-replica Alertmanager cluster that uses this configuration, and finally expose Alertmanager:
```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: config-example
  labels:
    alertmanagerConfig: example
spec:
  route:
    groupBy: ['...']
    groupWait: 1s
    groupInterval: 1s
    repeatInterval: 1000d
    receiver: 'webhook'
  receivers:
  - name: 'webhook'
    webhookConfigs:
    - url: 'http://192.168.11.254:5001/webhook'  # received alerts are forwarded to this API; see the webhook.py code below
      sendResolved: true
---
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example
spec:
  replicas: 3
  alertmanagerConfiguration:  # this is the global configuration
    name: config-example
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-example
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30903
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: example
```
Python code that receives the alert messages:
```python
import json

from flask import Flask, request

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def send():
    try:
        # Parse and log the Alertmanager webhook payload
        data = json.loads(request.data)
        print(data)
    except Exception as e:
        print(e)
    return 'finish ok ...'

if __name__ == '__main__':
    app.run(debug=False, host='0.0.0.0', port=5001)
```
The received alerts are pushed to a separate API so they can be processed further and then forwarded elsewhere, for example to DingTalk or email.
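As a sketch of that further processing, the handler could first flatten the webhook payload into a plain-text message before pushing it on. The payload shape (`alerts`, `labels`, `annotations`, `status`) follows Alertmanager's webhook format; `format_alerts` itself is a hypothetical helper, not part of the setup above:

```python
def format_alerts(payload: dict) -> str:
    # Hypothetical helper: turn an Alertmanager webhook payload into a
    # plain-text message suitable for DingTalk, email, etc.
    lines = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        status = alert.get("status", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(f"[{status}] {name} {summary}".strip())
    return "\n".join(lines)

if __name__ == "__main__":
    # Minimal example payload in Alertmanager's webhook format
    example = {
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "ExampleAlert"},
                "annotations": {"summary": "always fires"},
            }
        ],
    }
    print(format_alerts(example))  # [firing] ExampleAlert always fires
```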
Integrating Prometheus with Alertmanager
Take the YAML file used earlier to create the Prometheus object and modify it:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  replicas: 3
  alerting:  # the main change: this is where Alertmanager is integrated
    alertmanagers:
    - namespace: default
      name: alertmanager-example
      port: web
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  podMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus
```
After the integration, the Prometheus page at Status > Runtime & Build Information should show that three Alertmanager instances have been discovered.
If they are not discovered, check that the configuration is correct.
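The same check can be scripted against Prometheus's HTTP API: `GET /api/v1/alertmanagers` lists the discovered instances. A minimal sketch that parses such a response (the sample URLs are made-up pod addresses; in a real check you would fetch the JSON from the NodePort Service above with urllib or requests):

```python
import json

# Sample response shape of GET /api/v1/alertmanagers on the Prometheus server
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeAlertmanagers": [
      {"url": "http://10.244.0.10:9093/api/v2/alerts"},
      {"url": "http://10.244.0.11:9093/api/v2/alerts"},
      {"url": "http://10.244.0.12:9093/api/v2/alerts"}
    ],
    "droppedAlertmanagers": []
  }
}
""")

def active_alertmanagers(resp: dict) -> list:
    # Return the URLs of the Alertmanager instances Prometheus is sending to
    return [am["url"] for am in resp["data"]["activeAlertmanagers"]]

print(len(active_alertmanagers(sample)))  # expect 3 with the setup above
```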
Rule
Rule discovery mechanism:
- By default, a Prometheus resource only discovers rules in its own namespace.
- By default, if the spec.ruleSelector field is nil, no rules are matched at all.
- To discover rules in all namespaces, pass an empty dict to the ruleNamespaceSelector field, e.g. ruleNamespaceSelector: {}.
- To discover rules from all namespaces matching specific labels, use matchLabels.
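For example, to restrict rule discovery to namespaces carrying a particular label, the Prometheus spec could contain a fragment like the following (the label `team: frontend` is just an illustrative choice):

```yaml
# Fragment of a Prometheus spec: discover PrometheusRules only in
# namespaces labelled team=frontend (example label)
ruleNamespaceSelector:
  matchLabels:
    team: frontend
```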
- Create an alerting rule
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: example
    role: alert-rules
  name: prometheus-example-rules
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)
```
For demonstration purposes, this rule always fires, which makes it easy to verify that everything works.
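A more realistic rule would only fire once a condition has held for some time. As a sketch, the following rule (names and thresholds are illustrative, not from the setup above) alerts when any scrape target has been down for five minutes:

```yaml
# Example rule: fire when a scrape target has been down for 5 minutes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: example
    role: alert-rules
  name: prometheus-target-down-rules
spec:
  groups:
  - name: target.rules
    rules:
    - alert: TargetDown
      expr: up == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Target {{ $labels.instance }} is down"
```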
- Deploy the Prometheus rules
The Prometheus and Alertmanager YAML was already integrated earlier; continue building on that YAML.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  replicas: 3
  alerting:
    alertmanagers:
    - namespace: default
      name: alertmanager-example
      port: web
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  podMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false
  ruleSelector:
    matchLabels:
      role: alert-rules
      prometheus: example
  ruleNamespaceSelector: {}  # namespaces to discover PrometheusRules in; if unset, only the Prometheus object's own namespace is used
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30900
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus
```
The Prometheus page at Status > Rules should now show this example rule.
The Alertmanager page should also show the alert, which means the alerting component has received the alert sent by Prometheus, although the timezone looks slightly off.
End-to-end test
Test steps:
- Modify or add an alerting rule
- Check that Prometheus discovers the new rule
- After the alert fires, check that Alertmanager receives it
- After the alert fires, check that Alertmanager pushes it to the external API
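To exercise the last two steps without waiting for a real rule to fire, a synthetic alert can be POSTed straight to Alertmanager's v2 API. A sketch (the node IP placeholder and NodePort 30903 come from the Service above; the alert name is arbitrary):

```python
import json
from datetime import datetime, timedelta, timezone

def build_test_alert(name: str = "SyntheticTest") -> list:
    # POST /api/v2/alerts accepts a JSON list of alerts in this shape
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": "warning"},
        "annotations": {"summary": "manual end-to-end test"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]

if __name__ == "__main__":
    payload = build_test_alert()
    print(json.dumps(payload, indent=2))
    # To actually send it to the Alertmanager NodePort Service:
    # import urllib.request
    # req = urllib.request.Request(
    #     "http://<node-ip>:30903/api/v2/alerts",
    #     data=json.dumps(payload).encode(),
    #     headers={"Content-Type": "application/json"},
    # )
    # urllib.request.urlopen(req)
```

If the pipeline is healthy, the alert should appear in the Alertmanager UI and then be forwarded to the webhook receiver within the configured group interval.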
Test result screenshots: