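Preface
Terraform is an open-source tool from HashiCorp for automating the orchestration of IT infrastructure, summed up by its slogan "Write, Plan, and Create Infrastructure as Code". Its command-line interface (CLI) provides a simple mechanism for deploying configuration files to Alibaba Cloud or any other supported cloud and for keeping them under version control.
SLS Alerting is a one-stop intelligent O&M platform for alert monitoring, noise reduction, incident management, and notification dispatch. It comprises modules for log and time-series storage, alert monitoring, alert management, and notification management. A platform this capable naturally calls for automated configuration, so this article shows how a small amount of Terraform configuration is enough to set up alerting without using the console.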
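Installing and Configuring Terraform
To install and configure Terraform, see Alibaba Cloud's official Terraform documentation (linked under References). The Terraform CLI is also preinstalled in Cloud Shell.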
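SLS Alerting Resources
SLS Alerting involves three kinds of operations:
- Initializing alert resources
- Managing alert monitoring rules
- Managing alert policies and alert resource data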
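Initializing Alert Resources
- Initialize alert resources
  - Central Project: named sls-alert-{uid}-{region}, where uid is the ID of the Alibaba Cloud main account and region is the region that the user chooses for the central Project.
  - Central Logstore: named internal-alert-center-log. It belongs to the central Project, is free of charge, and stores the execution history and diagnostic information that alerts produce while they run.
  - Built-in alert dashboards: Global Alert Troubleshooting Center, Global Alert Pipeline Center, Global Alert Rule Center, and Open Alert Center.
  - Each Alibaba Cloud main account needs to be initialized only once; repeated calls are idempotent.
- Initialize Project alert resources
  - An alert monitoring rule must belong to an SLS Project. Before the first rule is created in a Project, the alert resources of that Project must be initialized.
  - Alert history Logstore: named internal-alert-history. It is free of charge and stores the evaluation history of every alert rule in the current Project, including the status of each evaluation and whether an alert was triggered.
  - Built-in alert history dashboard: named internal-alert-analysis. It shows statistics such as the execution success rate of the alert monitoring rules.
  - Each Project needs to be initialized only once; repeated calls are idempotent.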
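Managing Alert Monitoring Rules
An alert monitoring rule defines how data sources such as time series and logs are monitored. Its parameters cover joint monitoring across several queries, group evaluation, trigger conditions, severity, no-data alerts, and alert recovery.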
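Managing Alert Resource Data
In SLS Alerting, once a monitoring rule fires, the resulting alert is sent to the configured alert policy. The alert policy applies noise reduction such as merging, silencing, and inhibition, and then forwards the alert to the specified action policy, which can be thought of simply as a notification channel.
Notification channels include SMS, voice call, email, Webhook, DingTalk, WeChat, Lark (Feishu), Function Compute, and EventBridge. Using them involves managing users, user groups, and Webhooks.
Alert policies, action policies, users, user groups, Webhooks, and similar objects are collectively called alert resource data in SLS.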
Managing SLS Alerting with Terraform
Configure Credentials and the Central Region for Alerting
export ALICLOUD_ACCESS_KEY="LTAIUrZCw3********"
export ALICLOUD_SECRET_KEY="zfwwWAMWIAiooj14GQ2*************"
export ALICLOUD_REGION="cn-heyuan"
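The exported variables above are picked up by the alicloud provider at run time. A minimal sketch of a matching provider declaration follows. It assumes the provider is installed from the `aliyun/alicloud` registry namespace (the one used in the links under References); the explicit `region` argument simply repeats the value exported above and is shown only for illustration.

```hcl
terraform {
  required_providers {
    alicloud = {
      # Assumption: registry source taken from the reference link namespace.
      source = "aliyun/alicloud"
    }
  }
}

# Credentials come from ALICLOUD_ACCESS_KEY and ALICLOUD_SECRET_KEY,
# so no secrets need to appear in the configuration file.
provider "alicloud" {
  region = "cn-heyuan"
}
```

Keeping the keys in environment variables rather than in the provider block means the configuration file can be committed to version control safely.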
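Initialize Alibaba Cloud Alert Resources
The following configuration creates these resources in the region given by ALICLOUD_REGION:
- Project: named sls-alert-{uid}-{region}
- Logstore: internal-alert-center-log (free of charge)
- Built-in Project dashboards: Global Alert Troubleshooting Center, Global Alert Pipeline Center, Global Alert Rule Center, and Open Alert Center
- For the meaning of each argument, see alicloud_log_alert_resource.
data "alicloud_log_alert_resource" "example" {
  type = "user"
  lang = "cn"
}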
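Initialize Project Alert Resources
The following configuration creates these resources in test-project:
- Logstore: internal-alert-history (free of charge)
- Alert dashboard
- Note that test-project must be located in the region given by ALICLOUD_REGION.
- For the meaning of each argument, see alicloud_log_alert_resource.
data "alicloud_log_alert_resource" "example" {
  type    = "project"
  project = "test-project"
}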
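Both initialization blocks above use the label `example`. If they are placed in the same configuration file, Terraform rejects the second one as a duplicate, so each block needs its own label. A minimal sketch under that assumption follows; the labels `user_init` and `project_init` and the variable `alert_project` are hypothetical names chosen for illustration.

```hcl
variable "alert_project" {
  description = "Project in which alert rules will be created (hypothetical variable name)"
  type        = string
  default     = "test-project"
}

# Account-level initialization: central Project, Logstore, and dashboards.
data "alicloud_log_alert_resource" "user_init" {
  type = "user"
  lang = "cn"
}

# Project-level initialization: alert history Logstore and dashboard.
data "alicloud_log_alert_resource" "project_init" {
  type    = "project"
  project = var.alert_project
}
```

Since both operations are described as idempotent, keeping them in the configuration permanently should be harmless.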
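Create an Alert Rule
The following configuration creates an alert monitoring rule. It covers:
- Alert name, schedule, no-data alerts, and so on
- A query list, which can target Logstores and Metricstores
- Labels, annotations, group evaluation, severity configuration, and so on
- Alert policy and action policy configuration
- For the meaning of each argument, see alicloud_log_alert.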
resource "alicloud_log_alert" "example" {
  version           = "2.0"
  type              = "default"
  project_name      = "test-project"
  alert_name        = "tf-test-alert-2"
  alert_displayname = "tf-test-alert-displayname-2"
  dashboard         = "tf-test-dashboard"
  mute_until        = "1632486684"
  no_data_fire      = "false"
  no_data_severity  = 8
  send_resolved     = true
  schedule_interval = "5m"
  schedule_type     = "FixedRate"

  # First query: counts matching log lines as cnt.
  query_list {
    store          = "tf-test-logstore"
    store_type     = "log"
    project        = "test-project"
    region         = "cn-heyuan"
    chart_title    = "chart_title"
    start          = "-60s"
    end            = "20s"
    query          = "* AND aliyun | select count(1) as cnt"
    time_span_type = "Custom"
  }

  # Second query: counts error lines as error_cnt.
  query_list {
    store          = "tf-test-logstore-5"
    store_type     = "log"
    project        = "test-project"
    region         = "cn-heyuan"
    chart_title    = "chart_title"
    start          = "-60s"
    end            = "20s"
    query          = "error | select count(1) as error_cnt"
    time_span_type = "Custom"
  }

  # How the results of the two queries are combined.
  join_configurations {
    type      = "cross_join"
    condition = ""
  }

  labels {
    key   = "env"
    value = "test"
  }

  labels {
    key   = "env1"
    value = "test1"
  }

  annotations {
    key   = "title"
    value = "alert title-1"
  }

  annotations {
    key   = "desc"
    value = "alert desc"
  }

  annotations {
    key   = "test_key"
    value = "test value"
  }

  group_configuration {
    type   = "custom"
    fields = ["a", "b", "d"]
  }

  # Severity levels with their trigger conditions.
  severity_configurations {
    severity = 8
    eval_condition = {
      condition       = "cnt > 3"
      count_condition = "__count__ > 3"
    }
  }

  severity_configurations {
    severity = 6
    eval_condition = {
      condition       = ""
      count_condition = "__count__ > 0"
    }
  }

  severity_configurations {
    severity = 2
    eval_condition = {
      condition       = ""
      count_condition = ""
    }
  }

  # Alert policy and action policy that handle the triggered alert.
  policy_configuration {
    alert_policy_id  = "sls.builtin.dynamic"
    action_policy_id = "sls_test_action"
    repeat_interval  = "1m"
  }
}
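When several rules share the same labels, the repeated `labels` blocks can be generated from a map with Terraform's `dynamic` block. The sketch below illustrates that technique on a trimmed copy of the rule above. The local value `alert_labels`, the resource label `labeled`, and the alert name are hypothetical, and whether the provider accepts this reduced set of arguments is an assumption; the full rule above remains the reference.

```hcl
locals {
  # Hypothetical shared label set.
  alert_labels = {
    env  = "test"
    env1 = "test1"
  }
}

resource "alicloud_log_alert" "labeled" {
  version           = "2.0"
  type              = "default"
  project_name      = "test-project"
  alert_name        = "tf-test-alert-labeled"
  alert_displayname = "tf-test-alert-labeled"
  schedule_interval = "5m"
  schedule_type     = "FixedRate"

  query_list {
    store          = "tf-test-logstore"
    store_type     = "log"
    project        = "test-project"
    region         = "cn-heyuan"
    chart_title    = "chart_title"
    start          = "-60s"
    end            = "20s"
    query          = "error | select count(1) as error_cnt"
    time_span_type = "Custom"
  }

  # Emits one labels block per entry in local.alert_labels.
  dynamic "labels" {
    for_each = local.alert_labels
    content {
      key   = labels.key
      value = labels.value
    }
  }

  severity_configurations {
    severity = 8
    eval_condition = {
      condition       = "error_cnt > 3"
      count_condition = "__count__ > 0"
    }
  }

  policy_configuration {
    alert_policy_id  = "sls.builtin.dynamic"
    action_policy_id = "sls_test_action"
    repeat_interval  = "1m"
  }
}
```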
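Creating Alert Resources
Alert resources mainly include users, user groups, on-call groups, webhook integrations, alert policies, action policies, content templates, default calendars, and channel quotas. The following uses user creation as an example of the Terraform format; a list of the related resources and their structures follows.
Creating a User
- resource_name is sls.common.user, taken from the resource type table below
- record_id is the user's ID
- tag is the user's name
- value is a JSON string; follow the example structure in the table below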
resource "alicloud_log_resource_record" "user" {
  resource_name = "sls.common.user"
  record_id     = "test_tf_user"
  tag           = "test tf user"
  value         = "{\n\t\"user_name\": \"test tf user\", \n\t\"sms_enabled\": true, \n\t\"phone\": \"18888888889\", \n\t\"voice_enabled\": false, \n\t\"email\": [\n\t\t\"test@qq.com\"\n\t], \n\t\"enabled\": true, \n\t\"user_id\": \"test_tf_user\", \n\t\"country_code\": \"86\"\n}"
}
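The hand-escaped JSON string above is easy to get wrong. As an alternative, Terraform's built-in `jsonencode` function can build the same value from an HCL object. The sketch below carries the same fields as the user record above; the resource label `user_jsonencode` is a hypothetical name, and the block is meant to replace the one above rather than be applied alongside it, since both use the same record_id.

```hcl
resource "alicloud_log_resource_record" "user_jsonencode" {
  resource_name = "sls.common.user"
  record_id     = "test_tf_user"
  tag           = "test tf user"

  # jsonencode serializes the object to a JSON string,
  # so no manual quoting or \n and \t escapes are needed.
  value = jsonencode({
    user_id       = "test_tf_user"
    user_name     = "test tf user"
    email         = ["test@qq.com"]
    country_code  = "86"
    phone         = "18888888889"
    enabled       = true
    sms_enabled   = true
    voice_enabled = false
  })
}
```

The encoded string differs from the hand-written one in whitespace and key order only; the JSON content is the same.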
Related Resources
| Resource type | resource_name | record_id | tag | Example value structure | Notes |
| --- | --- | --- | --- | --- | --- |
| User | sls.common.user | Same as user_id | Same as user_name | { "user_id": "xiaoming", "user_name": "小明", "email": [ "xiaoming@example.com" ], "country_code": "86", "phone": "13334567890", "enabled": true, "sms_enabled": true, "voice_enabled": true } | |
| User group | sls.common.user_group | Same as user_group_id | Same as user_group_name | { "user_group_id": "group-xiaoming", "user_group_name": "分组-小明", "enabled": true, "members": [ "xiaoming" ] } | |
| On-call group | sls.alert.oncall_group | Same as oncall_id | Same as oncall_name | { "oncall_id": "default_oncall", "oncall_name": "default oncall", "enabled": true, "overrides": [], "rotations": [ { "targets": [ { "type": "user", "target_id": "jizhi" }, { "type": "user_group", "target_id": "alert-dev" } ], "end_time": 0, "shift_day": "", "shift_time": "12:00", "shift_type": "day", "start_time": 1633017600, "shift_minute": 0, "end_time_type": "none", "shift_interval": 1, "shift_week_custom": null, "restriction_date_type": "workday", "restriction_time_type": "allday", "restriction_week_range": null, "restriction_time_custom_range": null } ], "calendar_id": "default_calendar" } | |
| Webhook integration | sls.alert.action_webhook | Same as id | Same as name | { "id": "custom-test", "name": "自定义webhook测试", "type": "custom", "url": "http://localhost:9099/data/webhook", "method": "POST", "headers": [ { "key": "Content-Type", "value": "application/json" }, { "key": "Foo", "value": "bar" } ] } | |
| Alert policy | sls.alert.alert_policy | Same as policy_id | Same as policy_name | { "policy_id": "sls.builtin", "policy_name": "内置告警策略", "parent_id": "sls.root", "is_default": false, "group_script": "fire(action_policy=\"sls.builtin\", group={\"project\": \"__a__\", \"uid\": alert.aliuid}, group_wait=\"5s\", group_interval=\"2m\", repeat_interval=\"2m\")\nstop()\nfire(action_policy=\"sls.builtin\", group={\"alert_id\": alert.alert_id}, group_wait=\"5s\", group_interval=\"10s\", repeat_interval=\"2m\")\nif alert.labels.name ~= \"^\\\\w+s$\":\n\tfire(action_policy=\"sls.builtin\", group={\"product\": \"xxs\"}, group_wait=\"5s\", group_interval=\"10s\", repeat_interval=\"2m\")\n\tstop()\nstop()\nfire(action_policy=\"sls.builtin\", group={\"label_name\": alert.labels.name}, group_wait=\"10s\", group_interval=\"10s\", repeat_interval=\"2m\")", "inhibit_script": "if alert.severity >= 8:\n silence alert.severity < 6", "silence_script": "" } | |
| Action policy | sls.alert.action_policy | Same as action_policy_id | Same as action_policy_name | { "action_policy_id": "sls.builtin", "action_policy_name": "默认行动策略", "labels": {}, "is_default": false, "primary_policy_script": "fire(type=\"webhook_integration\", integration_type=\"dingtalk\", webhook_id=\"dingtalk-test\", template_id=\"default-template\", period=\"any\")", "secondary_policy_script": "fire(type=\"voice\", users=[\"jizhi\"], groups=[\"group-jizhi\"], template_id=\"default-template\")", "escalation_start_enabled": false, "escalation_start_timeout": "10s", "escalation_inprogress_enabled": false, "escalation_inprogress_timeout": "10s", "escalation_enabled": false, "escalation_timeout": "4h0m0s" } | |
| Content template | sls.alert.content_template | Same as template_id | Same as template_name | { "template_id": "default-template", "template_name": "默认模板", "is_default": false, "templates": { "fc": { "limit": 0, "locale": "zh-CN", "content": "", "send_type": "merged" }, "sms": { "locale": "zh-CN", "content": "" }, "lark": { "title": "Alerthub告警测试 ${alert_name}", "locale": "zh-CN", "content": "" }, "email": { "locale": "zh-CN", "content": "", "subject": "SLS告警测试-jizhi-test" }, "slack": { "title": "Alerthub告警测试 ${alert_name}", "locale": "zh-CN", "content": "" }, "voice": { "locale": "zh-CN", "content": "" }, "wechat": { "title": "Alerthub告警测试 ${alert_name}", "locale": "zh-CN", "content": "" }, "webhook": { "limit": 0, "locale": "zh-CN", "content": "", "send_type": "merged" }, "dingtalk": { "title": "Alerthub告警测试 ${alert_name}", "locale": "zh-CN", "content": "" }, "event_bridge": { "locale": "zh-CN", "content": "", "subject": "wkb-test" }, "message_center": { "locale": "zh-CN", "content": "" } } } | |
| Default calendar | sls.common.calendar | Same as calendar_id | Same as calendar_name | { "calendar_id": "default_calendar", "calendar_name": "默认日历", "timezone": "Asia/Shanghai", "workdays": [ 1, 2, 3, 4, 5 ], "worktime": [ { "end_time": "21:00", "start_time": "09:00" } ], "reset_days": [], "holiday_sync": "china" } | |
| Channel quota | sls.alert.channel_quota | Same as id | Empty | { "id": "default", "quota_script": "if user in [\"jizhi\"]:\n set_limit(sms=5, voice=5, email=5)\nset_limit(sms=100, voice=100, email=100)" } | |
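The other resource types follow the same pattern as the user record: pick the resource_name, set record_id and tag from the matching fields, and pass the structure as a JSON string. A minimal sketch for a user group follows, using the example values from the user group row. It assumes the resource name is spelled `sls.common.user_group` and that a user with ID `xiaoming` already exists; the resource label `user_group` is a hypothetical name.

```hcl
resource "alicloud_log_resource_record" "user_group" {
  # Assumption: dotted spelling of the user group resource name.
  resource_name = "sls.common.user_group"
  record_id     = "group-xiaoming" # same as user_group_id
  tag           = "分组-小明"          # same as user_group_name

  # jsonencode builds the JSON string from an HCL object.
  value = jsonencode({
    user_group_id   = "group-xiaoming"
    user_group_name = "分组-小明"
    enabled         = true
    members         = ["xiaoming"]
  })
}
```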
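Common Terraform Commands
- Create a terraform.tf file containing the configuration above and save it in the current working directory.
- terraform init: initializes the Terraform working directory
- terraform plan: shows, as a diff, how terraform.tf differs from what has already been applied
- terraform apply: creates or updates the resources defined in terraform.tf
- terraform destroy: destroys the resources
- terraform import: imports existing resources that were created and managed outside Terraform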
References
- Log Service (SLS): https://www.aliyun.com/product/sls
- What is Log Service alerting: https://help.aliyun.com/document_detail/209951.html
- Managing SLS alerting with the SDK: https://developer.aliyun.com/article/789819
- SLS alert resources in Terraform: https://registry.terraform.io/providers/aliyun/alicloud/latest/docs/resources/log_alert
- What is Terraform: https://help.aliyun.com/document_detail/95820.html
- Terraform: https://www.terraform.io/docs
- Installing and configuring Terraform locally: https://help.aliyun.com/document_detail/95825.html