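SLS Full-Stack Monitoring Data Analysis - Alibaba Cloud Developer Community

This article uses host monitoring, database monitoring, and application monitoring to walk through monitoring from the infrastructure layer up to the business layer. Each example uses a different approach to data collection and visualization, to give a full picture of the time-series monitoring capabilities that SLS provides.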

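Monitoring Layers

When implementing monitoring, server-side monitoring includes at least the following layers:

In the cloud era, most users no longer need to care about infrastructure and networking, so we focus mainly on monitoring the operating system, databases and middleware, and applications and business logic.
SLS embraces open source, so you can build flexible solutions on top of mature monitoring software such as Prometheus, telegraf, and Grafana.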

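Flexible Data Collection Options

For example, Prometheus supports a large number of exporters and is standard in Kubernetes, so one option is to expose data with a Prometheus exporter, scrape it with Prometheus, and write it to the Metric Store through the remote write protocol (see the documentation: Collecting Prometheus monitoring data, under Data Ingestion in the time-series storage section of the Alibaba Cloud Log Service docs). The Java application monitoring example below demonstrates this approach.
telegraf likewise supports many collection plugins, so another option is to collect with telegraf and write to the Metric Store through the influxdb protocol. The MySQL monitoring example below demonstrates this approach.
SLS's own logtail can also collect data, so a third option is to collect with logtail directly, as in the host monitoring we provide (see the documentation: Collecting host monitoring data, under Data Ingestion in the time-series storage section of the Alibaba Cloud Log Service docs).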

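Query Basics

Before showing concrete examples, let's first cover a little query syntax as a foundation.
The SLS Metric Store supports queries that combine SQL and PromQL: you run a query through a PromQL function, and that query can then be nested as a subquery inside full SQL syntax: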

SELECT 
    promql_query_range('up', '1m') 
FROM metrics;
SELECT 
    sum(value) 
FROM 
    (SELECT promql_query_range('up', '1m') 
    FROM metrics);

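The first argument of promql_query_range is the PromQL expression, and the second is the step, that is, the time granularity.
On the Metric Store query page, you can pick a metric from the Metrics drop-down list. The simplest query is generated automatically, and clicking Preview shows the chart: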
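
The nesting described above can be sketched in Python as plain string construction. This is only an illustration: the helper names range_query and wrap_sum are hypothetical, the quote-doubling escape assumes standard SQL string literals, and the sketch only builds query text without calling any SLS API.

```python
def range_query(promql: str, step: str = "1m") -> str:
    """Build the inner query: a PromQL expression evaluated at a given step."""
    # Assumption: single quotes inside a SQL string literal are escaped by doubling.
    escaped = promql.replace("'", "''")
    return f"SELECT promql_query_range('{escaped}', '{step}') FROM metrics"


def wrap_sum(inner_sql: str) -> str:
    """Nest the PromQL query as a subquery and aggregate its value column."""
    return f"SELECT sum(value) FROM ({inner_sql})"


inner = range_query("up", "1m")
print(inner)
print(wrap_sum(inner))
```

The two printed statements correspond to the two queries shown earlier, written on one line each.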

PromQL Syntax Primer

Example:

avg(go_gc_duration_seconds{endpoint = "http-metrics"}) by (instance)

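For the complete PromQL syntax, see the official Prometheus documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/
SLS supports the main Prometheus functions. For the full list, see the documentation: Introduction to time-series data query and analysis, under Query and Analysis in the time-series storage section of the Alibaba Cloud Log Service docs.
Syntax descriptions alone can be a bit dry, so let's move on to hands-on practice.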
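
To make the example query more concrete, the following Python sketch imitates what avg(...) by (instance) with an exact label matcher computes at a single point in time. The sample data is made up for illustration, avg_by is a hypothetical helper, and this is not how Prometheus or SLS evaluates queries internally.

```python
from collections import defaultdict

# Hypothetical samples for go_gc_duration_seconds: (labels, value) pairs.
samples = [
    ({"endpoint": "http-metrics", "instance": "10.0.0.1:9090", "quantile": "0.5"}, 0.25),
    ({"endpoint": "http-metrics", "instance": "10.0.0.1:9090", "quantile": "1"}, 0.75),
    ({"endpoint": "http-metrics", "instance": "10.0.0.2:9090", "quantile": "0.5"}, 0.125),
    ({"endpoint": "other", "instance": "10.0.0.3:9090", "quantile": "0.5"}, 2.0),
]


def avg_by(samples, matchers, by_label):
    """Keep series whose labels equal the matchers, then average per group."""
    groups = defaultdict(list)
    for labels, value in samples:
        if all(labels.get(key) == wanted for key, wanted in matchers.items()):
            # A series without the grouping label falls into the "" group.
            groups[labels.get(by_label, "")].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


result = avg_by(samples, {"endpoint": "http-metrics"}, "instance")
print(result)
```

The series with endpoint "other" is filtered out by the matcher, and the remaining series collapse to one averaged value per instance.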

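Host Monitoring

For host monitoring we use logtail to collect operating system metrics and write them directly to the Metric Store. We also provide a built-in dashboard for visualization. The data flow is as follows: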

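Steps

Create a new logtail configuration:

Select host monitoring

Select a machine group

Confirm the plugin configuration

IntervalMs is the collection interval, with a default of 30s. You can keep the default and click Next to finish.
Once the configuration is created, the host monitoring dashboard appears in the dashboard list on the left.

Wait 1-2 minutes and data will start to appear.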

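Database Monitoring

Database monitoring uses telegraf for collection and sends the data through the http receiver plugin supported by logtail. logtail then writes the data to the Metric Store.

telegraf writes to logtail using the influxdb protocol, so on the telegraf side you simply configure an influxdb output.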
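
To show what "the influxdb protocol" looks like on the wire, here is a minimal Python sketch that serializes one metric in InfluxDB line protocol, the text format that telegraf's influxdb output sends over HTTP. The helper names, tag values, and field values are hypothetical and chosen for illustration; the sketch only builds the string and does not send anything to logtail.

```python
def escape_tag(text: str) -> str:
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value) -> str:
    """Render a field value: booleans, integers (i suffix), floats, or quoted strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    # Tags are sorted by key so the output is stable.
    tag_part = "".join(f",{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{escape_tag(k)}={format_field(v)}" for k, v in fields.items())
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


line = to_line(
    "mysql",
    {"server": "127.0.0.1:3306", "host": "db1"},
    {"threads_connected": 12, "uptime": 3600},
    1600000000000000000,
)
print(line)
```

Each such line carries a measurement name, a set of tags, one or more fields, and a nanosecond timestamp.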
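First, create a logtail configuration to receive the data from telegraf. Create a new logtail configuration and select the custom data plugin:

Select a machine group (following the steps in the host monitoring section), enter a configuration name, and paste in the following: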

{
    "inputs": [
        {
            "detail": {
                "Format": "influx",
                "Address": ":8476"
            },
            "type": "service_http_server"
        }
    ],
    "global": {
        "AlwaysOnline": true,
        "DelayStopSec": 500
    }
}

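Click Next to finish.
Next, configure telegraf. Edit telegraf.conf, which by default is located at /etc/telegraf/telegraf.conf. We recommend backing up the original file, then creating a new one and pasting in the following: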

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"
# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true
  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000
  ## Maximum number of unwritten metrics per output.  Increasing this value
  ## allows for longer periods of output downtime without dropping metrics at the
  ## cost of higher maximum memory usage.
  metric_buffer_limit = 10000
  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"
  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"
  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""
  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  # logfile_rotation_max_archives = 5
  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false
      
###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################
# Configuration for sending metrics to Logtail's InfluxDB receiver
[[outputs.influxdb]]
  ## The full HTTP Logtail listen address
  urls = ["http://127.0.0.1:8476"]
  ## Always be true
  skip_database_creation = true

Then create a mysql.conf file in the telegraf.d directory and paste in the following:

[[inputs.mysql]]
  ## specify servers via a url matching:
  ##  [username[:password]@][protocol[(address)]]/[?tls=[true|false|skip-verify|custom]]
  ##  see https://github.com/go-sql-driver/mysql#dsn-data-source-name
  ##  e.g.
  ##    servers = ["user:passwd@tcp(127.0.0.1:3306)/?tls=false"]
  ##    servers = ["user@tcp(127.0.0.1:3306)/?tls=false"]
  #
  ## If no servers are specified, then localhost is used as the host.
  servers = ["user:passwd@tcp(127.0.0.1:3306)/?tls=false"]
  metric_version = 2
  ## if the list is empty, then metrics are gathered from all database tables
  table_schema_databases = []
  ## gather metrics from INFORMATION_SCHEMA.TABLES for databases provided above list
  gather_table_schema = false
  ## gather thread state counts from INFORMATION_SCHEMA.PROCESSLIST
  gather_process_list = false
  ## gather user statistics from INFORMATION_SCHEMA.USER_STATISTICS
  gather_user_statistics = false
  ## gather auto_increment columns and max values from information schema
  gather_info_schema_auto_inc = false
  ## gather metrics from INFORMATION_SCHEMA.INNODB_METRICS
  gather_innodb_metrics = true
  ## gather metrics from SHOW SLAVE STATUS command output
  gather_slave_status = false
  ## gather metrics from SHOW BINARY LOGS command output
  gather_binary_logs = false
  ## gather metrics from SHOW GLOBAL VARIABLES command output
  gather_global_variables = true
  ## gather metrics from PERFORMANCE_SCHEMA.TABLE_IO_WAITS_SUMMARY_BY_TABLE
  gather_table_io_waits = false
  ## gather metrics from PERFORMANCE_SCHEMA.TABLE_LOCK_WAITS
  gather_table_lock_waits = false
  ## gather metrics from PERFORMANCE_SCHEMA.TABLE_IO_WAITS_SUMMARY_BY_INDEX_USAGE
  gather_index_io_waits = false
  ## gather metrics from PERFORMANCE_SCHEMA.EVENT_WAITS
  gather_event_waits = false
  ## gather metrics from PERFORMANCE_SCHEMA.FILE_SUMMARY_BY_EVENT_NAME
  gather_file_events_stats = false
  ## gather metrics from PERFORMANCE_SCHEMA.EVENTS_STATEMENTS_SUMMARY_BY_DIGEST
  gather_perf_events_statements = false
  ## the limits for metrics from perf_events_statements
  perf_events_statements_digest_text_limit = 120
  perf_events_statements_limit = 250
  perf_events_statements_time_limit = 86400
  ## Some queries we may want to run less often (such as SHOW GLOBAL VARIABLES)
  ##   example: interval_slow = "30m"
  interval_slow = ""
  ## Optional TLS Config (will be used if tls=custom parameter specified in server uri)
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false omit_hostname = false
[[processors.strings]]
  namepass = ["mysql", "mysql_innodb"]
  [[processors.strings.replace]]
    tag = "server"
    old = "127.0.0.1:3306"
    new = "mysql-dev"
  [[processors.strings.replace]]
    tag = "server"
    old = "192.168.1.98:3306"
    new = "mysql-prod"
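
The effect of the processors.strings section above can be sketched in Python as follows. This is an illustration of the intended behavior only, not telegraf's implementation: rewrite_server_tag is a hypothetical helper, and namepass is treated as a list of exact names here, which is all this configuration needs.

```python
# Mirrors the two [[processors.strings.replace]] entries in the config above.
REPLACEMENTS = [
    ("127.0.0.1:3306", "mysql-dev"),
    ("192.168.1.98:3306", "mysql-prod"),
]
# Mirrors namepass: only these measurements pass through the processor.
NAMEPASS = {"mysql", "mysql_innodb"}


def rewrite_server_tag(measurement: str, tags: dict) -> dict:
    """Return a copy of tags with the server tag rewritten for matching measurements."""
    out = dict(tags)
    if measurement not in NAMEPASS or "server" not in out:
        return out
    for old, new in REPLACEMENTS:
        out["server"] = out["server"].replace(old, new)
    return out


print(rewrite_server_tag("mysql", {"server": "127.0.0.1:3306"}))
print(rewrite_server_tag("cpu", {"server": "127.0.0.1:3306"}))
```

Replacing the address with a readable name such as mysql-dev makes the server label easier to recognize in queries and dashboards.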

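Remember to change the servers field to the connection string for your own MySQL instance. Then reload telegraf to apply the configuration: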

sudo service telegraf reload
# or
sudo systemctl reload telegraf

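Wait 1-2 minutes, refresh the page, and select Metrics to see the data. MySQL monitoring does not yet come with a preset dashboard, so you will need to configure one yourself. SLS plans to provide default dashboard templates for common databases and middleware.

Application Monitoring

For application monitoring we take a Spring Boot application as the example. We expose data with Spring Boot Actuator, scrape it with Prometheus, write it to the Metric Store through the remote write protocol, and then connect Grafana for visualization. The overall data flow is as follows:

First, add two dependencies: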

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>1.1.3</version>
</dependency>

Next, modify the Spring Boot configuration, which by default is at resources/application.yml. Create the file if it does not exist:

server:
  port: 8080
spring:
  application:
    name: spring-demo # change this to your application name
management:
  endpoints:
    web:
      exposure:
        include: 'prometheus' # expose /actuator/prometheus
  metrics:
    tags:
      application: ${spring.application.name} # add an application label to the exposed data

Start the application and open http://localhost:8080/actuator/prometheus. You should see output like the following:

# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{application="spring-demo",area="heap",id="PS Eden Space",} 1.77733632E8
jvm_memory_committed_bytes{application="spring-demo",area="nonheap",id="Metaspace",} 3.6880384E7
jvm_memory_committed_bytes{application="spring-demo",area="heap",id="PS Old Gen",} 1.53092096E8
jvm_memory_committed_bytes{application="spring-demo",area="heap",id="PS Survivor Space",} 1.4680064E7
jvm_memory_committed_bytes{application="spring-demo",area="nonheap",id="Compressed Class Space",} 5160960.0
jvm_memory_committed_bytes{application="spring-demo",area="nonheap",id="Code Cache",} 7798784.0
# HELP jvm_classes_unloaded_classes_total The total number of classes unloaded since the Java virtual machine has started execution
# TYPE jvm_classes_unloaded_classes_total counter
jvm_classes_unloaded_classes_total{application="spring-demo",} 0.0
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
jvm_memory_max_bytes{application="spring-demo",area="nonheap",id="Code Cache",} 2.5165824E8
# HELP jvm_classes_loaded_classes The number of classes that are currently loaded in the Java virtual machine
# TYPE jvm_classes_loaded_classes gauge
jvm_classes_loaded_classes{application="spring-demo",} 7010.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads{application="spring-demo",} 24.0
# HELP jvm_threads_states_threads The current number of threads having NEW state
# (the rest is omitted for length)
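
To make the structure of the output above explicit, the following Python sketch parses one sample line into a metric name, a label set, and a value. It is a simplified illustration rather than a full parser for the Prometheus text format: parse_sample is a hypothetical helper, escape sequences in label values are left as-is, and lines with a trailing timestamp are not handled.

```python
import re

# metric_name{label="value",...} value   (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Parse one sample line into (name, labels, value); return None for comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # HELP/TYPE comments and blank lines carry no sample
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    label_text = match.group("labels") or ""
    # A trailing comma inside the braces, as in the output above, is simply skipped.
    labels = {m.group("key"): m.group("val") for m in LABEL_RE.finditer(label_text)}
    return match.group("name"), labels, float(match.group("value"))


sample = parse_sample(
    'jvm_memory_committed_bytes{application="spring-demo",area="heap",id="PS Eden Space",} 1.77733632E8'
)
print(sample)
```

The application label in every series comes from the management.metrics.tags setting in the Spring Boot configuration.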

Now that the data is exposed, configure Prometheus to scrape it by editing the Prometheus configuration file:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "spring-demo"
    metrics_path: "/actuator/prometheus"
    static_configs:
    - targets: ["localhost:8080"]
remote_write:
  - url: "https://cn-zhangjiakou-share.log.aliyuncs.com/prometheus/sls-metric-store-test-cn-zhangjiakou/test-logstore-1/api/v1/write"
  # - url: "https://sls-zc-test-bj-b.cn-beijing-share.log.aliyuncs.com/prometheus/sls-zc-test-bj-b/prometheus-spring/api/v1/write"
    basic_auth:
      username: ${accessKeyId}
      password: ${accessKeySecret}
    # Configures the queue used to write to remote storage.
    queue_config:
      max_samples_per_send: 2048
      batch_send_deadline: 20s
      min_backoff: 100ms
      max_backoff: 5s
      # max_retries: 10
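
The remote_write url in the configuration above appears to follow a fixed pattern. The Python sketch below rebuilds it from its parts. Note that the roles of the path segments (project, then Metric Store) are an assumption inferred from the example URLs in this article, and remote_write_url is a hypothetical helper; check the SLS documentation for the authoritative format before relying on it.

```python
def remote_write_url(host: str, project: str, metric_store: str) -> str:
    """Assemble a remote write URL in the shape used by the config above.

    Assumption: the two path segments after /prometheus/ are the SLS project
    and the Metric Store name, as inferred from the article's example URLs.
    """
    return f"https://{host}/prometheus/{project}/{metric_store}/api/v1/write"


url = remote_write_url(
    "cn-zhangjiakou-share.log.aliyuncs.com",
    "sls-metric-store-test-cn-zhangjiakou",
    "test-logstore-1",
)
print(url)
```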

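Here scrape_configs scrapes our application data, and remote_write writes the data to the Metric Store. Be sure to replace the username and password under basic_auth with your own accessKeyId and accessKeySecret.
After configuring, restart Prometheus. You can open http://${prometheus_domain}/graph and select a metric to check whether scraping succeeded.
Next, configure Grafana for visualization. First, add the Metric Store as a Grafana data source:

Once the data source is connected, you can configure the dashboard. We have uploaded a template to grafana.com: SLS JVM监控大盘(via MicroMeter) dashboard for Grafana | Grafana Labs. Import it directly in Grafana: in the left sidebar choose + Import, paste the URL https://grafana.com/grafana/dashboards/12856, select the data source created in the previous step, and click Load.
That completes the configuration. The full dashboard looks like this: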

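Summary

We first introduced how to query time-series data in SLS, and then used three kinds of monitoring (host monitoring, MySQL monitoring, and Spring Boot application monitoring) to show several different ways to ingest and visualize data. You can pick whichever approach is easiest to adopt in your own environment. Once the data is stored in SLS, you can analyze and explore it with the SQL and PromQL syntax that SLS provides. We hope you enjoy using it! If you have any questions, you can submit a ticket or give feedback in the user group (see the DingTalk QR code below). You are also welcome to follow our WeChat official account, where we share practical tips and best practices.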
