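I. What Is SkyWalking
SkyWalking is an open-source observability platform for collecting, analyzing, aggregating, and visualizing data from services and cloud-native infrastructure. It provides a simple way to keep a clear view of a distributed system. Compared with Zipkin, SkyWalking has several advantages: it uses agent-based bytecode enhancement, so no application code has to change; it communicates over gRPC and performs well; it is implemented as a Java probe; and it supports alerting, JVM monitoring, and global call statistics, with a more capable UI.
II. Setting Up SkyWalking
SkyWalking uses Java agent technology, which is comparable to AOP at the JVM level. When our project runs, the SkyWalking agent intercepts it, collects system data, and displays that data in a visual UI.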
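To make the "AOP at the JVM level" idea concrete, here is a minimal sketch of the standard java.lang.instrument mechanism that any -javaagent builds on. It is not SkyWalking's agent code: the class names ObservingTransformer and DemoAgent and the com/example/ package prefix are hypothetical, and this transformer only records which classes pass through it instead of rewriting their bytecode.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical transformer: the JVM calls transform() for each class it loads.
final class ObservingTransformer implements ClassFileTransformer {
    // Names of application classes seen so far (internal form, e.g. com/example/Foo).
    final List<String> seenClasses = new CopyOnWriteArrayList<>();

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain, byte[] classfileBuffer) {
        // Assumption: the application lives under the com.example package.
        if (className != null && className.startsWith("com/example/")) {
            seenClasses.add(className);
        }
        // Returning null tells the JVM to keep the original bytecode.
        // A tracing agent would return modified bytes here instead.
        return null;
    }
}

// Hypothetical agent entry point, named by the Premain-Class entry of the agent jar's manifest.
final class DemoAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ObservingTransformer());
    }
}
```

Because the hook runs while classes are being loaded, the application itself needs no source changes, which is what "non-intrusive" means here.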
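Installing SkyWalking
Starting with 8.8.0, the Java agent was split out of the main repository, so we need to download two files:
Apache Downloads: the main program
Apache Downloads: the Java agent
Start SkyWalking from apache-skywalking-apm-9.0.0\apache-skywalking-apm-bin\bin (startup.bat on Windows, startup.sh on Linux). The UI port can be changed in webapp.yml in the webapp directory; the default is 8080.
Then open ip:8080 in a browser.
After extracting the agent package, you will find skywalking-agent.jar.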
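III. Integrating with Spring Boot
1. Integrating SkyWalking
Because SkyWalking requires no code changes, we only need to set the following JVM arguments when starting the project: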
# Path to skywalking-agent.jar
-javaagent:E:\java-tools\skywalking\skywalking-agent\skywalking-agent.jar
# Name of the service being started
-Dskywalking.agent.service_name=xiaojie-sso
# Address of the SkyWalking backend
-Dskywalking.collector.backend_service=127.0.0.1:11800
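If the service does not show up in the UI, a quick first check is whether the JVM was actually started with these arguments. The sketch below reads them back through the standard RuntimeMXBean and System.getProperty calls; the class name AgentCheck and its methods are hypothetical helpers, not part of SkyWalking.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Optional;

// Hypothetical troubleshooting helper for the JVM arguments shown above.
final class AgentCheck {
    /** Returns the first -javaagent argument that points at skywalking-agent.jar, if any. */
    static Optional<String> findSkyWalkingAgent(List<String> jvmArgs) {
        return jvmArgs.stream()
                .filter(arg -> arg.startsWith("-javaagent:") && arg.contains("skywalking-agent.jar"))
                .findFirst();
    }

    /** Summarizes the agent-related settings of the JVM this code runs in. */
    static String describeCurrentJvm() {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        String agent = findSkyWalkingAgent(jvmArgs).orElse("(no SkyWalking agent argument found)");
        // -D arguments are visible as ordinary system properties.
        String service = System.getProperty("skywalking.agent.service_name", "(not set)");
        String backend = System.getProperty("skywalking.collector.backend_service", "(not set)");
        return "agent=" + agent + ", service_name=" + service + ", backend_service=" + backend;
    }
}
```

Logging AgentCheck.describeCurrentJvm() once at startup makes a missing or mistyped argument visible without opening the IDE run configuration.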
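2. A First Look at the SkyWalking UI
Once everything is installed, feel free to click around. The UI shows CPU, JVM, garbage collection, database, thread pool, log, and topology information, among much else.
3. Reporting Logs
Maven dependency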
<dependency>
    <groupId>org.apache.skywalking</groupId>
    <artifactId>apm-toolkit-logback-1.x</artifactId>
    <version>8.10.0</version>
</dependency>
logback.xml
<?xml version="1.0" encoding="utf-8" ?>
<!-- The scan attribute (not set here) makes Logback reload this file when it changes; scanPeriod sets how often it checks. -->
<configuration>
    <property name="LOG_FILE_LOCATION" value="./log/"/>
    <property name="CONSOLE_LOG_PATTERN" value="%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %highlight(%-5level) %cyan(%logger{50}) - %highlight(%msg) %n"/>
    <property name="FILE_LOG_PATTERN" value="%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{50} - %msg%n"/>

    <appender name="consoleLog" class="ch.qos.logback.core.ConsoleAppender">
        <layout class="ch.qos.logback.classic.PatternLayout">
            <pattern>${CONSOLE_LOG_PATTERN}</pattern>
        </layout>
    </appender>

    <appender name="fileInfoLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
            <level>ERROR</level>
            <!-- Drop events that match (ERROR); accept everything else -->
            <onMatch>DENY</onMatch>
            <onMismatch>ACCEPT</onMismatch>
        </filter>
        <file>${LOG_FILE_LOCATION}/info.log</file>
        <encoder>
            <pattern>${FILE_LOG_PATTERN}</pattern>
        </encoder>
        <!-- Rolling policy -->
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <!-- File name pattern for rolled log files -->
            <FileNamePattern>${LOG_FILE_LOCATION}/bak/info.%d{yyyy-MM-dd}.%i.log.gz</FileNamePattern>
            <!-- Number of days to keep log files -->
            <MaxHistory>30</MaxHistory>
            <MaxFileSize>10MB</MaxFileSize>
        </rollingPolicy>
    </appender>

    <appender name="fileErrorLog" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>ERROR</level>
        </filter>
        <file>${LOG_FILE_LOCATION}/error.log</file>
        <encoder>
            <pattern>${FILE_LOG_PATTERN}</pattern>
        </encoder>
        <!-- Rolling policy -->
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <!-- File name pattern for rolled log files -->
            <FileNamePattern>${LOG_FILE_LOCATION}/bak/error.%d{yyyy-MM-dd}.%i.log.gz</FileNamePattern>
            <!-- Number of days to keep log files -->
            <MaxHistory>30</MaxHistory>
            <MaxFileSize>10MB</MaxFileSize>
        </rollingPolicy>
    </appender>

    <!-- Print the trace ID -->
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
            <layout class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.mdc.TraceIdMDCPatternLogbackLayout">
                <Pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%X{tid}] [%thread] %-5level %logger{36} -%msg%n</Pattern>
            </layout>
        </encoder>
    </appender>

    <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
        <discardingThreshold>0</discardingThreshold>
        <queueSize>1024</queueSize>
        <neverBlock>true</neverBlock>
        <appender-ref ref="STDOUT"/>
    </appender>

    <!-- Report logs to SkyWalking -->
    <appender name="grpc-log" class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.log.GRPCLogClientAppender">
        <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
            <layout class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.mdc.TraceIdMDCPatternLogbackLayout">
                <Pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%X{tid}] [%thread] %-5level %logger{36} -%msg%n</Pattern>
            </layout>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="grpc-log"/>
        <appender-ref ref="ASYNC"/>
        <appender-ref ref="consoleLog"/>
        <appender-ref ref="fileInfoLog"/>
        <appender-ref ref="fileErrorLog"/>
    </root>
</configuration>
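4. Persistence Configuration
Edit application.yml under apache-skywalking-apm-9.0.0\apache-skywalking-apm-bin\config.
First, change storage to mysql. MySQL is used as the example here; several other storage backends are supported.
Second, change the MySQL connection settings. The MySQL JDBC driver is not bundled with SkyWalking, so the driver jar also has to be copied into the oap-libs directory.
Create the swtest database manually, then start SkyWalking; the tables are created automatically.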
IV. Alerting
The alarm rules are configured in alarm-settings.yml under apache-skywalking-apm-9.0.0\apache-skywalking-apm-bin\config.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    # Average service response time exceeded 1 second in 3 of the last 10 minutes
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    # Service success rate was below 80% (threshold 8000) in 2 of the last 10 minutes
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    # The p50, p75, p90, p95, or p99 response time exceeded 1000 ms in 3 of the last 10 minutes
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    # Average response time of a service instance exceeded 1 second in 2 of the last 10 minutes
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    # Average database response time exceeded 1 second in 2 of the last 10 minutes
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    # Average endpoint response time exceeded 1 second in 2 of the last 10 minutes; an endpoint can be thought of as a request path
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_resp_time_rule:
#    metrics-name: endpoint_resp_time
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
  # Alarm callback URL; we have to implement this POST endpoint ourselves
  - http://127.0.0.1:8090/notify/
#  - http://127.0.0.1/go-wechat/
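The way period and count interact ("in 3 minutes of last 10 minutes") is easier to see in code. The following is a simplified model of that reading of the rule, not SkyWalking's implementation: the class RuleWindow is hypothetical, it handles only the ">" operator, it assumes one metric value per minute, and it leaves silence-period out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of one alarm rule with op ">".
final class RuleWindow {
    private final long threshold;  // value the metric is compared against
    private final int period;      // window length, in minutes
    private final int count;       // matches within the window needed to trigger
    private final Deque<Long> window = new ArrayDeque<>();

    RuleWindow(long threshold, int period, int count) {
        this.threshold = threshold;
        this.period = period;
        this.count = count;
    }

    /** Records one per-minute value and reports whether the rule condition holds now. */
    boolean addAndCheck(long value) {
        window.addLast(value);
        if (window.size() > period) {
            window.removeFirst(); // forget values older than the window
        }
        long matches = window.stream().filter(v -> v > threshold).count();
        return matches >= count;
    }
}
```

With new RuleWindow(1000, 10, 3), which mirrors service_resp_time_rule, the per-minute values 1200, 800, 1500 do not trigger, and a following 1100 does, because three of the values in the window are then above 1000.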
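The rule properties are as follows:
metrics-name: the metric the rule evaluates; its value must be a long, double, or int.
op: the comparison operator, such as ">" or "<".
threshold: the value the metric is compared against.
period: the length of the evaluation window, in minutes.
count: how many times the condition must match within the window before the alarm triggers.
silence-period: how many checks the alarm stays silent after it triggers; defaults to the same value as period.
message: the notification text sent when the alarm triggers.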
Define the alarm entity class
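import lombok.Data;

// One alarm message as pushed by the SkyWalking webhook
@Data
public class AlarmMessageDto {
    private String scopeId;
    // Name of the entity that raised the alarm
    private String name;
    private String id0;
    private String id1;
    // Text taken from the rule's message property
    private String alarmMessage;
    // Time the alarm was triggered, in milliseconds
    private long startTime;
}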
Define the interface
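@Override
public void send(List<AlarmMessageDto> alarmMessageList) {
    // In production, run a separate monitoring service and use MQ to send SMS, email,
    // or WeChat official-account template messages; this is only a demonstration.
    for (AlarmMessageDto alarm : alarmMessageList) {
        log.info("Alarm message >>>>>>>>{}", alarm.getAlarmMessage());
    }
}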
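The send method above only handles the messages; something still has to accept the POST request that SkyWalking sends to the webhook URL. The article's own project presumably does this with a Spring controller that is not shown here, so the following is a stand-in sketch that uses the JDK's built-in com.sun.net.httpserver instead. The class name AlarmWebhookServer is hypothetical, and the JSON body is kept as a raw string rather than mapped to AlarmMessageDto.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the webhook endpoint configured as /notify/.
final class AlarmWebhookServer {
    // Raw JSON bodies received so far; a real service would parse and forward them.
    final List<String> receivedBodies = new CopyOnWriteArrayList<>();
    private final HttpServer server;

    AlarmWebhookServer(int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/notify/", this::handle);
    }

    private void handle(HttpExchange exchange) throws IOException {
        try {
            if (!"POST".equals(exchange.getRequestMethod())) {
                exchange.sendResponseHeaders(405, -1); // only POST is accepted
                return;
            }
            receivedBodies.add(readAll(exchange.getRequestBody()));
            exchange.sendResponseHeaders(200, -1); // acknowledge with an empty body
        } finally {
            exchange.close();
        }
    }

    private static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Starts the server and returns the port it is actually bound to. */
    int start() {
        server.start();
        return server.getAddress().getPort();
    }

    void stop() {
        server.stop(0);
    }
}
```

Creating it with new AlarmWebhookServer(8090) and calling start() would match the webhook URL configured in alarm-settings.yml; passing port 0 lets the operating system pick a free port, which is convenient for local testing.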
For the complete code, see spring-boot: code for integrating Spring Boot with Redis, message middleware, and more.