四、案例分析
4.1案例:Function trace
- function默认会记录当前运行过程中的所有函数;
- 如果只想跟踪某个进程,可以使用set_ftrace_pid;
- 如果只想跟踪某个函数,可以使用set_ftrace_filter;
(0)关闭tracer
echo 0 > tracing_on #清空打印 echo > trace
(1)设置类型:设current_tracer 类型为 function
echo blk_update_request > set_ftrace_filter
设置不跟踪指定函数;
echo > set_ftrace_filter echo blk_update_request > set_ftrace_notrace
还可以用通配符"
#'*match*':匹配所有包含match的函数; echo 'hrtimer_*' >> set_ftrace_filter #过滤所有以"hrtimer_"开头的函数
还可以做基于模块的过滤:例如,过滤 ext4 module 的 write* 函数
#控制范式:<function>:<command>:<parameter> echo 'write*:mod:ext4' > set_ftrace_filter
感叹号用来移除某个函数,把多个组合过滤条件的某一个去掉:
echo '!ip_rcv' >> set_ftrace_filter
遇到 __schedule_bug 函数后关闭 trace
# echo '__schedule_bug:traceoff' > set_ftrace_filter
(3)打开trace开关,开始 trace
# echo 1 > tracing_on
(4)提取trace结果
[root@myQEMU tracing]# cat trace # tracer: function # # entries-in-buffer/entries-written: 6/6 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | <idle>-0 [001] ..s. 229.298724: blk_update_request <-blk_mq_end_request <idle>-0 [001] ..s. 229.316890: blk_update_request <-blk_mq_end_request <idle>-0 [001] ..s. 229.317916: blk_update_request <-blk_mq_end_request <idle>-0 [001] ..s. 229.325217: blk_update_request <-blk_mq_end_request <idle>-0 [001] d.s. 229.325280: blk_update_request <-blk_mq_end_request sync-620 [001] ..s. 229.330068: blk_update_request <-blk_mq_end_request [root@myQEMU tracing]#
从 trace 信息我们可以获取很多重要信息:
- 进程信息,TASK-PID
- 进程运行的 CPU
- 执行函数时的系统状态,包括中断,抢占等状态信息
- 执行函数的时间戳
function_graph使用方法和function类似,但可以更清晰的展示函数调用关系;
开启堆栈调用:可以跟踪函数被调用流程
[root@myQEMU tracing]# echo 1 > options/func_stack_trace [root@myQEMU tracing]# cat /cpu.info [root@myQEMU tracing]# cat trace # tracer: function # # entries-in-buffer/entries-written: 18/18 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | cat-646 [000] ..s. 548.454890: blk_update_request <-blk_mq_end_request cat-646 [000] ..s. 548.454915: <stack trace> => blk_update_request => blk_mq_end_request => virtblk_request_done => blk_complete_reqs => blk_done_softirq => __do_softirq => irq_exit => __handle_domain_irq => gic_handle_irq => el1_irq => get_page_from_freelist => __alloc_pages => page_cache_ra_unbounded => do_page_cache_ra => filemap_fault => ext4_filemap_fault => __do_fault => __handle_mm_fault => handle_mm_fault => do_page_fault => do_translation_fault => do_mem_abort => el0_ia => el0_sync_handler => el0_sync
4.2案例:function_graph
使用方法和function类似,但可以更清晰的展示函数调用关系;
开启堆栈调用,可以查看函数调用子函数堆栈
echo blk_update_request > set_graph_function echo function_graph > current_tracer echo 1 > options/func_stack_trace echo 1 > tracing_on cat /cpu.info
查看trace
[root@myQEMU home]# cat /tracing/trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) + 24.368 us | start_backtrace(); 0) 4.336 us | start_backtrace(); 0) | blk_update_request() { 0) 3.200 us | start_backtrace(); 0) 3.152 us | start_backtrace(); 0) + 68.576 us | } 0) | blk_update_request() { 0) 4.192 us | bio_advance(); 0) | bio_endio() { 0) | submit_bio_wait_endio() { 0) | complete() { 0) | swake_up_locked() { 0) | swake_up_locked.part.0() { 0) | wake_up_process() { 0) | try_to_wake_up() { 0) | select_task_rq_fair() { 0) 3.808 us | available_idle_cpu(); 0) 3.856 us | rcu_read_unlock_strict(); 0) + 23.696 us | } 0) 3.968 us | ttwu_queue_wakelist(); 0) 4.960 us | update_rq_clock.part.0(); 0) | ttwu_do_activate.isra.0() { 0) | enqueue_task_fair() { 0) | update_curr() { 0) 4.064 us | __calc_delta(); 0) 4.080 us | update_min_vruntime(); 0) 4.192 us | rcu_read_unlock_strict(); 0) + 31.056 us | } 0) 5.104 us | __update_load_avg_se(); 0) | __update_load_avg_cfs_rq() { 0) 4.208 us | __accumulate_pelt_segments(); 0) + 13.840 us | } 0) 4.544 us | __enqueue_entity(); 0) + 77.328 us | } 0) | ttwu_do_wakeup.isra.0() { 0) | check_preempt_curr() { 0) | check_preempt_wakeup() { 0) 4.688 us | update_curr(); 0) 4.640 us | wakeup_preempt_entity.isra.0(); 0) + 23.616 us | } 0) + 32.976 us | } 0) + 50.688 us | } 0) ! 141.648 us | } 0) ! 198.064 us | } 0) ! 206.528 us | } 0) ! 215.072 us | } 0) ! 223.376 us | } 0) ! 232.320 us | } 0) ! 240.960 us | } 0) ! 250.400 us | } 0) ! 269.840 us | } [root@myQEMU tracing]#
小技巧:用vim插件Documentation/trace/function-graph-fold.vim 打开,可以方便折叠函数方便查看,za(展开)/zc(折叠)
vim ft_graph.log -S Documentation/trace/function-graph-fold.vim # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) + 24.368 us | start_backtrace(); 0) 4.336 us | start_backtrace(); + 0) | blk_update_request() {---------------------------------------------------------------------------------------------------- - 0) | blk_update_request() { | 0) 4.192 us | bio_advance(); |- 0) | bio_endio() { ||- 0) | submit_bio_wait_endio() { |||- 0) | complete() { ||||+ 0) | swake_up_locked() {----------------------------------------------------------------------------------------------- |||| 0) ! 232.320 us | } ||| 0) ! 240.960 us | } || 0) ! 250.400 us | } | 0) ! 269.840 us | }
4.3案例:事件 trace
Ftrace的跟踪机制主要有两种:函数和tracepoint。
函数,案例1已经做了介绍;tracepoint,可以理解为linux内核的占位符号,tracepoint可以输出开发者想要的参数,局部变量等。类似于C语言的#if DEBUG部分,不开启的话,不占用任何系统开销;
内核里常遇到以"trace_"开头的函数,比如CFS调度器里de update_curr()函数
static void update_curr(struct cfs_rq *cfs_rq) { ... update_min_vruntime(cfs_rq); if (entity_is_task(curr)) { struct task_struct *curtask = task_of(curr); trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime); } ... }
update_curr()函数使用了一个sched_stat_runtime的tracepoint, 可以在available_events看到;
查询系统支持的事件:
cat available_events |grep sched_stat_runtime sched:sched_stat_runtime
操作过程:
echo 0 > tracing_on echo > trace echo sched:sched_stat_runtime > set_event echo 1 > tracing_on #或者直接进入时间目录,等价操作: cd /tracing/events/sched/sched_switch echo 1 enable cat /tracing/trace
格式解析:
[root@myQEMU sched_stat_runtime]# cat format name: sched_stat_runtime ID: 208 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:char comm[16]; offset:8; size:16; signed:0; field:pid_t pid; offset:24; size:4; signed:1; field:u64 runtime; offset:32; size:8; signed:0; field:u64 vruntime; offset:40; size:8; signed:0; print fmt: "comm=%s pid=%d runtime=%Lu [ns] vruntime=%Lu [ns]", REC->comm, REC->pid, (unsigned long long)REC->runtime, (unsigned long long)REC->vruntime [root@myQEMU sched_stat_runtime]#
sched_stat_runtime这个tracepoint有八个域,前4个是通用域,后4个是tracepoint支持域;
cd events/sched/sched_stat_runtime echo 'comm ~ "sh*"' > filter ///跟踪所有sh开头的进程 130 echo 'pid == 2288' >> filter ///跟踪PID=2288的进程
支持类似C语言的表达式对时间进行过滤。
[root@myQEMU tracing]# cat trace # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=3255632 [ns] vruntime=110771356247804 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=307728 [ns] vruntime=110771356555532 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=359392 [ns] vruntime=110771356914924 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=635824 [ns] vruntime=110771357550748 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=976800 [ns] vruntime=110771358527548 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=681568 [ns] vruntime=110771359209116 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=896000 [ns] vruntime=110771360105116 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=652272 [ns] vruntime=110771360757388 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=795152 [ns] vruntime=110771361552540 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=739952 [ns] vruntime=110771362292492 [ns] */ 0) | /* sched_stat_runtime: comm=sh pid=2288 runtime=812624 [ns] vruntime=110771363105116 [ns] */ 0)
添加一个事件
内核各个子系统已经有大量的tracepoint,如果不够,还可以手工添加一个;
在include/trace/events/sched.h添加一个sched_stat_my_runtime的tracepoint内核提供了宏来方便添加;
DECLARE_EVENT_CLASS(sched_stat_my_runtime, TP_PROTO(struct task_struct *tsk, u64 vruntime), TP_ARGS(tsk, vruntime), TP_STRUCT__entry( __array( char, comm, TASK_COMM_LEN ) __field( pid_t, pid ) __field( u64, vruntime ) ), TP_fast_assign( memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN); __entry->pid = tsk->pid; __entry->vruntime = vruntime; ), TP_printk("comm=%s pid=%d minvruntime=%Lu [ns]", __entry->comm, __entry->pid, (unsigned long long)__entry->vruntime) ); DEFINE_EVENT(sched_stat_my_runtime, sched_stat_my_runtime, TP_PROTO(struct task_struct *tsk, u64 vruntime), TP_ARGS(tsk, vruntime)); [USER@HOSTNAME tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 239/239 #P:1 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | sh-2287 [000] d... 263.524575: sched_stat_my_runtime: comm=sh pid=2287 minvruntime=5762812735127 [ns] rcu_sched-11 [000] d... 263.526249: sched_stat_my_runtime: comm=rcu_sched pid=11 minvruntime=5762812735127 [ns] rcu_sched-11 [000] d... 263.530214: sched_stat_my_runtime: comm=rcu_sched pid=11 minvruntime=5762812735127 [ns] kworker/0:1-18 [000] d... 263.978131: sched_stat_my_runtime: comm=kworker/0:1 pid=18 minvruntime=5762812735127 [ns] kcompactd0-287 [000] d... 264.010218: sched_stat_my_runtime: comm=kcompactd0 pid=287 minvruntime=5762812735127 [ns] kworker/u2:4-255 [000] d... 264.313158: sched_stat_my_runtime: comm=kworker/u2:4 pid=255 minvruntime=5762812735127 [ns] kworker/u2:4-255 [000] d... 264.313242: sched_stat_my_runtime: comm=kworker/u2:4 pid=255 minvruntime=5762812735127 [ns] sh-2287 [000] d.h. 264.313821: sched_stat_my_runtime: comm=sh pid=2287 minvruntime=5762813304551 [ns] sh-2287 [000] d... 264.314023: sched_stat_my_runtime: comm=sh pid=2287 minvruntime=5762813514503 [ns] kworker/u2:4-255 [000] d... 264.393391: sched_stat_my_runtime: comm=kworker/u2:4 pid=255 minvruntime=5762813514503 [ns]
除了DEFINE_EVENT,还可以使用DEFINE_EVENT_CONDITION()定义一个带条件的tracepoint.
如果要定义多个格式相同的tracepoint,可以用DEFINE_EVENT_CLASS()减少代码量;
注:内核提供了一个tracepoint的例子,samples/trace_events/目录中;可以配置CONFIG_samples,CONFIG_SAMPLE_TRACE_EVENTS,编译成模块,加载测试;
4.4案例:irqs跟踪器
当关闭中断时,CPU不能响应外部事件,如果此时有中断发生,只能在下一次开中断时,才能响应这个外部事件,这段延迟叫“中断延迟”。
ftrace提供irqsoff来跟踪这个延迟;
[root@myQEMU tracing]# echo 0 tracing_on 0 tracing_on [root@myQEMU tracing]# echo 0 > trace [root@myQEMU tracing]# echo 0 options/func-trace ///关闭function-trace可以减少一些延迟 0 options/func-trace [root@myQEMU tracing]# echo irqsoff > current_tracer [root@myQEMU tracing]# echo 1 > tracing_on [root@myQEMU tracing]# echo 0 > tracing_on
跟踪结果:
[root@myQEMU tracing]# cat trace # tracer: irqsoff # # irqsoff latency trace v1.1.5 on 5.13.0-rc3+ # -------------------------------------------------------------------- # latency: 2111 us, #158/158, CPU#0 | (M:server VP:0, KP:0, SP:0 HP:0 #P:1) # ----------------- # | task: sh-2287 (uid:0 nice:0 policy:0 rt_prio:0) # ----------------- # => started at: n_tty_write # => ended at: n_tty_write # # # _------=> CPU# # / _-----=> irqs-off # | / _----=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # |||| / delay # cmd pid ||||| time | caller # \ / ||||| \ | / sh-2287 0d... 2us+: uart_write <-n_tty_write sh-2287 0d... 14us+: start_backtrace <-return_address sh-2287 0d... 31us+: __uart_start.isra.0 <-uart_write sh-2287 0d... 45us+: pl011_start_tx <-__uart_start.isra.0 sh-2287 0d... 57us+: pl011_tx_chars <-pl011_start_tx sh-2287 0d... 70us+: pl011_tx_char <-pl011_tx_chars sh-2287 0d... 83us+: pl011_read <-pl011_tx_char ... sh-2287 0d... 1843us+: pl011_read <-pl011_tx_char sh-2287 0d... 1861us : pl011_tx_char <-pl011_tx_chars sh-2287 0d... 1869us+: pl011_read <-pl011_tx_char sh-2287 0d... 1889us : uart_write_wakeup <-pl011_tx_chars sh-2287 0d... 1899us : tty_port_tty_wakeup <-uart_write_wakeup sh-2287 0d... 1908us : tty_port_default_wakeup <-tty_port_tty_wakeup sh-2287 0d... 1918us+: tty_port_tty_get <-tty_port_default_wakeup sh-2287 0d... 1983us+: tty_wakeup <-tty_port_default_wakeup sh-2287 0d... 1994us : __wake_up <-tty_wakeup sh-2287 0d... 2004us : __wake_up_common_lock <-__wake_up sh-2287 0d... 2014us : __wake_up_common <-__wake_up_common_lock sh-2287 0d... 2024us : woken_wake_function <-__wake_up_common sh-2287 0d... 2034us : default_wake_function <-woken_wake_function sh-2287 0d... 2044us+: try_to_wake_up <-default_wake_function sh-2287 0d... 2060us+: tty_kref_put <-tty_port_default_wakeup sh-2287 0d... 2075us+: start_backtrace <-return_address sh-2287 0d... 2090us+: start_backtrace <-return_address sh-2287 0d... 2102us+: uart_write <-n_tty_write sh-2287 0d... 2115us+: tracer_hardirqs_on <-n_tty_write sh-2287 0d... 2179us : <stack trace> => n_tty_write => file_tty_write.isra.0 => redirected_tty_write => new_sync_write => vfs_write => ksys_write => __arm64_sys_write => invoke_syscall.constprop.0 => do_el0_svc => el0_svc => el0_sync_handler => el0_sync
每个参数的实际意义,参考官网: https://www.kernel.org/doc/html/latest/trace/ftrace.html
4.5案例:trace marker
trace_marker是一个文件节点,允许用户写入字符串,ftrace会记录写入动作的时间戳;
(1)由于trace_marker是写内存,速度很快,避免串口的长耗时;
(2)可以配合内核的event,跟踪两个trace marker之间,内核的运行状况;
这个可以用来跟踪分析应用程序;
测试代码:
#include <stdio.h> #include <stdlib.h> #include <string.h> #include <time.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <sys/time.h> #include <sys/unistd.h> #include <stdarg.h> #include <unistd.h> #include <ctype.h> #include <signal.h> static int mark_fd = 1; static __thread char buff[BUFSIZ+1]; static void setup_ftrace_marker(void) { struct stat st; char *files[]={ "/sys/kernel/debug/tracing/trace_marker", "/debug/tracing/trace_marker", "/debugfs/tracing/trace_marker", }; int ret; int i=0; for(i=0; i < (sizeof(files)/sizeof(char*));i++) { ret = stat(files[i], &st); if (ret >=0) { break; } } if (ret >= 0) { mark_fd = open(files[i], O_WRONLY); printf("cannot found the sys tracing.\n"); } return 0; } static void ftrace_write(const char *fmt, ...) { va_list ap; int n; if (mark_fd < 0) return ; va_start(ap, fmt); n = vsnprintf(buff, BUFSIZ, fmt, ap); va_end(ap); write(mark_fd, buff, n); } void sig_handler() { printf("Oops! will exit...\n"); if (mark_fd >= 0) { close(mark_fd); } _exit(0); } int main() { printf("hello\n"); int count = 0; signal(SIGINT,sig_handler); setup_ftrace_marker(); ftrace_write("start program.\n"); while(1) { usleep(300*1000); count++; ftrace_write("===test count=%d\n",count); } return 0; }
执行跟踪:
echo 0 > tracing_on echo 0 > trace echo 1 > tracing_on /mnt/a.out echo 0 > tracing_on cat trace
查看trace记录
[root@myQEMU tracing]# cat trace # tracer: nop # # entries-in-buffer/entries-written: 33/33 #P:1 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | rb_producer-428 [000] .... 125.274713: ring_buffer_producer_thread: Starting ring buffer hammer a.out-2291 [000] .... 125.361620: tracing_mark_write: start program. a.out-2291 [000] .... 125.667714: tracing_mark_write: ===test count=1 a.out-2291 [000] .... 125.970099: tracing_mark_write: ===test count=2 a.out-2291 [000] .... 126.280010: tracing_mark_write: ===test count=3 a.out-2291 [000] .... 126.584222: tracing_mark_write: ===test count=4 a.out-2291 [000] .... 126.886110: tracing_mark_write: ===test count=5 a.out-2291 [000] .... 127.195146: tracing_mark_write: ===test count=6 a.out-2291 [000] .... 127.498109: tracing_mark_write: ===test count=7 a.out-2291 [000] .... 127.802138: tracing_mark_write: ===test count=8 a.out-2291 [000] .... 128.112389: tracing_mark_write: ===test count=9 a.out-2291 [000] .... 128.414082: tracing_mark_write: ===test count=10
PS:实际上android的atrace和Trace类,就是基于trace marker实现的;
4.6案例:wakeup
追踪普通进程从被唤醒到真正得到执行之间的延迟。
echo wakeup > current_tracer echo 1 > tracing_on cat trace |head -40
4.7案例:wakeup-rt
跟踪rt进程,从被唤醒到执行的延迟;
echo 0 > tracing_on echo > trace echo wakeup_rt > current_tracer echo 1 > tracing_on cat trace |head -40
当然也可以用脚本来执行这些过程,结果保存到文本:
#!/bin/sh TRACE_PATH=/sys/kernel/debug/tracing echo nop > $TRACE_PATH/current_tracer echo 0 >$TRACE_PATH/tracing_on echo >$TRACE_PATH/trace echo 0 > $TRACE_PATH/max_graph_depth echo $$ #echo $$ > $TRACE_PATH/set_ftrace_pid echo function_graph > $TRACE_PATH/current_tracer echo 'blk_update_request' > $TRACE_PATH/set_graph_function echo 1 > options/func_stack_trace echo 1 > $TRACE_PATH/tracing_on exec "$@"
实用案例:跟踪某个进程的相关函数
实际调试:如果能把环形队列的数据,写入磁盘,供后续分析,更有现实意义;通过脚本实现?
案例:跟踪指定进程函数调用
#!/bin/bash debugfs=/sys/kernel/debug echo nop > $debugfs/tracing/current_tracer echo 0 > $debugfs/tracing/tracing_on echo `pidof a.out` > $debugfs/tracing/set_ftrace_pid echo function_graph > $debugfs/tracing/current_tracer echo vfs_read > $debugfs/tracing/set_graph_function echo 1 > $debugfs/tracing/tracing_on
案例:当内核log打印输出"min_adj 0"时,保存ftrace信息和kernel log信息到指定目录中
#!/bin/sh mkdir -p /data/ echo 8 > /proc/sys/kernel/printk echo -1000 > /proc/self/oom_score_adj cd /sys/kernel/debug/tracing echo 0 > trcing_on echo 0 > trace echo 1 > /sys/kernel/debug/tracing/events/oom/oom_score_adj_update/enable echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_start/enable echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_shrink_slab_end/enable #开始采集 echo 1 > tracing_on TIMES=0 while true do dmesg |grep "min_adj 0" if [ $? -eq 0] then cat /sys/kernel/debug/tracing/trace > /data/ftrace_.$TIMES dmesg > /data/kmsg.txt.$TIMES let TIMES+=1 dmesg -cat echo > trace fi sleep 2 done
trace_cmd和kernelshark
这一组工具,实际上是将上面ftrace的过程,采集记录下来数据,然后图形展示,提供更直观的分析;
trace_cmd,使用遵循reset->record->stop->report模式,简单说用trace_cmd record采集数据,按"ctrl+c"停止采集,在当前目录生成trace.dat文件;用trace-cmd report可以解析trace.dat文件,不过是文字形式的。
下载编译trace-cmd:
git clone https://github.com/rostedt/trace-cmd.git export CROSS_COMPILE=aarch64-linux-gnu- export ARCH=arm64 make
kernelshark,可以将trace.dat文件图形化,方便观察和分析;
精品文章推荐阅读: