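Introduction
A thread can stall or slow down either because it is starved of CPU or for other reasons, such as waiting for network data. On the monitoring side, this shows up as fluctuation in the business metrics. Pinning down which cause is at work, however, is not easy.
The first difficulty is capturing the scene. If the CPU is saturated or the load is very high, the problem does reproduce, but the thread that collects the data may never get a chance to run. We discussed how to handle that in another tip, so we skip it here.
The second difficulty is deciding which data can show that a thread fluctuated because of CPU resources. That is what we expand on below.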
vruntime
Linux 2.6.23 introduced the CFS scheduler, and task_struct gained a sched_entity structure along with it. One field of sched_entity is of interest to us: vruntime.
struct sched_entity {
    /* For load-balancing: */
    struct load_weight load;
    unsigned long runnable_weight;
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;
    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime; // the field we are going to use
    u64 prev_sum_exec_runtime;
    u64 nr_migrations;
    struct sched_statistics statistics;
#ifdef CONFIG_FAIR_GROUP_SCHED
    int depth;
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq *my_q;
#endif
#ifdef CONFIG_SMP
    /*
     * Per entity load average tracking.
     *
     * Put into separate cache line so it does not
     * collide with read-mostly values above.
     */
    struct sched_avg avg;
#endif
};
What does vruntime represent? The kernel documentation puts it this way:
In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
timestamp and measure the "expected CPU time" a task should have gotten.
[ small detail: on "ideal" hardware, at any time all tasks would have the same
p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
would ever get "out of balance" from the "ideal" share of CPU time. ]
CFS's task picking logic is based on this p->se.vruntime value and it is thus
very simple: it always tries to run the task with the smallest p->se.vruntime
value (i.e., the task which executed least so far). CFS always tries to split
up CPU time between runnable tasks as close to "ideal multitasking hardware" as
possible.
Most of the rest of CFS's design just falls out of this really simple concept,
with a few add-on embellishments like nice levels, multiprocessing and various
algorithm variants to recognize sleepers.
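In short, vruntime represents the processor time a thread has consumed so far, weighted by its priority. On ideal hardware, all threads would have the same vruntime.
That is the basis for our approach.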
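As a minimal sketch of how this value can be read in practice, the helper below pulls se.vruntime out of /proc/<pid>/sched. The function names extract_vruntime and vruntime_of are hypothetical, and the sketch assumes the kernel exposes /proc/<pid>/sched with a "se.vruntime : value" line, as in the output shown later in this article.

```shell
#!/bin/bash
# Extract the se.vruntime value from /proc/<pid>/sched-style text on stdin.
# Assumption: the line looks like "se.vruntime      :      22075.635863".
extract_vruntime() {
    awk -F: '$1 ~ /^se\.vruntime[ \t]*$/ {
        gsub(/[ \t]/, "", $2)   # strip the padding around the value
        print $2
    }'
}

# Print the current se.vruntime of the given PID (hypothetical helper).
vruntime_of() {
    local pid=$1
    extract_vruntime < "/proc/${pid}/sched"
}
```

For example, `vruntime_of $$` would print the value for the current shell, provided the file exists on the running kernel.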
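A simple experiment
Test script and stress tool
We simply have the test script print its own vruntime information. For the stress tool, we use perf.
The test script is as follows: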
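#!/bin/bash
# Use the C locale so tool output stays in a stable format.
export LANG=C
# Dump this shell's scheduler statistics once per second, ten times.
for ((i = 0; i < 10; i++)); do
    # $$ is the PID of this script's shell, so the dump includes its se.vruntime.
    cat /proc/$$/sched
    sleep 1
done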
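If only the vruntime line matters, a leaner variant of the script could log one timestamped value per sample instead of the full dump. The following is a sketch under assumptions: the function name sample_vruntime and its three parameters (sample count, interval in seconds, path of the sched file) are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Print "<epoch seconds> <se.vruntime>" COUNT times, INTERVAL seconds apart.
# The sched file defaults to this shell's own /proc/$$/sched.
sample_vruntime() {
    local count=${1:-10}
    local interval=${2:-1}
    local sched_file=${3:-/proc/$$/sched}
    local i=0
    local v
    while [ "$i" -lt "$count" ]; do
        # Pick the value of the "se.vruntime : <value>" line.
        v=$(awk -F: '$1 ~ /^se\.vruntime[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }' "$sched_file")
        printf '%s %s\n' "$(date +%s)" "$v"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Calling `sample_vruntime 10 1` would mirror the ten one-second samples of the script above, while passing another process's /proc/<pid>/sched as the third argument would watch that process instead.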
The stress tool is invoked as follows:
root@pusf:~# perf bench sched messaging -l 10000
Putting it together, our test procedure is as follows:
./demo > log/1.log; perf bench sched messaging -l 10000 & sleep 1;./demo > log/2.log
Result analysis
Let's look at the results:
nerd@pusf:/tmp$ egrep vruntime log/{1.log,2.log}
log/1.log:se.vruntime : 22075.635863
log/1.log:se.vruntime : 22076.476482
log/1.log:se.vruntime : 22077.746821
log/1.log:se.vruntime : 22080.537902
log/1.log:se.vruntime : 22084.183713
log/1.log:se.vruntime : 22087.243075
log/1.log:se.vruntime : 22098.180655
log/1.log:se.vruntime : 22099.594014
log/1.log:se.vruntime : 22104.294012
log/1.log:se.vruntime : 22108.701587
log/2.log:se.vruntime : 82731.373434
log/2.log:se.vruntime : 83382.975477
log/2.log:se.vruntime : 78933.644191
log/2.log:se.vruntime : 88235.425663
log/2.log:se.vruntime : 93117.891657
log/2.log:se.vruntime : 101234.834622
log/2.log:se.vruntime : 95899.749367
log/2.log:se.vruntime : 115403.719751
log/2.log:se.vruntime : 124388.997744
log/2.log:se.vruntime : 126752.972070
nerd@pusf:/tmp$
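As the output shows, the difference in vruntime is significant: from the first sample to the last, se.vruntime advances by only about 33 in the idle run (log/1.log), but by about 44,000 in the run under load (log/2.log).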
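To put a number on the gap rather than eyeballing it, one could compute how far se.vruntime moved between the first and last sample of each log. This is only a sketch: the function name vruntime_delta is hypothetical, and it assumes each log contains the raw /proc/<pid>/sched dumps written by the test script.

```shell
#!/bin/bash
# Print last-minus-first se.vruntime for one log of /proc/<pid>/sched dumps.
vruntime_delta() {
    awk -F: '
        $1 ~ /^se\.vruntime[ \t]*$/ {
            v = $2 + 0                      # numeric value, padding ignored
            if (!seen) { first = v; seen = 1 }
            last = v
        }
        END { if (seen) printf "%.6f\n", last - first }
    ' "$1"
}
```

Running `vruntime_delta log/1.log` and `vruntime_delta log/2.log` would give one figure per run to compare. Note that this only looks at the endpoints, so it does not capture the samples in log/2.log where the value goes down between consecutive readings.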