Perf Subsystem: An NMI Watchdog Built on the PMI



## Background


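Whether tasks get a timely response is critical to the kernel. The Linux kernel implements softlockup and hardlockup detectors to check whether the system has been unresponsive for a long time.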


> A ‘softlockup’ is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds, without giving other tasks a chance to run.

>

> A ‘hardlockup’ is defined as a bug that causes the CPU to loop in kernel mode for more than 10 seconds, without letting other interrupts have a chance to run. [1]


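Many articles online cover the basics of softlockup and hardlockup, and most of them focus on analyzing the detection algorithm. This article instead explains the hardware mechanism behind the kernel's hardlockup detector, which is built on the Performance Monitoring Interrupt (PMI).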


## How hardlockup detection works


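To detect a hardlockup, the kernel uses a high-resolution timer (hrtimer) and the non-maskable interrupt (NMI) to check whether interrupts have gone unserviced for a long time. This mechanism is also called the NMI watchdog. The basic principle is: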


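- The hrtimer periodically raises a timer interrupt, and its handler increments the counter `hrtimer_interrupts`.

- The PMU counts the cycle event and overflows periodically, raising a PMI that is configured for NMI delivery. Because an NMI cannot be masked, its handler still runs on a CPU that has stopped responding to ordinary interrupts. The handler checks whether `hrtimer_interrupts` is still increasing. If the counter has stalled, the timer interrupt was not serviced within the expected time, which means a hard lockup has occurred.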
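The two-counter check described above can be modeled in user space. The following is a minimal sketch, not the kernel's implementation: `struct wd_cpu_state`, `wd_timer_tick`, and `wd_nmi_check` are hypothetical names, and keeping a saved copy of the counter between checks is an assumption about how the "is it still increasing" comparison can be done.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of one CPU's watchdog state. */
struct wd_cpu_state {
    uint64_t hrtimer_interrupts;       /* bumped by the timer interrupt */
    uint64_t hrtimer_interrupts_saved; /* value seen by the previous NMI check */
};

/* Models the hrtimer handler. It only runs while the CPU still
 * services ordinary (maskable) interrupts. */
void wd_timer_tick(struct wd_cpu_state *s)
{
    s->hrtimer_interrupts++;
}

/* Models the check made in NMI context when the cycle counter overflows.
 * Returns true if no timer interrupt was handled since the previous check. */
bool wd_nmi_check(struct wd_cpu_state *s)
{
    if (s->hrtimer_interrupts_saved == s->hrtimer_interrupts)
        return true;                   /* counter stalled: hard lockup */
    s->hrtimer_interrupts_saved = s->hrtimer_interrupts;
    return false;
}
```

In this model, calling `wd_nmi_check` twice without a `wd_timer_tick` in between reports a lockup on the second call, which mirrors a CPU that keeps taking NMIs while its timer interrupt is blocked.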


The initialization call chain looks roughly like this:


``` c

main()

=> lockup_detector_init()

 => watchdog_nmi_probe()  // step 1

  => hardlockup_detector_perf_init()

   => hardlockup_detector_event_create()

 => lockup_detector_setup()  // step 2

  => lockup_detector_reconfigure()

   => softlockup_start_all()

    => for_each_cpu: softlockup_start_fn()

     => watchdog_enable()


```


- Step 1: initialize the perf event used for hardlockup detection.

- Step 2: initialize the high-resolution timer.


## The hardlockup perf event


The x86 platform makes clever use of the PMI's NMI delivery mode to implement hardlockup detection.


### Configuring the event


The perf event that the NMI watchdog needs is created as follows:


```c

hardlockup_detector_event_create()

=> perf_event_create_kernel_counter()

=> perf_event_alloc()


static struct perf_event *

perf_event_alloc(struct perf_event_attr *attr, int cpu,

  struct task_struct *task,

  struct perf_event *group_leader,

  struct perf_event *parent_event,

  perf_overflow_handler_t overflow_handler,

  void *context, int cgroup_fd)

```


The perf event creation interface `perf_event_alloc` takes the event attributes `attr` and the callback `overflow_handler`, which runs when the event counter overflows.


```c

// kernel/watchdog_hld.c

static struct perf_event_attr wd_hw_attr = {

.type  = PERF_TYPE_HARDWARE,

.config  = PERF_COUNT_HW_CPU_CYCLES,

.size  = sizeof(struct perf_event_attr),

.pinned  = 1,

.disabled = 1,

};

```


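The perf event used by the NMI watchdog is configured by `wd_hw_attr`:


- type: `PERF_TYPE_HARDWARE`, meaning that `config` names a hardware event

- config: `PERF_COUNT_HW_CPU_CYCLES`, the number of clock cycles

- pinned: ensures that the counter is not multiplexed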


On Intel and AMD processors, the `PERF_COUNT_HW_CPU_CYCLES` event is implemented as `CPU_CLK_UNHALTED`. It can be counted on either a general-purpose (GP) counter or a fixed-function (FP) counter. The [FP counters](https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-653.html) described in the Intel manual currently cover just three events: `CPU_CLK_UNHALTED.CORE`, `CPU_CLK_UNHALTED.REF`, and `INST_RETIRED.ANY`, that is, core clock cycles, reference clock cycles, and retired instructions.


```

// https://man7.org/linux/man-pages/man2/perf_event_open.2.html

  pinned The pinned bit specifies that the counter should always be

   on the CPU if at all possible.  It applies only to

   hardware counters and only to group leaders.  If a pinned

   counter cannot be put onto the CPU (e.g., because there

   are not enough hardware counters or because of a conflict

   with some other event), then the counter goes into an

   'error' state, where reads return end-of-file (i.e.,

   read(2) returns 0) until the counter is subsequently

   enabled or disabled.


```


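When the system uses more events than the hardware has counters, the counters are time-multiplexed and precision drops. The `pinned` attribute keeps the counter used by the NMI watchdog from being time-multiplexed, which preserves the precision of the event.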
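For comparison, a similar pinned cycles event can be requested from user space through `perf_event_open(2)`. The sketch below only illustrates the attribute fields discussed here. `make_cycles_attr` and `open_cycles_on_cpu` are hypothetical helper names, and opening a CPU-wide event is assumed to need sufficient privileges (for example a permissive `perf_event_paranoid` setting), so the call may well fail on a given machine.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                 /* for syscall() under strict -std modes */
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: build a user-space attr that mirrors wd_hw_attr. */
struct perf_event_attr make_cycles_attr(uint64_t sample_period)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;          /* config names a hardware event */
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* count CPU cycles */
    attr.size = sizeof(attr);
    attr.pinned = 1;                         /* never multiplex this counter */
    attr.disabled = 1;                       /* start disabled, enable later */
    attr.sample_period = sample_period;      /* overflow every N cycles */
    return attr;
}

/* Hypothetical helper: open the event for all tasks on one CPU.
 * Returns a file descriptor, or -1 with errno set. */
int open_cycles_on_cpu(struct perf_event_attr *attr, int cpu)
{
    return (int)syscall(SYS_perf_event_open, attr, (pid_t)-1, cpu, -1, 0UL);
}
```

Unlike the kernel counter, a user-space event opened this way has no in-kernel overflow callback; it is shown only to make the meaning of `type`, `config`, `pinned`, and `sample_period` concrete.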


```

// https://man7.org/linux/man-pages/man2/perf_event_open.2.html

sample_period, sample_freq

   A "sampling" event is one that generates an overflow

   notification every N events, where N is given by

   sample_period.  A sampling event has sample_period > 0.

   When an overflow occurs, requested data is recorded in the

   mmap buffer.  The sample_type field controls what data is

   recorded on each overflow.

```


### Registering the event


The NMI watchdog event is registered with the following code:


``` c

// kernel/watchdog.c

int __read_mostly watchdog_thresh = 10;


// arch/x86/kernel/apic/hw_nmi.c

u64 hw_nmi_get_sample_period(int watchdog_thresh)

{

return (u64)(cpu_khz) * 1000 * watchdog_thresh;   // C3

}


/**

* perf_event_create_kernel_counter

*

* @attr: attributes of the counter to create

* @cpu: cpu in which the counter is bound

* @task: task to profile (NULL for percpu)

* @overflow_handler: callback to trigger when we hit the event

* @context: context data could be used in overflow_handler callback

*/

struct perf_event *

perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,

    struct task_struct *task,

    perf_overflow_handler_t overflow_handler,

    void *context){

// ...

 event = perf_event_alloc(attr, cpu, task, NULL, NULL,

    overflow_handler, context, -1);

}


// kernel/watchdog_hld.c

static int hardlockup_detector_event_create(void)

{

unsigned int cpu = smp_processor_id();

struct perf_event_attr *wd_attr;

struct perf_event *evt;


wd_attr = &wd_hw_attr;  // C1

wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh); // C2


/* Try to register using hardware perf events */

evt = perf_event_create_kernel_counter(wd_attr, cpu, NULL,

        watchdog_overflow_callback, NULL); // C4

if (IS_ERR(evt)) {

 pr_debug("Perf event create on CPU %d failed with %ld\n", cpu,

   PTR_ERR(evt));

 return PTR_ERR(evt);

}

this_cpu_write(watchdog_ev, evt);

return 0;

}


```


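- C1: Use `wd_hw_attr` as the attributes of the perf event.

- C2: Set `sample_period`, so that an overflow interrupt fires after every N events.

- C3: By default the period is 10 times the CPU frequency in cycles (`cpu_khz * 1000 * 10`), so the PMI fires roughly every 10 seconds.

- C4: Register the callback for the PMI.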
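The arithmetic behind C3 can be checked with a small user-space sketch. `nmi_sample_period_cycles` repeats the formula from `hw_nmi_get_sample_period`, while `nmi_period_seconds` and the 2.5 GHz figure in the usage note are assumptions added for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as hw_nmi_get_sample_period(): cycles per overflow. */
uint64_t nmi_sample_period_cycles(uint32_t cpu_khz, int watchdog_thresh)
{
    return (uint64_t)cpu_khz * 1000 * (uint64_t)watchdog_thresh;
}

/* Hypothetical helper: seconds between overflows if the core really
 * runs at actual_khz while it is not halted. */
double nmi_period_seconds(uint64_t period_cycles, uint32_t actual_khz)
{
    assert(actual_khz > 0);
    return (double)period_cycles / ((double)actual_khz * 1000.0);
}
```

For a CPU with `cpu_khz = 2500000` (2.5 GHz) and `watchdog_thresh = 10`, the period is 25,000,000,000 cycles, which takes 10 seconds at that frequency. Because the event counts unhalted core cycles, the wall-clock interval is only approximate: it is likely shorter when the core runs above its nominal frequency and longer when the CPU spends time halted.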


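At this point the NMI watchdog has its event, `PERF_COUNT_HW_CPU_CYCLES`, and its overflow threshold, `sample_period`. In sampling mode the counter overflows about once every 10 seconds and raises a PMI. The interrupt's callback, `watchdog_overflow_callback`, calls `is_hardlockup` to decide whether a hardlockup has occurred.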


So how are the PMI and the NMI related? The next section looks at the Intel APIC to explain the relationship.


## NMI


### Interrupts supported by the APIC


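Traditionally, a motherboard carried two main chips: the northbridge (Memory Controller Hub) and the southbridge (I/O Controller Hub). The CPU reached other devices through them: the northbridge connected high-speed devices and the southbridge, and the southbridge connected low-speed devices. With the first-generation Core i7, Intel moved the memory controller into the CPU, leaving the Intel IOH (I/O Hub) northbridge with only the job of connecting high-speed devices such as graphics cards. Starting with LGA 1156 and LGA 2011, Intel processors integrated the northbridge functions (memory controller, high-speed PCI Express controller, and Intel HD Graphics) as part of the uncore, also known as the system agent [2]. Since then only the southbridge has remained on the motherboard, and Intel calls it the Platform Controller Hub (PCH) [3].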


In modern computers, the Advanced Programmable Interrupt Controller (APIC) usually consists of two parts: the LAPIC (Local APIC) [4] and the IOAPIC (I/O APIC) [3]. The LAPIC is integrated into the CPU, and the IOAPIC usually resides in the PCH.


<center>

<img style="border-radius: 0.3125em;

box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"

src="https://ata2-img.oss-cn-zhangjiakou.aliyuncs.com/neweditor/c8294b0a-9f29-45ff-af66-034cf19ab9ae.png"

width="80%">

<br>

<div style="color:orange; border-bottom: 1px solid #d9d9d9;

display: inline-block;

color: #999;

padding: 2px;">Fig 1. APIC </div>

</center>


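The LAPIC in the processor has two main functions:

- It receives interrupts from the processor's interrupt pins (for example LINT0 and LINT1), internal interrupts generated by the LAPIC itself, and external interrupts from the I/O APIC.

- In multiprocessor systems, it also sends interprocessor interrupts (IPIs).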


The LVT (local vector table) registers let software configure how the following local interrupts are delivered [4]:


- LVT CMCI Register (FEE0 02F0H) — Specifies interrupt delivery when an overflow condition of corrected machine check error count reaching a threshold value occurred in a machine check bank supporting CMCI.

- LVT Timer Register (FEE0 0320H) — Specifies interrupt delivery when the APIC timer signals an interrupt.

- LVT Thermal Monitor Register (FEE0 0330H) — Specifies interrupt delivery when the thermal sensor generates an interrupt.

- **LVT Performance Counter Register** (FEE0 0340H) — Specifies interrupt delivery when a performance counter generates an interrupt on overflow or when Intel PT signals a ToPA PMI.

- LVT LINT0 Register (FEE0 0350H) — Specifies interrupt delivery when an interrupt is signaled at the LINT0 pin.

- LVT LINT1 Register (FEE0 0360H) — Specifies interrupt delivery when an interrupt is signaled at the LINT1 pin.

- LVT Error Register (FEE0 0370H) — Specifies interrupt delivery when the APIC detects an internal error.


<center>

<img style="border-radius: 0.3125em;

box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"

src="https://ata2-img.oss-cn-zhangjiakou.aliyuncs.com/neweditor/5197cc99-cee9-4f18-bba2-ef82483cd2d0.png"

width="60%">

<br>

<div style="color:orange; border-bottom: 1px solid #d9d9d9;

display: inline-block;

color: #999;

padding: 2px;">Fig 2. Local vector table [4]</div>

</center>


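In other words, the PMI is one of the interrupts the LAPIC supports. Once the LVT Performance Counter Register is configured accordingly, the LAPIC delivers the PMI to the processor as an NMI.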


### Kernel configuration of the PMI


In the kernel, the macro `APIC_LVTPC 0x340` defines the LVT Performance Counter Register, and `APIC_DM_NMI 0x00400` defines the NMI delivery mode.


```c

// arch/x86/include/asm/apicdef.h

#define  APIC_DM_FIXED  0x00000

#define  APIC_DM_FIXED_MASK 0x00700

#define  APIC_DM_LOWEST  0x00100

#define  APIC_DM_SMI  0x00200

#define  APIC_DM_REMRD  0x00300

#define  APIC_DM_NMI  0x00400

#define  APIC_DM_INIT  0x00500

#define  APIC_DM_STARTUP  0x00600

#define  APIC_DM_EXTINT  0x00700

#define  APIC_VECTOR_MASK 0x000FF

#define APIC_ICR2 0x310

#define  GET_APIC_DEST_FIELD(x) (((x) >> 24) & 0xFF)

#define  SET_APIC_DEST_FIELD(x) ((x) << 24)

#define APIC_LVTT 0x320

#define APIC_LVTTHMR 0x330

#define APIC_LVTPC 0x340

#define APIC_LVT0 0x350

```
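To see how a value such as `APIC_DM_NMI` maps onto the register, the sketch below decodes an LVT entry into its fields. The bit positions (vector in bits 0-7, delivery mode in bits 8-10, mask in bit 16) are taken as assumptions from the LVT layout in the Intel manual [4], and the `LVT_*` macros, `struct lvt_fields`, and `lvt_decode` are hypothetical names rather than kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed LVT field layout, following the Intel SDM figure. */
#define LVT_VECTOR_MASK 0x000FFu  /* bits 0-7: interrupt vector */
#define LVT_DM_MASK     0x00700u  /* bits 8-10: delivery mode */
#define LVT_DM_SHIFT    8
#define LVT_MASKED_BIT  0x10000u  /* bit 16: interrupt is masked */

struct lvt_fields {
    uint8_t vector;
    uint8_t delivery_mode; /* 0 = Fixed, 2 = SMI, 4 = NMI, 5 = INIT, 7 = ExtINT */
    bool masked;
};

/* Split a raw 32-bit LVT register value into its fields. */
struct lvt_fields lvt_decode(uint32_t reg)
{
    struct lvt_fields f;

    f.vector = (uint8_t)(reg & LVT_VECTOR_MASK);
    f.delivery_mode = (uint8_t)((reg & LVT_DM_MASK) >> LVT_DM_SHIFT);
    f.masked = (reg & LVT_MASKED_BIT) != 0;
    return f;
}
```

Decoding `0x00400` gives delivery mode 4 (NMI), vector 0, and an unmasked entry. The vector field carries no information in this case, since an NMI is always delivered on the fixed NMI vector.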


During initialization of the Intel hardware perf support, the kernel sets the Performance Monitoring Interrupt (PMI) to be delivered as an NMI. For perf initialization, see [5].


```c

// arch/x86/events/core.c


void perf_events_lapic_init(void)

{

if (!x86_pmu.apic || !x86_pmu_initialized())

 return;


/*

 * Always use NMI for PMU

 */

apic_write(APIC_LVTPC, APIC_DM_NMI);

}


static int __init init_hw_perf_events(void)

{

intel_pmu_init();


pmu_check_apic();


perf_events_lapic_init();   // C1

register_nmi_handler(NMI_LOCAL, perf_event_nmi_handler, 0, "PMI");  // C2


x86_pmu_show_pmu_cap(x86_pmu.num_counters,

     x86_pmu.num_counters_fixed,

     x86_pmu.intel_ctrl);


perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);

}

early_initcall(init_hw_perf_events);

```


- C1: Set the PMU's interrupt delivery mode to NMI.

- C2: The PMI's interrupt source type is `NMI_LOCAL`, and its NMI action is `perf_event_nmi_handler`.


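`NMI_LOCAL` is an NMI that comes from the LAPIC. It can only be seen and handled by the processor that owns that LAPIC. Besides `NMI_LOCAL`, the kernel also handles NMIs from the PCH: PCI SERR# (`NMI_SERR`) and IOCHK# (`NMI_IO_CHECK`), which are controlled by the NMI_STS_CNT register. These two kinds of NMI can be handled by any processor.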
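The idea of registering a handler per NMI source type and dispatching to it can be sketched as a small user-space model. This is not the kernel's code: every `model_*` and `MODEL_*` name is hypothetical, the fixed-size table is a simplification, and the handler stands in for `perf_event_nmi_handler` by simply claiming one event.

```c
#include <stddef.h>

#define MODEL_MAX_HANDLERS 8

/* Hypothetical mirror of the NMI source types mentioned in the text. */
enum model_nmi_type {
    MODEL_NMI_LOCAL,
    MODEL_NMI_UNKNOWN,
    MODEL_NMI_SERR,
    MODEL_NMI_IO_CHECK,
    MODEL_NMI_MAX
};

/* A handler returns how many events it handled (0 means "not mine"). */
typedef int (*model_nmi_handler_t)(unsigned int type, void *regs);

struct model_nmi_desc {
    model_nmi_handler_t handlers[MODEL_MAX_HANDLERS];
    size_t count;
};

static struct model_nmi_desc model_nmi_table[MODEL_NMI_MAX];

unsigned int model_pmi_count; /* how often the example handler ran */

/* Example handler standing in for perf_event_nmi_handler(). */
int model_pmi_handler(unsigned int type, void *regs)
{
    (void)type;
    (void)regs;
    model_pmi_count++;
    return 1;
}

/* Register fn for one NMI type. Returns 0 on success, -1 on error. */
int model_register_nmi_handler(unsigned int type, model_nmi_handler_t fn)
{
    struct model_nmi_desc *d;

    if (type >= MODEL_NMI_MAX || fn == NULL)
        return -1;
    d = &model_nmi_table[type];
    if (d->count == MODEL_MAX_HANDLERS)
        return -1;
    d->handlers[d->count++] = fn;
    return 0;
}

/* Call every handler registered for type and sum what they handled. */
int model_nmi_handle(unsigned int type, void *regs)
{
    int handled = 0;
    size_t i;

    if (type >= MODEL_NMI_MAX)
        return 0;
    for (i = 0; i < model_nmi_table[type].count; i++)
        handled += model_nmi_table[type].handlers[i](type, regs);
    return handled;
}
```

After `model_register_nmi_handler(MODEL_NMI_LOCAL, model_pmi_handler)`, a call to `model_nmi_handle(MODEL_NMI_LOCAL, NULL)` runs the handler, while dispatching `MODEL_NMI_SERR` leaves it untouched. That separation is the point of keeping one handler list per source type.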


## Trace NMI


### NMI interrupt


Check the state of the NMI watchdog:


```bash

# sysctl kernel.nmi_watchdog

kernel.nmi_watchdog = 1

```


If it is not enabled, turn the watchdog on with:


``` bash

# sysctl -w kernel.nmi_watchdog=1

```


With the NMI watchdog enabled, check the interrupt counts:


``` bash

# grep NMI /proc/interrupts

```
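The per-CPU counts on the `NMI:` line can also be totaled programmatically. The helper below is a minimal sketch that assumes the usual `/proc/interrupts` layout of a label, one decimal count per CPU, and a trailing text description. `sum_nmi_counts` is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counts on one /proc/interrupts line such as
 * " NMI:   12   34   Non-maskable interrupts".
 * Returns -1 if the line does not start with the "NMI:" label. */
long long sum_nmi_counts(const char *line)
{
    const char *p;
    char *end;
    long long total = 0;

    while (*line == ' ' || *line == '\t')
        line++;
    if (strncmp(line, "NMI:", 4) != 0)
        return -1;

    p = line + 4;
    for (;;) {
        unsigned long long v = strtoull(p, &end, 10);

        if (end == p)
            break;          /* no more numbers: reached the description */
        total += (long long)v;
        p = end;
    }
    return total;
}
```

A caller would read `/proc/interrupts` line by line (for example with `fgets`) and pass each line to `sum_nmi_counts`. Sampling the total twice, a little more than `watchdog_thresh` seconds apart, should show it growing on a busy system if the watchdog's PMI is firing.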


### Tracing the watchdog callback


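Use `perf probe` to define a dynamic tracepoint on `watchdog_overflow_callback`, then sample the NMIs that fire and observe the call stack of NMI handling. Because `watchdog_thresh=10`, the recording time here is set to 11 seconds so that an NMI fires within the sampling window. Tips for using `perf probe` will be covered in a later series of articles.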


``` bash

# perf probe -a watchdog_overflow_callback

Added new event:

 probe:watchdog_overflow_callback (on watchdog_overflow_callback)


You can now use it in all perf tools, such as:


perf record -e probe:watchdog_overflow_callback -aR sleep 1


# perf record -e probe:watchdog_overflow_callback -agR -- sleep 11

[ perf record: Woken up 1 times to write data ]

[ perf record: Captured and wrote 1.473 MB perf.data (1 samples) ]


# perf script

ps 127715 [009] 180101.406729: probe:watchdog_overflow_callback: (ffffffffa6197670)

ffffffffa6197671 watchdog_overflow_callback+0x1 (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

ffffffffa622101f __perf_event_overflow+0x4f (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

ffffffffa600de6d handle_pmi_common+0x1fd (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

ffffffffa600e42d intel_pmu_handle_irq+0xed (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

ffffffffa6004d48 perf_event_nmi_handler+0x28 (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

ffffffffa602e932 nmi_handle+0x52 (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

ffffffffa6a2e1a2 default_do_nmi+0x42 (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

ffffffffa6a2e3af exc_nmi+0x11f (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

ffffffffa6c0132e asm_exc_nmi+0x8e (/usr/lib/debug/lib/modules/5.10.84-004.alpha.ali5000.alios7.x86_64/vmlinux)

 7f585c23ff68 [unknown] (/usr/lib64/libprocps.so.4.0.0)

     0 [unknown] ([unknown])

```


The call stack is as follows:


```c

default_do_nmi()

=> nmi_handle(NMI_LOCAL, regs);

 => thishandled = a->handler(type, regs); => perf_event_nmi_handler()

 => ret = static_call(x86_pmu_handle_irq)(regs);

 => intel_pmu_handle_irq()

  => handle_pmi_common()

  => perf_event_overflow()

   => __perf_event_overflow()

   => event->overflow_handler() => watchdog_overflow_callback()

```


## Closing remarks


Corrections for any omissions or mistakes are welcome :)


## References


[1] https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt

[2] https://zh.wikipedia.org/wiki/%E5%8C%97%E6%A1%A5

[3] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/6th-gen-core-pch-u-y-io-datasheet-vol-2.pdf

[4] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

[5] https://blog.csdn.net/Rong_Toa/article/details/116982544
