1. Introduction
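CoreBolt is a performance optimization solution for the Yitian platform. It uses Coresight to collect execution information while a program runs, separates the program's hot code from its cold code, and then uses BOLT to reorder the program's code sections. This improves code locality, reduces the performance loss caused by CPU iCache misses and iTLB misses at run time, and so improves the program's overall performance. CoreBolt depends on the Coresight hardware trace collection capability provided by the Alibaba Cloud Linux 3 operating system and on the ability of Alibaba Cloud Compiler to optimize ARM binaries with BOLT. Coresight and BOLT are each described in more detail in their own documentation.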
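Applicable scenarios
CoreBolt depends on Yitian hardware features, so the optimization process must run on a Yitian machine. The optimized binary conforms to the ELF standard and can run on most ARM platforms.
CoreBolt applies to most workloads, but the gain differs from one application to another: the higher the iCache miss, iTLB miss, and front-end stall rates, the larger the improvement.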
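2. Using BOLT and Coresight
Building the program
Make the following changes to the build scripts of the target program.
Disable sanitizers such as ASan.
Pass the additional linker options -Wl,--build-id=sha1 -Wl,--emit-relocs.
If the compiler is GCC (GCC 8 or later), add the compiler option -fno-reorder-blocks-and-partition.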
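The build requirements above can be collected in a small helper so that every build script uses the same flags. This is only a sketch: bolt_cflags, bolt_ldflags, and app.c are hypothetical names, and the helper does not check the GCC version.

```shell
#!/bin/sh
# Sketch: print the extra flags a BOLT-ready build needs.
# bolt_cflags and bolt_ldflags are hypothetical helper names.

# Extra compiler flags; $1 is the compiler family ("gcc" or anything else).
bolt_cflags() {
    case "$1" in
        gcc) echo "-fno-reorder-blocks-and-partition" ;;
        *)   echo "" ;;
    esac
}

# Extra linker flags, the same for every compiler.
bolt_ldflags() {
    echo "-Wl,--build-id=sha1 -Wl,--emit-relocs"
}

# Example: print the full command line for a GCC build of app.c
# (app.c is a placeholder source file; nothing is compiled here).
echo "gcc -O2 $(bolt_cflags gcc) $(bolt_ldflags) app.c -o app"
```

Keeping the flags in one place makes it harder to forget --emit-relocs, which BOLT needs in order to run in relocation mode.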
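Sampling environment
Purchase a Yitian bare metal instance from ECS running Alibaba Cloud Linux 3.2104 LTS 64-bit for ARM. All such instances purchased after this document was written (2023-12-22) support Coresight sampling.
Use the sampling environment for offline sampling only; avoid sampling directly in a production environment.
Environment preparation
Load the drivers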
modprobe coresight
modprobe coresight-catu
modprobe coresight-funnel
modprobe coresight-tmc
modprobe coresight-cti
modprobe coresight-replicator
modprobe coresight-etm4x
modprobe coresight-tpiu
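Take cores 64-127 offline. Save the following script as offline.sh:
#!/bin/sh
# Usage: sh offline.sh FIRST_CPU LAST_CPU STATE (0 = offline, 1 = online)
for i in $(seq "$1" "$2")
do
    echo "$3" > "/sys/bus/cpu/devices/cpu$i/online"
done
Then run it:
sh offline.sh 64 127 0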
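To confirm that the cores really went offline, the kernel's list of online CPUs can be read back from sysfs. The sketch below counts the CPUs in a cpulist string; count_cpus is a hypothetical helper name, and the expected range assumes a machine whose cores are numbered 0-127.

```shell
#!/bin/sh
# Sketch: count the CPUs named in a Linux cpulist string such as "0-63"
# or "0-3,8-11". count_cpus is a hypothetical helper name.
count_cpus() {
    echo "$1" | tr ',' '\n' | awk -F- '
        NF == 1 { n += 1 }             # single CPU, e.g. "5"
        NF == 2 { n += $2 - $1 + 1 }   # range, e.g. "0-63"
        END     { print n + 0 }'
}

# After taking cores 64-127 offline (assuming cores are numbered 0-127),
# the online list is expected to read 0-63, i.e. 64 CPUs.
if [ -r /sys/devices/system/cpu/online ]; then
    online=$(cat /sys/devices/system/cpu/online)
    echo "online CPUs: $online ($(count_cpus "$online") total)"
fi
```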
Install ACC (Alibaba Cloud Compiler)
yum install -y alibaba-cloud-compiler
Sampling with perf
perf record -e cs_etm//u ./app
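For more ways to use Coresight through perf, see "Arm Coresight".
Storing and converting perf data
Process perf.data with perf inject, then use perf2bolt to convert the injected data to the fdata format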
perf inject -i perf.data -o inj.x.data --itrace=i300000il128
/opt/alibaba-cloud-compiler/bin/perf2bolt -p inj.x.data -o perf.fdata libjvm.so
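Optimizing with BOLT
On aarch64, pass -no-scan if you apply BOLT to only a subset of functions.
-split-all-cold usually gives better results when enough sampling data is available.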
/opt/alibaba-cloud-compiler/bin/llvm-bolt libjvm.so -o libjvm.bolt.so -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -dyno-stats
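The three post-processing commands above (perf inject, perf2bolt, llvm-bolt) always run in the same order, so they can be chained in one wrapper. This is a sketch under assumptions: bolt_pipeline, run, and the DRY_RUN variable are hypothetical names, the input is assumed to be perf.data in the current directory, and the output is named by appending .bolt to the input name, which differs from the libjvm.bolt.so name used above.

```shell
#!/bin/sh
# Sketch: chain the three post-processing steps for one binary.
# bolt_pipeline, run, and DRY_RUN are hypothetical names.
ACC_BIN=/opt/alibaba-cloud-compiler/bin

run() {
    # Print the command; execute it only when DRY_RUN is not 1.
    echo "+ $*"
    [ "${DRY_RUN:-0}" = 1 ] || "$@"
}

bolt_pipeline() {
    bin=$1
    run perf inject -i perf.data -o inj.x.data --itrace=i300000il128 &&
    run "$ACC_BIN/perf2bolt" -p inj.x.data -o perf.fdata "$bin" &&
    run "$ACC_BIN/llvm-bolt" "$bin" -o "$bin.bolt" -data=perf.fdata \
        -reorder-blocks=ext-tsp -reorder-functions=hfsort \
        -split-functions -split-all-cold -dyno-stats
}

# Dry run: only print the commands that would be executed.
DRY_RUN=1
bolt_pipeline libjvm.so
```

Unset DRY_RUN (or set it to 0) to execute the steps; the chain stops at the first command that fails.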
3. Example: bubble sort
Code
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define ARRAY_LEN 30000

static struct timeval tm1;

static inline void start() {
    gettimeofday(&tm1, NULL);
}

static inline void stop() {
    struct timeval tm2;
    gettimeofday(&tm2, NULL);
    unsigned long long t = 1000 * (tm2.tv_sec - tm1.tv_sec) +\
                           (tm2.tv_usec - tm1.tv_usec) / 1000;
    printf("%llu ms\n", t);
}

void bubble_sort (int *a, int n) {
    int i, t, s = 1;
    while (s) {
        s = 0;
        for (i = 1; i < n; i++) {
            if (a[i] < a[i - 1]) {
                t = a[i];
                a[i] = a[i - 1];
                a[i - 1] = t;
                s = 1;
            }
        }
    }
}

void sort_array() {
    printf("Bubble sorting array of %d elements\n", ARRAY_LEN);
    int data[ARRAY_LEN], i;
    for(i=0; i<ARRAY_LEN; ++i){
        data[i] = rand();
    }
    bubble_sort(data, ARRAY_LEN);
}

int main(){
    start();
    sort_array();
    stop();
    return 0;
}
Compile and run
root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# gcc -Wl,--build-id=sha1 -Wl,--emit-relocs -O3 sort.c -o sort
root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# ./sort
Bubble sorting array of 30000 elements
939 ms
The run takes 939 ms.
Collect a trace with Coresight
root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# perf record -m ,16M -e cs_etm//u ./sort
Bubble sorting array of 30000 elements
941 ms
[ perf record: Woken up 2 times to write data ]
Warning:
AUX data lost 2 times out of 2!
[ perf record: Captured and wrote 32.012 MB perf.data ]
Convert the perf data to BOLT data. The conversion can take a long time.
root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# perf inject -i perf.data -o perf.x.data --itrace=i300000il64
Use perf2bolt to convert the injected data to the fdata format
root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# /opt/alibaba-cloud-compiler/bin/perf2bolt -p perf.x.data -o perf.fdata sort
PERF2BOLT: Starting data aggregation job for perf.x.data
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: aarch64
BOLT-INFO: BOLT version:
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x600000, offset 0x200000
BOLT-INFO: enabling relocation mode
BOLT-INFO: disabling -align-macro-fusion on non-x86 platform
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-INFO: binary build-id is: b9d4933d67e120c60a56b7f96fbf93e5a2961f98
PERF2BOLT: spawning perf job to read buildid list
PERF2BOLT: matched build-id and file name
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 1 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 485570 samples and 30968485 LBR entries
PERF2BOLT: 0 samples (0.0%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 0 (0.0%)
PERF2BOLT: out of range traces involving unknown regions: 61 (0.0%)
PERF2BOLT: waiting for perf mem events collection to finish...
PERF2BOLT: processing branch events...
PERF2BOLT: wrote 15 objects and 0 memory objects to perf.fdata
Apply the BOLT optimization
root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# /opt/alibaba-cloud-compiler/bin/llvm-bolt sort -o sort.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -dyno-stats
BOLT-INFO: Target architecture: aarch64
BOLT-INFO: BOLT version:
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x600000, offset 0x200000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: disabling -align-macro-fusion on non-x86 platform
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: Simple Rate Report
Simple Rate: 12 / 21 = 57.14%
Simple Profile data Rate: 0 / 0 = nan%
BOLT-INFO: number of removed linker-inserted veneers: 0
BOLT-INFO: 1 out of 15 functions in the binary (6.7%) have non-empty execution profile
BOLT-INFO: basic block reordering modified layout of 1 functions (100.00% of profiled, 4.76% of total)
BOLT-INFO: 0 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
15446 : executed forward branches
7274 : taken forward branches
15446 : executed backward branches
15446 : taken backward branches
0 : executed unconditional branches
0 : all function calls
0 : indirect calls
0 : PLT calls
108762 : executed instructions
0 : executed load instructions
0 : executed store instructions
0 : taken jump table branches
0 : taken unknown indirect branches
30892 : total branches
22720 : taken branches
8172 : non-taken conditional branches
22720 : taken conditional branches
30892 : all conditional branches
0 : linker-inserted veneer calls

15446 : executed forward branches (=)
0 : taken forward branches (-100.0%)
15446 : executed backward branches (=)
7274 : taken backward branches (-52.9%)
8043 : executed unconditional branches (+804200.0%)
0 : all function calls (=)
0 : indirect calls (=)
0 : PLT calls (=)
116805 : executed instructions (+7.4%)
0 : executed load instructions (=)
0 : executed store instructions (=)
0 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
38935 : total branches (+26.0%)
15317 : taken branches (-32.6%)
23618 : non-taken conditional branches (+189.0%)
7274 : taken conditional branches (-68.0%)
30892 : all conditional branches (=)
0 : linker-inserted veneer calls (=)
BOLT-INFO: Starting stub-insertion pass
BOLT-INFO: Inserted 0 stubs in the hot area and 0 stubs in the cold area. Shared 0 times, iterated 1 times.
BOLT-INFO: padding code to 0xa00000 to accommodate hot text
BOLT-INFO: setting _end to 0xa00368
BOLT-INFO: setting __hot_start to 0x800000
BOLT-INFO: setting __hot_end to 0x800058
BOLT-INFO: patched build-id (flipped last bit)
Run the optimized binary
root@iZ2ze8k8g2f1rg3pi03y0rZ ~/bolt# ./sort.bolt
Bubble sorting array of 30000 elements
685 ms
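Optimization result
In the example above, BOLT reduced the run time of the sort program by (941 - 685) / 941 ≈ 0.27, about 27%. The 941 ms figure comes from the run recorded under perf; measured against the 939 ms standalone run, the reduction is also about 27%.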
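The same arithmetic can be scripted when comparing timings from repeated runs. A minimal sketch, assuming both timings are in milliseconds; improvement is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: relative run-time reduction between two timings in ms.
# improvement is a hypothetical helper name.
improvement() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f%%\n", (before - after) / before * 100 }'
}

improvement 941 685   # timings from the example above; prints 27.2%
```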