Distributed Communication Optimization – IRQ Affinity


      During a C500K performance load test I ran into a problem: on an 8-processor machine, the load was concentrated almost entirely on CPU0 and climbed above 70, and mpstat showed CPU0 spending an unusually large share of its time handling interrupts (%irq + %soft).

      This article grew out of investigating, solving, and reflecting on that problem. I hope you find it useful; comments and discussion are welcome.

      Before the main content, let's look at two basic performance-related concepts: interrupts and context switches (in practice, well over 90% of the engineers I've asked cannot explain them clearly). I hope this article helps; if anything is unclear, feel free to reach out.


      Interrupts


        Hardware interrupts are used by devices to communicate that they require attention from the operating system. Internally, hardware interrupts are implemented using electronic alerting signals that are sent to the processor from an external device, which is either a part of the computer itself, such as a disk controller, or an external peripheral. For example, pressing a key on the keyboard or moving the mouse triggers hardware interrupts that cause the processor to read the keystroke or mouse position. Unlike the software type (described below), hardware interrupts are asynchronous and can occur in the middle of instruction execution, requiring additional care in programming. The act of initiating a hardware interrupt is referred to as an interrupt request (IRQ).


         A software interrupt is caused either by an exceptional condition in the processor itself, or a special instruction in the instruction set which causes an interrupt when it is executed. The former is often called a trap or exception and is used for errors or events occurring during program execution that are exceptional enough that they cannot be handled within the program itself. For example, if the processor's arithmetic logic unit is commanded to divide a number by zero, this impossible demand will cause a divide-by-zero exception, perhaps causing the computer to abandon the calculation or display an error message. Software interrupt instructions function similarly to subroutine calls and are used for a variety of purposes, such as to request services from low-level system software such as device drivers. For example, computers often use software interrupt instructions to communicate with the disk controller to request data be read or written to the disk.


        In short: a hard interrupt is raised by hardware to get the CPU's attention and is handled asynchronously; a software interrupt interrupts kernel execution via an instruction and comes in two flavors, one being traps/exceptions and the other resembling subroutine calls. Software interrupts can be used to implement system calls.


      Context Switches


        In computing, a context switch is the process of storing and restoring the state (context) of a process or thread so that execution can be resumed from the same point at a later time. This enables multiple processes to share a single CPU and is an essential feature of a multitasking operating system. What constitutes the context is determined by the processor and the operating system. Context switches are usually computationally intensive, and much of the design of operating systems is to optimize the use of context switches. Switching from one process to another requires a certain amount of time for doing the administration – saving and loading registers and memory maps, updating various tables and lists etc. A context switch can mean a register context switch, a task context switch, a stack frame switch, a thread context switch, or a process context switch.

        A context switch is performed in kernel mode and, as the passage above notes, is itself computationally expensive. A system call by itself is only a switch into kernel mode, not a full context switch. Context switches take several forms: between processes, between threads, between stack frames, and so on. For example, the cswch/s and nvcswch/s metrics reported by pidstat -w are per-process counts of voluntary and non-voluntary context switches. The underlying bookkeeping is easy to see for yourself: run grep ctxt /proc/${process_id}/status and the counters are right there, as in the sketch below.
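        A minimal way to read these counters, assuming the target process id is 1234 (a placeholder, substitute your own):

        pid=1234                          # hypothetical PID, replace with your process
        grep ctxt /proc/${pid}/status     # cumulative voluntary/nonvoluntary switch counts since process start
        pidstat -w -p ${pid} 2            # cswch/s and nvcswch/s, sampled every 2 seconds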

        A question then arises: is there a quantitative relationship between interrupts and context switches? I went through a fair amount of literature and came back empty-handed. In the end I looked at how the cs, %soft and %irq statistics are actually collected: the interrupt figures come from reading /proc/interrupts and /proc/softirqs (the sysstat links in the references cover the implementation in detail), while per-process context switches come from watching /proc/${process_id}/status. As for a precise relationship between the two, I have not been able to pin one down with my current understanding; corrections and pointers are welcome.
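        To observe the raw numbers yourself (pure observation, no tuning yet):

        mpstat -P ALL 2 3                 # %irq / %soft per CPU, three 2-second samples
        cat /proc/softirqs | head -6      # raw per-CPU softirq counters (HI, TIMER, NET_TX, NET_RX, ...)
        cat /proc/interrupts | head -5    # per-CPU hardware interrupt counters, one row per IRQ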


        OK, let's now go through the basic techniques needed for this affinity-tuning exercise: RPS/RFS, irqbalance, and SMP IRQ affinity.

      

      RPS/RFS - Receive Packet Steering / Receive Flow Steering


        These are patches developed by Google engineers and merged into the kernel starting with 2.6.35. In short, RPS hashes the TCP or UDP packet headers and uses the result to pick the CPU that runs the receive softirq processing; RFS goes a step further and steers packets toward the CPU on which the consuming application is running. One passage in the kernel documentation sums up the applicable scenarios best, quoted below. Roughly: for a single-queue NIC, or when there are fewer hardware queues than CPU cores, RPS/RFS is the tool of choice, provided the selected CPUs share the same memory domain as the interrupting CPU.

        For a single queue device, a typical RPS configuration would be to set the rps_cpus to the CPUs in the same memory domain of the interrupting CPU. If NUMA locality is not an issue, this could also be all CPUs in the system. At high interrupt rate, it might be wise to exclude the interrupting CPU from the map since that already performs much work. For a multi-queue system, if RSS is configured so that a hardware receive queue is mapped to each CPU, then RPS is probably redundant and unnecessary. If there are fewer hardware queues than CPUs, then RPS might be beneficial if the rps_cpus for each queue are the ones that share the same memory domain as the interrupting CPU for that queue.
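        A minimal configuration sketch (run as root) following the scaling.txt recommendations, assuming the interface is eth0 with a single receive queue rx-0 and that CPUs 0-7 (mask ff) should be eligible:

        echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus        # RPS: CPUs allowed to process eth0/rx-0 softirqs
        echo 32768 > /proc/sys/net/core/rps_sock_flow_entries     # RFS: global flow table size
        echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt # RFS: per-queue share (global size / number of rx queues)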

        Two more questions follow: how do you tell whether a NIC is multi-queue, and how do you guarantee the shared-memory-domain condition? One approach for the first question is to locate the NIC with lspci and then inspect its capabilities:

        lspci | grep 'Ethernet controller'           # note the device address, e.g. 01:00.0
        lspci -vvv -s 01:00.0 | grep -i 'MSI-X'      # 01:00.0 is an example address; use the one printed above


        If the capability listing shows MSI-X with Enable+ and a table size greater than 1 (shown as Count= or TabSize depending on the lspci version), the NIC is multi-queue. For the second question, use lscpu to see the CPU/NUMA topology and bind the interrupts to physical CPUs that sit in the same memory domain as the interrupting CPU; a couple of quick checks are sketched below.
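        Assuming the interface is eth0 (adjust the name for your system), the driver's queue layout and the NUMA topology can be checked like this:

        ls /sys/class/net/eth0/queues/       # rx-0 ... rx-N / tx-0 ... tx-N, one entry per queue
        grep eth0 /proc/interrupts           # multi-queue drivers register one IRQ per queue (e.g. eth0-TxRx-0, eth0-TxRx-1, ...)
        lscpu | grep -iE 'numa|socket'       # which logical CPUs belong to which NUMA node / socket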

      Irqbalance


        The man page describes it as a daemon that will "distribute hardware interrupts across processors on a multiprocessor system". It has had quite a few problems on SMP systems; the Ubuntu bug tracker is instructive reading. Chu Ba (褚霸) has also published a detailed analysis of its source code for those who are interested.
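        One practical note before pinning interrupts by hand: a running irqbalance daemon may periodically rewrite /proc/irq/*/smp_affinity and silently undo manual settings, so check for it and stop it first (the exact service command depends on your init system):

        ps -ef | grep '[i]rqbalance'         # is the daemon running?
        sudo service irqbalance stop         # SysV-style init
        # sudo systemctl stop irqbalance     # systemd-based systems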

       SMP IRQ Affinity


        Finally, let's look at SMP IRQ affinity, which has been part of the kernel since 2.4:

        An interrupt request (IRQ) is a request for service, sent at the hardware level. Interrupts can be sent by either a dedicated hardware line, or across a hardware bus as an information packet (a Message Signaled Interrupt, or MSI). When interrupts are enabled, receipt of an IRQ prompts a switch to interrupt context. Kernel interrupt dispatch code retrieves the IRQ number and its associated list of registered Interrupt Service Routines (ISRs), and calls each ISR in turn. The ISR acknowledges the interrupt and ignores redundant interrupts from the same IRQ, then queues a deferred handler to finish processing the interrupt and stop the ISR from ignoring future interrupts.

       /proc/interrupts lists the IRQ number, the number of that interrupt handled by each CPU core, the interrupt type, and a comma-delimited list of drivers that are registered to receive that interrupt. (Refer to the proc(5) man page for further details: man 5 proc.)


       /proc/irq/IRQ_NUMBER/smp_affinity holds the interrupt's CPU affinity as a hexadecimal bitmask of the CPUs allowed to service that IRQ. This property can be used to improve application performance by assigning both interrupt affinity and the application's thread affinity to one or more specific CPU cores, which allows cache line sharing between the specified interrupt and the application threads.
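       For instance (a sketch with placeholder values: IRQ 42 for the NIC and process id 1234 for the application), pin the interrupt to CPU2 (bitmask 0x04) and bind the application to the same core with taskset so that interrupt handling and the consuming threads share cache:

       echo 4 | sudo tee /proc/irq/42/smp_affinity   # bitmask 0x04 = CPU2 only
       sudo taskset -cp 2 1234                       # bind process 1234 to CPU2 as well
       cat /proc/irq/42/smp_affinity                 # confirm the new mask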

       How do you verify that your interrupt affinity settings actually took effect? Walk through the following steps:


       a. Find the NIC's interrupt number:

       cat /proc/interrupts

      b. Check that IRQ's current CPU affinity:

       sudo cat /proc/irq/42/smp_affinity

      c. Change the binding:

       echo ff | sudo tee /proc/irq/42/smp_affinity   # note: 'sudo echo ff > ...' fails because the redirection runs without root

      d. Generate traffic toward a particular host:

       sudo ping -f www.creative.com

      e. Check where the interrupts landed:

       cat /proc/interrupts | grep  'CPU\|42:'
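       To watch the effect continuously rather than via one-off snapshots (same placeholder IRQ 42 as above), something like this highlights which CPU columns are incrementing:

       watch -n1 -d "grep -E 'CPU|42:' /proc/interrupts"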

Summary:

       On my multi-queue NIC, I manually set the SMP IRQ affinity values and ruled out interference from the other two techniques, which resolved the performance problem described at the beginning. The quantitative relationship between interrupts and context switches mentioned above still awaits further digging; if I learn more I will share it. I hope you enjoyed the article!


References:

1. http://en.wikipedia.org/wiki/Interrupt
2. http://en.wikipedia.org/wiki/Context_switch
3. http://wenku.baidu.com/view/315d2c8571fe910ef12df838.html
4. https://www.kernel.org/doc/Documentation/networking/scaling.txt
5. http://kernelnewbies.org/Linux_2_6_35
6. https://cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt
7. http://www.linfo.org/context_switch.html
8. http://lwn.net/Articles/328339/
9. http://lwn.net/Articles/398385/
10. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/network-rps.html
11. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html
12. http://www.softpanorama.org/Admin/Monitoring/Sar/linux_implementation_of_sar.shtml
13. http://choices.cs.uiuc.edu/ExpCS07.pdf
14. http://sebastien.godard.pagesperso-orange.fr/
15. http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html


