cgroup v2 (Part 2)

Summary: cgroup v2 (Part 2)

5.3 IO


The “io” controller regulates the distribution of IO resources. This controller implements both weight based and absolute bandwidth or IOPS limit distribution; however, weight based distribution is available only if cfq-iosched is in use and neither scheme is available for blk-mq devices.

5.3.1 IO Interface Files
  • io.stat
    A read-only nested-keyed file.
    Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined.
Key Description
rbytes Bytes read
wbytes Bytes written
rios Number of read IOs
wios Number of write IOs
dbytes Bytes discarded
dios Number of discard IOs
  • An example read output follows:
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
  • io.cost.qos
    A read-write nested-keyed file which exists only on the root cgroup.
    This file configures the Quality of Service of the IO cost model based controller (CONFIG_BLK_CGROUP_IOCOST) which currently implements “io.weight” proportional control. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The line for a given device is populated on the first write for the device on “io.cost.qos” or “io.cost.model”. The following nested keys are defined.
Key Description
enable Weight-based control enable
ctrl “auto” or “user”
rpct Read latency percentile [0, 100]
rlat Read latency threshold
wpct Write latency percentile [0, 100]
wlat Write latency threshold
min Minimum scaling percentage [1, 10000]
max Maximum scaling percentage [1, 10000]
  • The controller is disabled by default and can be enabled by setting “enable” to 1. “rpct” and “wpct” parameters default to zero and the controller uses internal device saturation state to adjust the overall IO rate between “min” and “max”.
    When a better control quality is needed, latency QoS parameters can be configured. For example:
8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.00
  • shows that on sdb, the controller is enabled, will consider the device saturated if the 95th percentile of read completion latencies is above 75ms or write 150ms, and adjust the overall IO issue rate between 50% and 150% accordingly.
    The lower the saturation point, the better the latency QoS at the cost of aggregate bandwidth. The narrower the allowed adjustment range between “min” and “max”, the more conformant to the cost model the IO behavior. Note that the IO issue base rate may be far off from 100% and setting “min” and “max” blindly can lead to a significant loss of device capacity or control quality. “min” and “max” are useful for regulating devices which show wide temporary behavior changes - e.g. a ssd which accepts writes at the line speed for a while and then completely stalls for multiple seconds.
    When “ctrl” is “auto”, the parameters are controlled by the kernel and may change automatically. Setting “ctrl” to “user” or setting any of the percentile and latency parameters puts it into “user” mode and disables the automatic changes. The automatic mode can be restored by setting “ctrl” to “auto”.
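    As a rough sketch of how this is written (assuming the unified hierarchy is mounted at /sys/fs/cgroup and that sdb is device 8:16; the latency values are only illustrative starting points to be tuned):
echo "8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.00" > /sys/fs/cgroup/io.cost.qos
cat /sys/fs/cgroup/io.cost.qos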
  • io.cost.model
    A read-write nested-keyed file which exists only on the root cgroup.
    This file configures the cost model of the IO cost model based controller (CONFIG_BLK_CGROUP_IOCOST) which currently implements “io.weight” proportional control. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The line for a given device is populated on the first write for the device on “io.cost.qos” or “io.cost.model”. The following nested keys are defined.
Key Description
ctrl “auto” or “user”
model The cost model in use - “linear”
  • When “ctrl” is “auto”, the kernel may change all parameters dynamically. When “ctrl” is set to “user” or any other parameters are written to, “ctrl” becomes “user” and the automatic changes are disabled.
    When “model” is “linear”, the following model parameters are defined.
Key Description
`[r|w]bps` The maximum sequential IO throughput
`[r|w]seqiops` The maximum 4k sequential IOs per second
`[r|w]randiops` The maximum 4k random IOs per second
  • From the above, the builtin linear model determines the base costs of a sequential and random IO and the cost coefficient for the IO size. While simple, this model can cover most common device classes acceptably.
    The IO cost model isn’t expected to be accurate in absolute sense and is scaled to the device behavior dynamically.
    If needed, tools/cgroup/iocost_coef_gen.py can be used to generate device-specific coefficients.
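    For illustration only (the coefficient values below are made up; real numbers should come from benchmarking the device, e.g. with tools/cgroup/iocost_coef_gen.py), a linear model for 8:16 could be installed with:
echo "8:16 rbps=174000000 rseqiops=41000 rrandiops=370 wbps=178000000 wseqiops=42000 wrandiops=380" > /sys/fs/cgroup/io.cost.model
    Per the paragraph above, writing any of these parameters switches “ctrl” to “user”.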
  • io.weight
    A read-write flat-keyed file which exists on non-root cgroups. The default is “default 100”.
    The first line is the default weight applied to devices without specific override. The rest are overrides keyed by $MAJ:$MIN device numbers and not ordered. The weights are in the range [1, 10000] and specify the relative amount of IO time the cgroup can use in relation to its siblings.
    The default weight can be updated by writing either “default $WEIGHT” or simply “$WEIGHT”. Overrides can be set by writing “$MAJ:$MIN $WEIGHT” and unset by writing “$MAJ:$MIN default”.
    An example read output follows:
default 100
8:16 200
8:0 50
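    For example, the weights shown above could be produced by writes like the following (run from inside the cgroup's directory):
echo "default 100" > io.weight
echo "8:16 200" > io.weight
echo "8:0 50" > io.weight
echo "8:0 default" > io.weight    # remove the per-device override again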
  • io.max
    A read-write nested-keyed file which exists on non-root cgroups.
    BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined.
Key Description
rbps Max read bytes per second
wbps Max write bytes per second
riops Max read IO operations per second
wiops Max write IO operations per second
  • When writing, any number of nested key-value pairs can be specified in any order. “max” can be specified as the value to remove a specific limit. If the same key is specified multiple times, the outcome is undefined.
    BPS and IOPS are measured in each IO direction and IOs are delayed if limit is reached. Temporary bursts are allowed.
    Setting read limit at 2M BPS and write at 120 IOPS for 8:16:
echo "8:16 rbps=2097152 wiops=120" > io.max
  • Reading returns the following:
8:16 rbps=2097152 wbps=max riops=max wiops=120
  • Write IOPS limit can be removed by writing the following:
echo "8:16 wiops=max" > io.max
  • Reading now returns the following:
8:16 rbps=2097152 wbps=max riops=max wiops=max
  • io.pressure
    A read-only nested-keyed file which exists on non-root cgroups.
    Shows pressure stall information for IO. See PSI - Pressure Stall Information for details.
5.3.2 Writeback

Page cache is dirtied through buffered writes and shared mmaps and written asynchronously to the backing filesystem by the writeback mechanism. Writeback sits between the memory and IO domains and regulates the proportion of dirty memory by balancing dirtying and write IOs.

The io controller, in conjunction with the memory controller, implements control of page cache writeback IOs. The memory controller defines the memory domain that dirty memory ratio is calculated and maintained for and the io controller defines the io domain which writes out dirty pages for the memory domain. Both system-wide and per-cgroup dirty memory states are examined and the more restrictive of the two is enforced.

cgroup writeback requires explicit support from the underlying filesystem. Currently, cgroup writeback is implemented on ext2, ext4, btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are attributed to the root cgroup.

There are inherent differences in memory and writeback management which affects how cgroup ownership is tracked. Memory is tracked per page while writeback per inode. For the purpose of writeback, an inode is assigned to a cgroup and all IO requests to write dirty pages from the inode are attributed to that cgroup.

As cgroup ownership for memory is tracked per page, there can be pages which are associated with different cgroups than the one the inode is associated with. These are called foreign pages. The writeback constantly keeps track of foreign pages and, if a particular foreign cgroup becomes the majority over a certain period of time, switches the ownership of the inode to that cgroup.

While this model is enough for most use cases where a given inode is mostly dirtied by a single cgroup even when the main writing cgroup changes over time, use cases where multiple cgroups write to a single inode simultaneously are not supported well. In such circumstances, a significant portion of IOs are likely to be attributed incorrectly. As memory controller assigns page ownership on the first use and doesn’t update it until the page is released, even if writeback strictly follows page ownership, multiple cgroups dirtying overlapping areas wouldn’t work as expected. It’s recommended to avoid such usage patterns.

The sysctl knobs which affect writeback behavior are applied to cgroup writeback as follows.

  • vm.dirty_background_ratio, vm.dirty_ratio
    These ratios apply the same to cgroup writeback with the amount of available memory capped by limits imposed by the memory controller and system-wide clean memory.
  • vm.dirty_background_bytes, vm.dirty_bytes
    For cgroup writeback, this is calculated into ratio against total available memory and applied the same way as vm.dirty[_background]_ratio.
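As a brief setup sketch (the mount point /sys/fs/cgroup and the child name "webserver" are illustrative), the memory and io controllers are normally enabled together on a subtree so that a cgroup's dirty-memory state and the writeback IOs generated for it are attributed to the same group:

echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/webserver
echo $$ > /sys/fs/cgroup/webserver/cgroup.procs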
5.3.3 IO Latency

This is a cgroup v2 controller for IO workload protection. You provide a group with a latency target, and if the average latency exceeds that target the controller will throttle any peers that have a lower latency target than the protected workload.

The limits are only applied at the peer level in the hierarchy. This means that in the diagram below, only groups A, B, and C will influence each other, and groups D and F will influence each other. Group G will influence nobody:

        [root]
      /    |    \
     A     B     C
    / \    |
   D   F   G

So the ideal way to configure this is to set io.latency in groups A, B, and C. Generally you do not want to set a value lower than the latency your device supports. Experiment to find the value that works best for your workload. Start at higher than the expected latency for your device and watch the avg_lat value in io.stat for your workload group to get an idea of the latency you see during normal operation. Use the avg_lat value as a basis for your real setting, setting at 10-15% higher than the value in io.stat.
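A minimal configuration sketch following the diagram above (the cgroup names, the 8:16 device and the latency targets are illustrative, and the io controller is assumed to be enabled in the root's cgroup.subtree_control):

mkdir -p /sys/fs/cgroup/A /sys/fs/cgroup/B /sys/fs/cgroup/C
echo "8:16 target=25000" > /sys/fs/cgroup/A/io.latency    # protect A at 25ms
echo "8:16 target=50000" > /sys/fs/cgroup/B/io.latency
echo "8:16 target=75000" > /sys/fs/cgroup/C/io.latency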

5.3.4 How IO Latency Throttling Works

io.latency is work conserving; so as long as everybody is meeting their latency target the controller doesn’t do anything. Once a group starts missing its target it begins throttling any peer group that has a higher target than itself. This throttling takes 2 forms:

  • Queue depth throttling. This is the number of outstanding IO’s a group is allowed to have. We will clamp down relatively quickly, starting at no limit and going all the way down to 1 IO at a time.
  • Artificial delay induction. There are certain types of IO that cannot be throttled without possibly adversely affecting higher priority groups. This includes swapping and metadata IO. These types of IO are allowed to occur normally, however they are “charged” to the originating group. If the originating group is being throttled you will see the use_delay and delay fields in io.stat increase. The delay value is how many microseconds that are being added to any process that runs in this group. Because this number can grow quite large if there is a lot of swapping or metadata IO occurring we limit the individual delay events to 1 second at a time.

Once the victimized group starts meeting its latency target again it will start unthrottling any peer groups that were throttled previously. If the victimized group simply stops doing IO the global counter will unthrottle appropriately.

5.3.5 IO Latency Interface Files
  • io.latency
    This takes a similar format as the other controllers.
“MAJOR:MINOR target=<target time in microseconds>”
  • io.stat
    If the controller is enabled you will see extra stats in io.stat in addition to the normal ones.
  • depth
    This is the current queue depth for the group.
  • avg_lat
    This is an exponential moving average with a decay rate of 1/exp bound by the sampling interval. The decay rate interval can be calculated by multiplying the win value in io.stat by the corresponding number of samples based on the win value.
  • win
    The sampling window size in milliseconds. This is the minimum duration of time between evaluation events. Windows only elapse with IO activity. Idle periods extend the most recent window.


5.4 PID


The process number controller is used to allow a cgroup to stop any new tasks from being fork()’d or clone()’d after a specified limit is reached.

The number of tasks in a cgroup can be exhausted in ways which other controllers cannot prevent, thus warranting its own controller. For example, a fork bomb is likely to exhaust the number of tasks before hitting memory restrictions.

Note that PIDs used in this controller refer to TIDs, process IDs as used by the kernel.

5.4.1 PID Interface Files
  • pids.max
    A read-write single value file which exists on non-root cgroups. The default is “max”.
    Hard limit of number of processes.
  • pids.current
    A read-only single value file which exists on all cgroups.
    The number of processes currently in the cgroup and its descendants.

Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough processes to the cgroup such that pids.current is larger than pids.max. However, it is not possible to violate a cgroup PID policy through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated.
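A short usage sketch (the cgroup name "jobs" is illustrative; the pids controller must be enabled in the parent's cgroup.subtree_control):

echo "+pids" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/jobs
echo 64 > /sys/fs/cgroup/jobs/pids.max    # cap the subtree at 64 tasks
cat /sys/fs/cgroup/jobs/pids.current      # tasks currently charged to the subtree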


5.5 Cpuset


The “cpuset” controller provides a mechanism for constraining the CPU and memory node placement of tasks to only the resources specified in the cpuset interface files in a task’s current cgroup. This is especially valuable on large NUMA systems where placing jobs on properly sized subsets of the systems with careful processor and memory placement to reduce cross-node memory access and contention can improve overall system performance.

The “cpuset” controller is hierarchical. That means the controller cannot use CPUs or memory nodes not allowed in its parent.

5.5.1 Cpuset Interface Files
  • cpuset.cpus
    A read-write multiple values file which exists on non-root cpuset-enabled cgroups.
    It lists the requested CPUs to be used by tasks within this cgroup. The actual list of CPUs to be granted, however, is subjected to constraints imposed by its parent and can differ from the requested CPUs.
    The CPU numbers are comma-separated numbers or ranges. For example:
# cat cpuset.cpus
0-4,6,8-10
  • An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty “cpuset.cpus” or all the available CPUs if none is found.
    The value of “cpuset.cpus” stays constant until the next update and won’t be affected by any CPU hotplug events.
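    For example, the CPU list shown above could be requested with a plain write, and the granted set checked afterwards (assuming the cpuset controller is enabled for this cgroup):
echo "0-4,6,8-10" > cpuset.cpus
cat cpuset.cpus.effective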
  • cpuset.cpus.effective
    A read-only multiple values file which exists on all cpuset-enabled cgroups.
    It lists the onlined CPUs that are actually granted to this cgroup by its parent. These CPUs are allowed to be used by tasks within the current cgroup.
    If “cpuset.cpus” is empty, the “cpuset.cpus.effective” file shows all the CPUs from the parent cgroup that can be available to be used by this cgroup. Otherwise, it should be a subset of “cpuset.cpus” unless none of the CPUs listed in “cpuset.cpus” can be granted. In this case, it will be treated just like an empty “cpuset.cpus”.
    Its value will be affected by CPU hotplug events.
  • cpuset.mems
    A read-write multiple values file which exists on non-root cpuset-enabled cgroups.
    It lists the requested memory nodes to be used by tasks within this cgroup. The actual list of memory nodes granted, however, is subjected to constraints imposed by its parent and can differ from the requested memory nodes.
    The memory node numbers are comma-separated numbers or ranges. For example:
# cat cpuset.mems
0-1,3
  • An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty “cpuset.mems” or all the available memory nodes if none is found.
    The value of “cpuset.mems” stays constant until the next update and won’t be affected by any memory nodes hotplug events.
  • cpuset.mems.effective
    A read-only multiple values file which exists on all cpuset-enabled cgroups.
    It lists the onlined memory nodes that are actually granted to this cgroup by its parent. These memory nodes are allowed to be used by tasks within the current cgroup.
    If “cpuset.mems” is empty, it shows all the memory nodes from the parent cgroup that will be available to be used by this cgroup. Otherwise, it should be a subset of “cpuset.mems” unless none of the memory nodes listed in “cpuset.mems” can be granted. In this case, it will be treated just like an empty “cpuset.mems”.
    Its value will be affected by memory nodes hotplug events.
  • cpuset.cpus.partition
    A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable.
    It accepts only the following input values when written to.
    “root” - a partition root
    “member” - a non-root member of a partition
    When set to be a partition root, the current cgroup is the root of a new partition or scheduling domain that comprises itself and all its descendants except those that are separate partition roots themselves and their descendants. The root cgroup is always a partition root.
    There are constraints on where a partition root can be set. It can only be set in a cgroup if all the following conditions are true.
  • The “cpuset.cpus” is not empty and the list of CPUs are exclusive, i.e. they are not shared by any of its siblings.
  • The parent cgroup is a partition root.
  • The “cpuset.cpus” is also a proper subset of the parent’s “cpuset.cpus.effective”.
  • There are no child cgroups with cpuset enabled. This eliminates corner cases that would have to be handled if such a condition were allowed.
    Setting it to partition root will take the CPUs away from the effective CPUs of the parent cgroup. Once it is set, this file cannot be reverted back to “member” if there are any child cgroups with cpuset enabled.
    A parent partition cannot distribute all its CPUs to its child partitions. There must be at least one cpu left in the parent partition.
    Once becoming a partition root, changes to “cpuset.cpus” are generally allowed as long as the first condition above holds, the change does not take away all the CPUs from the parent partition, and the new “cpuset.cpus” value is a superset of its children’s “cpuset.cpus” values.
    Sometimes, external factors like changes to ancestors’ “cpuset.cpus” or cpu hotplug can cause the state of the partition root to change. On read, the “cpuset.cpus.partition” file can show the following values.
    “member” - a non-root member of a partition
    “root” - a partition root
    “root invalid” - an invalid partition root
    It is a partition root if the first 2 partition root conditions above are true and at least one CPU from “cpuset.cpus” is granted by the parent cgroup.
    A partition root can become invalid if none of the CPUs requested in “cpuset.cpus” can be granted by the parent cgroup or the parent cgroup is no longer a partition root itself. In this case, it is not a real partition even though the restriction of the first partition root condition above will still apply. The cpu affinity of all the tasks in the cgroup will then be associated with CPUs in the nearest ancestor partition.
    An invalid partition root can be transitioned back to a real partition root if at least one of the requested CPUs can now be granted by its parent. In this case, the cpu affinity of all the tasks in the formerly invalid partition will be associated to the CPUs of the newly formed partition. Changing the partition state of an invalid partition root to “member” is always allowed even if child cpusets are present.
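A sketch of carving out a partition (the path, the name "rt" and the CPU list are illustrative; the parent here is the root cgroup, which is always a partition root and must keep at least one CPU for itself):

echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/rt
echo "2-3" > /sys/fs/cgroup/rt/cpuset.cpus
echo root > /sys/fs/cgroup/rt/cpuset.cpus.partition
cat /sys/fs/cgroup/rt/cpuset.cpus.partition    # "root", or "root invalid" if the CPUs could not be granted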


5.6 Device Controller


The device controller manages access to device files. This covers both the creation of new device files (via mknod) and access to existing device files.

The cgroup v2 device controller has no interface files; it is implemented on top of cgroup BPF. To control access to device files, the user creates a BPF program of type BPF_CGROUP_DEVICE and attaches it to the target cgroup. On an attempt to access a device file, the corresponding BPF program is executed, and depending on its return value the attempt succeeds or fails with -EPERM.

A BPF_CGROUP_DEVICE program receives a pointer to a bpf_cgroup_dev_ctx structure, which describes the attempted device access: the access type (mknod/read/write) and the device (type, major and minor numbers). If the program returns 0, the access fails with -EPERM; otherwise it succeeds.

An example BPF_CGROUP_DEVICE program can be found in the kernel source tree at tools/testing/selftests/bpf/dev_cgroup.c.
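As a hedged sketch of how such a program might be loaded and attached with bpftool (the object file, pin path and cgroup path are illustrative; the selftest source above would first be compiled with clang into dev_cgroup.o):

bpftool prog load dev_cgroup.o /sys/fs/bpf/dev_cgroup
bpftool cgroup attach /sys/fs/cgroup/container device pinned /sys/fs/bpf/dev_cgroup
bpftool cgroup show /sys/fs/cgroup/container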


5.7 RDMA


The “rdma” controller regulates the distribution and accounting of RDMA resources.

5.7.1 RDMA Interface Files
  • rdma.max
    A read-write nested-keyed file that exists for all the cgroups except root. It describes the current configured resource limits for an RDMA/IB device.
    Lines are keyed by device name and are not ordered. Each line contains space separated resource name and its configured limit that can be distributed.
    The following nested keys are defined.
Key Description
hca_handle Maximum number of HCA Handles
hca_object Maximum number of HCA Objects
  • An example for mlx4 and ocrdma device follows:
mlx4_0 hca_handle=2 hca_object=2000
ocrdma1 hca_handle=3 hca_object=max
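    Limits of the form shown above are configured by writing lines of the same format to rdma.max, for example (device names follow whatever the RDMA stack reports on the system):
echo "mlx4_0 hca_handle=2 hca_object=2000" > rdma.max
echo "ocrdma1 hca_handle=3" > rdma.max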
  • rdma.current
    A read-only file that describes current resource usage. It exists for all the cgroup except root.
    An example for mlx4 and ocrdma device follows:
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23

5.8 HugeTLB


The HugeTLB controller allows limiting HugeTLB usage per control group and enforces the limit at page-fault time.

5.8.1 HugeTLB Interface Files
  • hugetlb.<hugepagesize>.current
    Show current usage for “hugepagesize” hugetlb. It exists for all the cgroup except root.
  • hugetlb.<hugepagesize>.max
    Set/show the hard limit of “hugepagesize” hugetlb usage. The default value is “max”. It exists for all the cgroup except root.
  • hugetlb.<hugepagesize>.events
    A read-only flat-keyed file which exists on non-root cgroups.
  • max
    The number of allocation failures due to the HugeTLB limit
  • hugetlb.<hugepagesize>.events.local
    Similar to hugetlb.<hugepagesize>.events but the fields in the file are local to the cgroup, i.e. not hierarchical. The file-modified event generated on this file reflects only the local events.
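    A usage sketch, assuming 2MB huge pages and that the hugetlb controller is enabled for the cgroup (the limit value is illustrative and given in bytes):
echo 1073741824 > hugetlb.2MB.max    # allow up to 1 GiB of 2MB huge pages
cat hugetlb.2MB.current              # current usage in bytes
cat hugetlb.2MB.events               # "max N": allocation failures due to the limit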


5.9 Misc

5.9.1 perf_event

perf_event controller, if not mounted on a legacy hierarchy, is automatically enabled on the v2 hierarchy so that perf events can always be filtered by cgroup v2 path. The controller can still be moved to a legacy hierarchy after v2 hierarchy is populated.


5.10 Non-normative Information


This section contains information that isn’t considered to be a part of the stable kernel API and so is subject to change.

5.10.1 CPU controller root cgroup process behaviour

When distributing CPU cycles in the root cgroup each thread in this cgroup is treated as if it was hosted in a separate child cgroup of the root cgroup. This child cgroup weight is dependent on its thread nice level.

For details of this mapping see sched_prio_to_weight array in kernel/sched/core.c file (values from this array should be scaled appropriately so the neutral - nice 0 - value is 100 instead of 1024).
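For example, using the sched_prio_to_weight values found in mainline kernels (the exact numbers are an assumption about the kernel version): a nice 0 thread has weight 1024 and maps to 100, a nice 10 thread (weight 110) behaves roughly like a sibling cgroup with cpu.weight ≈ 110 / 1024 × 100 ≈ 11, and a nice -10 thread (weight 9548) corresponds to roughly 932.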

5.10.2 IO controller root cgroup process behaviour

Root cgroup processes are hosted in an implicit leaf child node. When distributing IO resources this implicit child node is taken into account as if it was a normal child cgroup of the root cgroup with a weight value of 200.


6 Namespaces


In container setups, cgroups together with other namespaces are used to isolate processes, but the /proc/$PID/cgroup file may leak system-level information to those isolated processes. For example:

$ cat /proc/self/cgroup
0::/batchjobs/container_id1 # <-- absolute cgroup path; system-level information that should not be exposed to the isolated process

For this reason the cgroup namespace was introduced, abbreviated below as cgroupns (just as the network namespace is abbreviated as netns).


6.1 Basics


The cgroup namespace provides a mechanism to virtualize the view of the /proc/$PID/cgroup file and of cgroup mounts. A new cgroup namespace is created with the clone(2) or unshare(2) system call using the CLONE_NEWCGROUP clone flag. For a process running inside the new namespace, the output of /proc/$PID/cgroup is restricted to the cgroupns root, which is the cgroup of the process at the time the cgroup namespace was created.
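For instance, util-linux unshare can create such a namespace directly (assuming a reasonably recent util-linux; combining it with a new mount namespace, as below, is the common pattern so that a private cgroupfs can be mounted afterwards):

unshare --cgroup --mount bash
cat /proc/self/cgroup    # now reports 0::/ relative to the new cgroupns root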

Without cgroup namespaces, the /proc/$PID/cgroup file shows the complete path of the cgroup a process belongs to. When setting up a container, for example, a set of cgroups and namespaces is used to isolate its processes, yet /proc/$PID/cgroup may still leak system-level information to those isolated processes. For example:

# cat /proc/self/cgroup
0::/batchjobs/container_id1

The path /batchjobs/container_id1 is considered system data and should not be exposed to the isolated processes. The cgroup namespace is the mechanism for hiding such path information. For example, before a cgroup namespace is created:

# ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
# cat /proc/self/cgroup
0::/batchjobs/container_id1

After unsharing into a new cgroup namespace, only the root path is visible:

# ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
# cat /proc/self/cgroup
0::/

For a multi-threaded process, if any thread unshares into a new cgroupns, the whole process (all threads) enters the new cgroupns. This is natural for the v2 hierarchy, but for v1 it may be undesirable behavior.

Lifetime of a cgroup namespace: the namespace stays alive as long as either of the following holds:

  • processes remain alive inside the namespace
  • mounted filesystems keep objects pinned to this cgroup namespace

When the last of these goes away (the last process exits and the pinning filesystems are unmounted), the cgroup namespace is destroyed, but the cgroupns root and the real cgroups continue to exist.


6.2 The Root and Views


The cgroupns root of a cgroup namespace is the cgroup in which the process calling unshare(2) resides. For example, if a process in the cgroup /batchjobs/container_id1 calls unshare, /batchjobs/container_id1 becomes the cgroupns root. For the init_cgroup_ns, this is the real root (/) cgroup.

The cgroupns root cgroup does not change even if the namespace-creating process later moves to a different cgroup:

# ~/unshare -c # unshare the cgroup namespace in some cgroup
# cat /proc/self/cgroup
0::/
# mkdir sub_cgrp_1
# echo 0 > sub_cgrp_1/cgroup.procs
# cat /proc/self/cgroup
0::/sub_cgrp_1

Each process gets its own namespace-specific view, which can be seen through /proc/$PID/cgroup.

Processes running inside the cgroup namespace can only see cgroup paths inside their root cgroup. From within an unshared cgroup namespace:

# sleep 100000 &
[1] 7353
# echo 7353 > sub_cgrp_1/cgroup.procs
# cat /proc/7353/cgroup
0::/sub_cgrp_1

From the initial cgroup namespace, the real cgroup path is visible:

$ cat /proc/7353/cgroup
0::/batchjobs/container_id1/sub_cgrp_1

From a sibling cgroup namespace (that is, a namespace rooted at a different cgroup), the cgroup path is shown relative to that namespace's own root. For instance, if the cgroupns root of process 7353 is /batchjobs/container_id2, it will see:

# cat /proc/7353/cgroup
0::/../container_id2/sub_cgrp_1

Note that the relative path always starts with /, to indicate that it is relative to the cgroupns root of the caller.


6.3 Migrating Processes Between cgroup Namespaces


Processes inside a cgroup namespace can move into and out of the namespace root if they have proper access to external cgroups. For example, assume the namespace root is at /batchjobs/container_id1, and that the global hierarchy is still accessible from inside the namespace:

# cat /proc/7353/cgroup
0::/sub_cgrp_1
# echo 7353 > batchjobs/container_id2/cgroup.procs
# cat /proc/7353/cgroup
0::/../container_id2

Note that this kind of setup is discouraged. A task inside a cgroup namespace should only be exposed to its own cgroupns hierarchy.

setns(2) can also be used to move a process into another cgroup namespace, provided that:

  • the process has CAP_SYS_ADMIN against its current user namespace
  • the process has CAP_SYS_ADMIN against the user namespace that owns the target cgroup namespace

No implicit cgroup change happens when attaching a process to another cgroup namespace; it is expected that the attaching process is moved to the root of the target cgroup namespace.


6.4 Interaction with Other Namespaces


A namespace-specific cgroup hierarchy can be mounted by a process running inside a non-init cgroup namespace:

# mount -t <fstype> <device> <dir>
$ mount -t cgroup2 none $MOUNT_POINT

This mounts the default unified cgroup hierarchy with the cgroupns root as the filesystem root. The operation requires CAP_SYS_ADMIN against the user and mount namespaces of the calling process.

The virtualization of the /proc/self/cgroup file, combined with restricting the view of the cgroup hierarchy via namespace-private cgroupfs mounts, provides an isolated cgroup view inside a container.


7 Information on Kernel Programming


This section contains kernel programming information in the areas where interacting with cgroup is necessary. cgroup core and controllers are not covered.

7.1 Filesystem Support for Writeback


A filesystem can support cgroup writeback by updating address_space_operations->writepages to annotate bio’s using the following two functions.

  • wbc_init_bio(@wbc, @bio)
    Should be called for each bio carrying writeback data and associates the bio with the inode’s owner cgroup and the corresponding request queue. This must be called after a queue (device) has been associated with the bio and before submission.
  • wbc_account_cgroup_owner(@wbc, @page, @bytes)
    Should be called for each data segment being written out. While this function doesn’t care exactly when it’s called during the writeback session, it’s the easiest and most natural to call it as data segments are added to a bio.

With writeback bio’s annotated, cgroup support can be enabled per super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for selective disabling of cgroup writeback support which is helpful when certain filesystem features, e.g. journaled data mode, are incompatible.

wbc_init_bio() binds the specified bio to its cgroup. Depending on the configuration, the bio may be executed at a lower priority and if the writeback session is holding shared resources, e.g. a journal entry, may lead to priority inversion. There is no one easy solution for the problem. Filesystems can try to work around specific problem cases by skipping wbc_init_bio() and using bio_associate_blkg() directly.


8 Deprecated v1 Core Features


  • Multiple hierarchies, including named ones, are not supported.
  • None of the v1 mount options are supported.
  • The “tasks” file is removed; “cgroup.procs” is used instead and is not sorted.
  • The “cgroup.clone_children” file is removed.
  • /proc/cgroups is meaningless for v2. Use the “cgroup.controllers” file at the root instead.


9 Issues with v1 and Rationales for v2


9.1 Issues with Multiple Hierarchies


cgroup v1 allowed an arbitrary number of hierarchies, and each hierarchy could host any number of controllers. While this seemed to offer a high level of flexibility, it wasn't useful in practice.

For example, as there is only one instance of each controller, utility type controllers such as freezer which can be useful in all hierarchies could only be used in one. The issue is exacerbated by the fact that controllers couldn’t be moved to another hierarchy once hierarchies were populated. Another issue was that all controllers bound to a hierarchy were forced to have exactly the same view of the hierarchy. It wasn’t possible to vary the granularity depending on the specific controller.

In practice, these issues heavily limited which controllers could be put on the same hierarchy and most configurations resorted to putting each controller on its own hierarchy. Only closely related ones, such as the cpu and cpuacct controllers, made sense to be put on the same hierarchy. This often meant that userland ended up managing multiple similar hierarchies repeating the same steps on each hierarchy whenever a hierarchy management operation was necessary.

Furthermore, support for multiple hierarchies came at a steep cost. It greatly complicated cgroup core implementation but more importantly the support for multiple hierarchies restricted how cgroup could be used in general and what controllers were able to do.

There was no limit on how many hierarchies there might be, which meant that a thread’s cgroup membership couldn’t be described in finite length. The key might contain any number of entries and was unlimited in length, which made it highly awkward to manipulate and led to addition of controllers which existed only to identify membership, which in turn exacerbated the original problem of proliferating number of hierarchies.

Also, as a controller couldn’t have any expectation regarding the topologies of hierarchies other controllers might be on, each controller had to assume that all other controllers were attached to completely orthogonal hierarchies. This made it impossible, or at least very cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are completely orthogonal to each other isn’t necessary. What usually is called for is the ability to have differing levels of granularity depending on the specific controller. In other words, hierarchy may be collapsed from leaf towards root when viewed from specific controllers. For example, a given configuration might not care about how memory is distributed beyond a certain level while still wanting to control how CPU cycles are distributed.


9.2 Thread Granularity


cgroup v1 allowed threads of a process to belong to different cgroups. This didn’t make sense for some controllers and those controllers ended up implementing different ways to ignore such situations but much more importantly it blurred the line between API exposed to individual applications and system management interface.

Generally, in-process knowledge is available only to the process itself; thus, unlike service-level organization of processes, categorizing threads of a process requires active participation from the application which owns the target process.

cgroup v1 had an ambiguously defined delegation model which got abused in combination with thread granularity. cgroups were delegated to individual applications so that they can create and manage their own sub-hierarchies and control resource distributions along them. This effectively raised cgroup to the status of a syscall-like API exposed to lay programs.

First of all, cgroup has a fundamentally inadequate interface to be exposed this way. For a process to access its own knobs, it has to extract the path on the target hierarchy from /proc/self/cgroup, construct the path by appending the name of the knob to the path, open and then read and/or write to it. This is not only extremely clunky and unusual but also inherently racy. There is no conventional way to define transaction across the required steps and nothing can guarantee that the process would actually be operating on its own sub-hierarchy.

cgroup controllers implemented a number of knobs which would never be accepted as public APIs because they were just adding control knobs to system-management pseudo filesystem. cgroup ended up with interface knobs which were not properly abstracted or refined and directly revealed kernel internal details. These knobs got exposed to individual applications through the ill-defined delegation mechanism effectively abusing cgroup as a shortcut to implementing public APIs without going through the required scrutiny.

This was painful for both userland and kernel. Userland ended up with misbehaving and poorly abstracted interfaces and kernel exposing and locked into constructs inadvertently.


9.3 Competition Between Inner Nodes and Threads


cgroup v1 allowed threads to be in any cgroups which created an interesting problem where threads belonging to a parent cgroup and its children cgroups competed for resources. This was nasty as two different types of entities competed and there was no obvious way to settle it. Different controllers did different things.

The cpu controller considered threads and cgroups as equivalents and mapped nice levels to cgroup weights. This worked for some cases but fell flat when children wanted to be allocated specific ratios of CPU cycles and the number of internal threads fluctuated - the ratios constantly changed as the number of competing entities fluctuated. There also were other issues. The mapping from nice level to weight wasn’t obvious or universal, and there were various other knobs which simply weren’t available for threads.

The io controller implicitly created a hidden leaf node for each cgroup to host the threads. The hidden leaf had its own copies of all the knobs with leaf_ prefixed. While this allowed equivalent control over internal threads, it was with serious drawbacks. It always added an extra layer of nesting which wouldn’t be necessary otherwise, made the interface messy and significantly complicated the implementation.

The memory controller didn’t have a way to control what happened between internal tasks and child cgroups and the behavior was not clearly defined. There were attempts to add ad-hoc behaviors and knobs to tailor the behavior to specific workloads which would have led to problems extremely difficult to resolve in the long term.

Multiple controllers struggled with internal tasks and came up with different ways to deal with it; unfortunately, all the approaches were severely flawed and, furthermore, the widely different behaviors made cgroup as a whole highly inconsistent.

This clearly is a problem which needs to be addressed from cgroup core in a uniform way.


9.4 Other Interface Issues


cgroup v1 grew without oversight and developed a large number of idiosyncrasies and inconsistencies. One issue on the cgroup core side was how an empty cgroup was notified - a userland helper binary was forked and executed for each event. The event delivery wasn’t recursive or delegatable. The limitations of the mechanism also led to in-kernel event delivery filtering mechanism further complicating the interface.

Controller interfaces were problematic too. An extreme example is controllers completely ignoring hierarchical organization and treating all cgroups as if they were all located directly under the root cgroup. Some controllers exposed a large amount of inconsistent implementation details to userland.

There also was no consistency across controllers. When a new cgroup was created, some controllers defaulted to not imposing extra restrictions while others disallowed any resource usage until explicitly configured. Configuration knobs for the same type of control used widely differing naming schemes and formats. Statistics and information knobs were named arbitrarily and used different formats and units even in the same controller.

cgroup v2 establishes common conventions where appropriate and updates controllers so that they expose minimal and consistent interfaces.


9.5 Controller Issues and Remedies


9.5.1 Memory

The original lower boundary, the soft limit, is defined as a limit that is per default unset. As a result, the set of cgroups that global reclaim prefers is opt-in, rather than opt-out. The costs for optimizing these mostly negative lookups are so high that the implementation, despite its enormous size, does not even provide the basic desirable behavior. First off, the soft limit has no hierarchical meaning. All configured groups are organized in a global rbtree and treated like equal peers, regardless where they are located in the hierarchy. This makes subtree delegation impossible. Second, the soft limit reclaim pass is so aggressive that it not just introduces high allocation latencies into the system, but also impacts system performance due to overreclaim, to the point where the feature becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated reserve. A cgroup enjoys reclaim protection when it’s within its effective low, which makes delegation of subtrees possible. It also enjoys having reclaim pressure proportional to its overage when above its effective low.

The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. But this generally goes against the goal of making the most out of the available memory. The memory consumption of workloads varies during runtime, and that requires users to overcommit. But doing that with a strict upper limit requires either a fairly accurate prediction of the working set size or adding slack to the limit. Since working set size estimation is hard and error prone, and getting it wrong results in OOM kills, most users tend to err on the side of a looser limit and end up wasting precious resources.

The memory.high boundary on the other hand can be set much more conservatively. When hit, it throttles allocations by forcing them into direct reclaim to work off the excess, but it never invokes the OOM killer. As a result, a high boundary that is chosen too aggressively will not terminate the processes, but instead it will lead to gradual performance degradation. The user can monitor this and make corrections until the minimal memory footprint that still gives acceptable performance is found.

In extreme cases, with many concurrent allocations and a complete breakdown of reclaim progress within the group, the high boundary can be exceeded. But even then it’s mostly better to satisfy the allocation from the slack available in other groups or the rest of the system than killing the group. Otherwise, memory.max is there to limit this type of spillover and ultimately contain buggy or even malicious applications.

Setting the original memory.limit_in_bytes below the current usage was subject to a race condition, where concurrent charges could cause the limit setting to fail. memory.max on the other hand will first set the limit to prevent new charges, and then reclaim and OOM kill until the new limit is met - or the task writing to memory.max is killed.

The combined memory+swap accounting and limiting is replaced by real control over swap space.

The main argument for a combined memory+swap facility in the original cgroup design was that global or parental pressure would always be able to swap all anonymous memory of a child group, regardless of the child’s own (possibly untrusted) configuration. However, untrusted groups can sabotage swapping by other means - such as referencing its anonymous memory in a tight loop - and an admin can not assume full swappability when overcommitting untrusted jobs.

For trusted jobs, on the other hand, a combined counter is not an intuitive userspace interface, and it flies in the face of the idea that cgroup controllers should account and limit specific physical resources. Swap space is a resource like all others in the system, and that’s why unified hierarchy allows distributing it separately.
