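4 Other tools
4.1 seccomp-bpf.h
seccomp-bpf.h is a convenience header written by developers for building seccomp-bpf filters. It predefines macros for the most common tasks, such as validating the system architecture and allowing a given system call, as the excerpt below shows.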
    ...
    #define VALIDATE_ARCHITECTURE \
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, arch_nr), \
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARCH_NR, 1, 0), \
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)

    #define EXAMINE_SYSCALL \
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr)

    #define ALLOW_SYSCALL(name) \
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

    #define KILL_PROCESS \
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
    ...
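To make the macros less opaque, here is a minimal Python sketch of what BPF_STMT and BPF_JUMP expand to: one (code, jt, jf, k) tuple per instruction, matching the layout of struct sock_filter. The helper names bpf_stmt, bpf_jump, allow_syscall and pack are made up for this illustration, the opcode values are the classic BPF constants, and the packing assumes a little-endian host.

```python
import struct

# Classic BPF opcode fields (values as defined in <linux/bpf_common.h>).
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS = 0x00, 0x20
BPF_JEQ, BPF_K = 0x10, 0x00

SECCOMP_RET_KILL = 0x00000000
SECCOMP_RET_ALLOW = 0x7FFF0000

# Offsets into struct seccomp_data: syscall number at 0, architecture at 4.
SYSCALL_NR_OFFSET, ARCH_OFFSET = 0, 4


def bpf_stmt(code, k):
    """Hypothetical stand-in for BPF_STMT: an instruction without jumps."""
    return (code, 0, 0, k)


def bpf_jump(code, k, jt, jf):
    """Hypothetical stand-in for BPF_JUMP: jt/jf are relative skip counts."""
    return (code, jt, jf, k)


def allow_syscall(nr):
    """Mirror of ALLOW_SYSCALL: fall through to ALLOW on match, else skip it."""
    return [
        bpf_jump(BPF_JMP + BPF_JEQ + BPF_K, nr, 0, 1),
        bpf_stmt(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    ]


def pack(insn):
    """Serialise one instruction as struct sock_filter (u16, u8, u8, u32)."""
    code, jt, jf, k = insn
    return struct.pack("<HBBI", code, jt, jf, k)


if __name__ == "__main__":
    load_arch = bpf_stmt(BPF_LD + BPF_W + BPF_ABS, ARCH_OFFSET)
    print([hex(field) for field in load_arch])
    print(pack(load_arch).hex())
```

The first instruction of VALIDATE_ARCHITECTURE, built this way, has opcode 0x20 and operand 4, which is the same encoding a disassembler reports for a "load architecture" instruction.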
4.2 Example application (seccomp_policy.c)
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <assert.h>
    #include <linux/seccomp.h>
    #include <sys/prctl.h>
    #include "seccomp-bpf.h"

    void install_syscall_filter()
    {
        struct sock_filter filter[] = {
            /* Validate architecture. */
            VALIDATE_ARCHITECTURE,
            /* Grab the system call number. */
            EXAMINE_SYSCALL,
            /* List allowed syscalls. We add open() to the set of
               allowed syscalls by the strict policy, but not close(). */
            ALLOW_SYSCALL(rt_sigreturn),
    #ifdef __NR_sigreturn
            ALLOW_SYSCALL(sigreturn),
    #endif
            ALLOW_SYSCALL(exit_group),
            ALLOW_SYSCALL(exit),
            ALLOW_SYSCALL(read),
            ALLOW_SYSCALL(write),
            ALLOW_SYSCALL(open),
            KILL_PROCESS,
        };
        struct sock_fprog prog = {
            .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
            .filter = filter,
        };

        assert(prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0);
        assert(prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0);
    }

    int main(int argc, char **argv)
    {
        int output = open("output.txt", O_WRONLY);
        const char *val = "test";

        printf("Calling prctl() to set seccomp with filter...\n");
        install_syscall_filter();

        printf("Writing to an already open file...\n");
        write(output, val, strlen(val)+1);

        printf("Trying to open file for reading...\n");
        int input = open("output.txt", O_RDONLY);

        printf("Note that open() worked. However, close() will not\n");
        close(input);

        printf("You will not see this message--the process will be killed first\n");
    }
Output:
    $ ./seccomp_policy
    Calling prctl() to set seccomp with filter...
    Writing to an already open file...
    Trying to open file for reading...
    Note that open() worked. However, close() will not
    Bad system call
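4.3 seccomp-tools
An open-source tool for analysing seccomp. Project: https://github.com/david942j/seccomp-tools
Main features:
- Dump: automatically dumps the seccomp BPF from an executable
- Disasm: converts seccomp BPF into a human-readable format
- Asm: makes writing seccomp rules feel like writing code
- Emu: emulates seccomp rules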
Installation
    sudo apt install gcc ruby-dev
    gem install seccomp-tools
Usage
    $ seccomp-tools dump ./simple_syscall_seccomp
     line  CODE  JT   JF      K
    =================================
     0000: 0x20 0x00 0x00 0x00000004  A = arch
     0001: 0x15 0x00 0x05 0xc000003e  if (A != ARCH_X86_64) goto 0007
     0002: 0x20 0x00 0x00 0x00000000  A = sys_number
     0003: 0x35 0x00 0x01 0x40000000  if (A < 0x40000000) goto 0005
     0004: 0x15 0x00 0x02 0xffffffff  if (A != 0xffffffff) goto 0007
     0005: 0x15 0x01 0x00 0x0000003b  if (A == execve) goto 0007
     0006: 0x06 0x00 0x00 0x7fff0000  return ALLOW
     0007: 0x06 0x00 0x00 0x00000000  return KILL
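The output shows that the execve system call is disabled: line 0005 jumps to return KILL when the system call number equals execve, while every other system call on x86_64 reaches return ALLOW.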
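The control flow of the dumped filter can be checked by hand with a tiny interpreter. The sketch below is not seccomp-tools' own emu command; it is a hand-rolled Python emulator that understands only the four opcodes appearing in the dump (0x20 load, 0x15 jump-if-equal, 0x35 jump-if-greater-or-equal, 0x06 return). The function name run_filter and the FILTER table are illustrative, with the rows copied from the dump.

```python
AUDIT_ARCH_X86_64 = 0xC000003E
AUDIT_ARCH_I386 = 0x40000003
ALLOW, KILL = 0x7FFF0000, 0x00000000

# (code, jt, jf, k) rows copied from the seccomp-tools dump.
FILTER = [
    (0x20, 0, 0, 0x00000004),  # A = arch
    (0x15, 0, 5, 0xC000003E),  # if (A != ARCH_X86_64) goto 0007
    (0x20, 0, 0, 0x00000000),  # A = sys_number
    (0x35, 0, 1, 0x40000000),  # if (A < 0x40000000) goto 0005
    (0x15, 0, 2, 0xFFFFFFFF),  # if (A != 0xffffffff) goto 0007
    (0x15, 1, 0, 0x0000003B),  # if (A == execve) goto 0007
    (0x06, 0, 0, 0x7FFF0000),  # return ALLOW
    (0x06, 0, 0, 0x00000000),  # return KILL
]


def run_filter(prog, nr, arch):
    """Run prog against a simplified seccomp_data holding only nr and arch."""
    data = {0: nr, 4: arch}  # offset -> 32-bit word
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        if code == 0x20:    # BPF_LD | BPF_W | BPF_ABS
            acc = data[k]
            pc += 1
        elif code == 0x15:  # BPF_JMP | BPF_JEQ | BPF_K
            pc += 1 + (jt if acc == k else jf)
        elif code == 0x35:  # BPF_JMP | BPF_JGE | BPF_K
            pc += 1 + (jt if acc >= k else jf)
        elif code == 0x06:  # BPF_RET | BPF_K
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


if __name__ == "__main__":
    for name, nr in (("write", 1), ("execve", 59)):
        verdict = run_filter(FILTER, nr, AUDIT_ARCH_X86_64)
        print(name, "ALLOW" if verdict == ALLOW else "KILL")
```

Jump offsets are relative to the next instruction, which is why the row at 0001 with jf=5 lands on 0007.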
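5 Using seccomp to secure Docker
Seccomp is used by many applications to protect the system. Docker supports seccomp for restricting the system calls a container may make, provided the kernel was built with CONFIG_SECCOMP enabled.
    $ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
    CONFIG_SECCOMP=y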
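When a container is started with docker run, Docker applies its default seccomp profile to restrict the container. The default profile is written in JSON and disables around 44 of the 300+ system calls; its source can be found in the Moby project.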
    $ sudo docker run --rm -it ubuntu /bin/bash
    root@9e271f2056bd:/#
    root@9e271f2056bd:/# bash
    root@9e271f2056bd:/# ps
      PID TTY          TIME CMD
        1 pts/0    00:00:00 bash
       10 pts/0    00:00:00 bash
       13 pts/0    00:00:00 ps
    root@9e271f2056bd:/# grep -i seccomp /proc/1/status
    Seccomp:        2
    Seccomp_filters:        1
    root@9e271f2056bd:/#
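Docker's default profile is deliberately permissive so that it stays compatible with as many applications as possible. Beyond the default, Docker lets us supply a custom profile to restrict a container's system calls more flexibly.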
5.1 Example 1: allow specific system calls with a whitelist
The file is named example.json:
    {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [
            "SCMP_ARCH_X86_64",
            "SCMP_ARCH_X86",
            "SCMP_ARCH_X32"
        ],
        "syscalls": [
            {
                "names": [
                    "arch_prctl",
                    "sched_yield",
                    "futex",
                    "write",
                    "mmap",
                    "exit_group",
                    "madvise",
                    "rt_sigprocmask",
                    "getpid",
                    "gettid",
                    "tgkill",
                    "rt_sigaction",
                    "read",
                    "getpgrp"
                ],
                "action": "SCMP_ACT_ALLOW",
                "args": [],
                "comment": "",
                "includes": {},
                "excludes": {}
            }
        ]
    }
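Explanation:
- defaultAction: the default seccomp action. The possible values were described earlier; the most commonly used are SCMP_ACT_ALLOW and SCMP_ACT_ERRNO. SCMP_ACT_ERRNO is chosen here, so every system call is denied by default and only the whitelisted ones are permitted.
- architectures: the target architectures; the available system calls may differ between architectures.
- syscalls: the system calls and the action applied to them. names lists the system call names, action is the action to take (here, allow the calls listed in names), and args holds conditions on the system call arguments and may be empty.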
With the file in place, pass it to docker run through the --security-opt option to customise the system calls available to the container.
    $ docker run --rm -it --security-opt seccomp=/path/to/seccomp/example.json hello-world
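As a rough mental model of how such a profile is read, the Python sketch below looks up the action for a system call name: the first rule whose list contains the name wins, otherwise defaultAction applies. This is a simplification written for illustration, not Docker's or libseccomp's actual matching logic; it ignores args, includes and excludes, and the function name action_for is made up. It accepts both a names list and a single name key, since both spellings appear in profiles.

```python
import json

# A trimmed-down whitelist in the same shape as example.json.
PROFILE_JSON = """
{
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["read", "write", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
            "args": []
        }
    ]
}
"""


def action_for(profile, syscall):
    """Return the action a profile assigns to a system call name.

    Simplified model: argument conditions are not evaluated.
    """
    for rule in profile.get("syscalls", []):
        names = list(rule.get("names", []))
        if "name" in rule:  # single-name form of a rule
            names.append(rule["name"])
        if syscall in names:
            return rule["action"]
    return profile["defaultAction"]


if __name__ == "__main__":
    profile = json.loads(PROFILE_JSON)
    for call in ("write", "mkdir"):
        print(call, "->", action_for(profile, call))
```

The same function works for a blacklist: with defaultAction set to SCMP_ACT_ALLOW, only the names listed under an SCMP_ACT_ERRNO rule are rejected.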
5.2 Example 2: stop a container from creating directories by blacklisting the mkdir system call
The file is named seccomp_mkdir.json:
    {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "name": "mkdir",
                "action": "SCMP_ACT_ERRNO",
                "args": []
            }
        ]
    }

    $ sudo docker run --rm -it --security-opt seccomp=seccomp_mkdir.json busybox /bin/sh
    Unable to find image 'busybox:latest' locally
    latest: Pulling from library/busybox
    405fecb6a2fa: Pull complete
    Digest: sha256:fcd85228d7a25feb59f101ac3a955d27c80df4ad824d65f5757a954831450185
    Status: Downloaded newer image for busybox:latest
    / #
    / # ls
    bin   dev   etc   home  proc  root  sys   tmp   usr   var
    / # mkdir test
    mkdir: can't create directory 'test': Operation not permitted
    / #
A container can also be started with no seccomp policy at all by adding --security-opt seccomp=unconfined to the launch options.
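5.3 zaz
zaz seccomp is an open-source tool that automatically generates JSON seccomp profiles for containers. Project: https://github.com/pjbgf/zaz
Basic usage:
    zaz seccomp docker IMAGE COMMAND
It tailors the system calls to a specific executable, allowing only the operations that executable needs and denying everything else.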
For example, to generate a seccomp profile for the ping command in alpine:
    $ sudo ./zaz seccomp docker alpine "ping -c5 8.8.8.8" > seccomp_ping.json
    $ cat seccomp_ping.json | jq '.'
    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
      ],
      "syscalls": [
        {
          "names": [
            "arch_prctl", "bind", "clock_gettime", "clone", "close",
            "connect", "dup2", "epoll_pwait", "execve", "exit",
            "exit_group", "fcntl", "futex", "getpid", "getsockname",
            "getuid", "ioctl", "mprotect", "nanosleep", "open",
            "poll", "read", "recvfrom", "rt_sigaction", "rt_sigprocmask",
            "rt_sigreturn", "sendto", "set_tid_address", "setitimer", "setsockopt",
            "socket", "write", "writev"
          ],
          "action": "SCMP_ACT_ALLOW"
        }
      ]
    }
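As shown, zaz detected 33 system calls and emitted them as a whitelist. Does this whitelist filter system calls well? Does it let ping run while blocking commands that need system calls outside the list? To test this, first build a simple image from the following Dockerfile.
    # Dockerfile
    FROM alpine:latest
    CMD ["ping","-c5","8.8.8.8"]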
After the build succeeds, starting the container with the default seccomp policy works without any problem.
    $ sudo docker build -t pingtest .
    $ sudo docker run --rm -it pingtest
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    64 bytes from 8.8.8.8: seq=0 ttl=127 time=42.139 ms
    64 bytes from 8.8.8.8: seq=1 ttl=127 time=42.646 ms
    64 bytes from 8.8.8.8: seq=2 ttl=127 time=42.098 ms
    64 bytes from 8.8.8.8: seq=3 ttl=127 time=42.484 ms
    64 bytes from 8.8.8.8: seq=4 ttl=127 time=42.007 ms

    --- 8.8.8.8 ping statistics ---
    5 packets transmitted, 5 packets received, 0% packet loss
    round-trip min/avg/max = 42.007/42.274/42.646 ms
Next, try the policy generated by zaz.
    $ sudo docker run --rm -it --security-opt seccomp=seccomp_ping.json pingtest
    docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: close exec fds: open /proc/self/fd: operation not permitted: unknown.
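The container failed to start. The error was raised while the OCI runtime was creating it, and the reason given is operation not permitted. As mentioned earlier, this error means that a required system call has been blocked. zaz's whitelist approach may not be robust enough: Docker has gone through many releases while zaz has seen little maintenance, so the set of system calls it captures is probably incomplete and container startup breaks. Oddly, when I ran the same command again, it triggered a panic instead: No error following JSON procError payload.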
    $ sudo docker run --rm -it --security-opt seccomp=seccomp_ping.json pingtest
    docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc did not terminate successfully: exit status 2: panic: No error following JSON procError payload.

    goroutine 1 [running]:
    github.com/opencontainers/runc/libcontainer.parseSync(0x56551adf30b8, 0xc000010b20, 0xc0002268a0, 0xc00027f9e0, 0x0)
            github.com/opencontainers/runc/libcontainer/sync.go:93 +0x307
    github.com/opencontainers/runc/libcontainer.(*initProcess).start(0xc000297cb0, 0x0, 0x0)
            github.com/opencontainers/runc/libcontainer/process_linux.go:440 +0x5ef
    github.com/opencontainers/runc/libcontainer.(*linuxContainer).start(0xc000078700, 0xc000209680, 0x0, 0x0)
            github.com/opencontainers/runc/libcontainer/container_linux.go:379 +0xf5
    github.com/opencontainers/runc/libcontainer.(*linuxContainer).Start(0xc000078700, 0xc000209680, 0x0, 0x0)
            github.com/opencontainers/runc/libcontainer/container_linux.go:264 +0xb4
    main.(*runner).run(0xc0002274c8, 0xc0000200f0, 0x0, 0x0, 0x0)
            github.com/opencontainers/runc/utils_linux.go:312 +0xd2a
    main.startContainer(0xc00025c160, 0xc000076400, 0x1, 0x0, 0x0, 0xc0002275b8, 0x6)
            github.com/opencontainers/runc/utils_linux.go:455 +0x455
    main.glob..func2(0xc00025c160, 0xc000246000, 0xc000246120)
            github.com/opencontainers/runc/create.go:65 +0xbb
    github.com/urfave/cli.HandleAction(0x56551ad3b040, 0x56551ade81e8, 0xc00025c160, 0xc00025c160, 0x0)
            github.com/urfave/cli@v1.22.1/app.go:523 +0x107
    github.com/urfave/cli.Command.Run(0x56551aa566f5, 0x6, 0x0, 0x0, 0x0, 0x0, 0x0, 0x56551aa5f509, 0x12, 0x0, ...)
            github.com/urfave/cli@v1.22.1/command.go:174 +0x579
    github.com/urfave/cli.(*App).Run(0xc000254000, 0xc000132000, 0xf, 0xf, 0x0, 0x0)
            github.com/urfave/cli@v1.22.1/app.go:276 +0x7e8
    main.main()
            github.com/opencontainers/runc/main.go:163 +0xd3f
    : unknown.
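This panic probably should not happen. I searched online for related reports and found very few similar cases, and the panic does not occur on every run. The normal outcome is operation not permitted, which is caused by the whitelist not covering every required system call. I have reported the behaviour as a Moby issue and may get some answers there.
A similar panic report: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=1714183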
Whichever error appears, the failure seems to be in runc. To solve it, we need to understand how Docker actually loads seccomp at runtime.
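When a container is created, the Docker daemon dockerd asks containerd to create it. containerd does not operate on the container directly either. Instead it spawns a process called containerd-shim and lets that process manage the container. containerd-shim then invokes the container runtime runc through the OCI interface to start the container. Once the container is running, runc exits and containerd-shim becomes the parent of the container process. It collects the state of the container process, reports it to containerd, and, after the process with PID 1 in the container exits, takes over and reaps the container's remaining child processes so that no zombies are left behind. The call order is therefore:
    Dockerd -> containerd -> containerd-shim -> runc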
Start an ubuntu container and run a second bash inside it:
    $ sudo docker run --rm -it ubuntu /bin/bash
    root@ef57fff95b80:/# bash
    root@ef57fff95b80:/# ps
      PID TTY          TIME CMD
        1 pts/0    00:00:00 bash
        9 pts/0    00:00:00 bash
       12 pts/0    00:00:00 ps
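Looking at the process tree from the host, containerd-shim (PID 28051 and its threads, 28052 through 28129) has no seccomp applied, while the two bash processes inside the container (container PID 1 is host PID 28075; container PID 9 is host PID 28126) run under a seccomp filter.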
    # pstree -p | grep containerd-shim
            |-containerd-shim(28051)-+-bash(28075)---bash(28126)
            |                        |-{containerd-shim}(28052)
            |                        |-{containerd-shim}(28053)
            |                        |-{containerd-shim}(28054)
            |                        |-{containerd-shim}(28055)
            |                        |-{containerd-shim}(28056)
            |                        |-{containerd-shim}(28057)
            |                        |-{containerd-shim}(28058)
            |                        |-{containerd-shim}(28059)
            |                        |-{containerd-shim}(28060)
            |                        `-{containerd-shim}(28129)
    # grep -i seccomp /proc/28051/status
    Seccomp:        0
    # grep -i seccomp /proc/28075/status
    Seccomp:        2
    # grep -i seccomp /proc/28126/status
    Seccomp:        2
    # grep -i seccomp /proc/28052/status
    Seccomp:        0
    ...
    # grep -i seccomp /proc/28129/status
    Seccomp:        0
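The manual grep over /proc/PID/status can be scripted. The sketch below, with the made-up helper name seccomp_mode, parses the text of a status file and maps the Seccomp field to its meaning (0 disabled, 1 strict, 2 filter). It works on a string so it can be tried without a live process; reading the real file is shown under the main guard and only makes sense on Linux.

```python
import os

SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Return the seccomp mode named in the text of /proc/<pid>/status.

    Returns None when the text contains no Seccomp line.
    """
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        # Exact key match, so "Seccomp_filters" is not mistaken for "Seccomp".
        if key == "Seccomp":
            return SECCOMP_MODES.get(int(value.strip()), "unknown")
    return None


if __name__ == "__main__":
    path = f"/proc/{os.getpid()}/status"
    if os.path.exists(path):
        with open(path) as status_file:
            print(path, "->", seccomp_mode(status_file.read()))
```

Run against the PIDs from the process tree, this would report "disabled" for containerd-shim and "filter" for the two bash processes, matching the grep output.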
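In other words, seccomp is applied to the container after containerd-shim has started, and the failure occurs when runc is invoked. Should our seccomp policy therefore also cover the system calls that runc itself needs? Did zaz take into account the system calls runc requires while starting the container?
Answering this requires capturing the system calls runc makes during container startup.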
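5.4 Sysdig
There are many ways to find out which system calls runc uses at runtime, such as ftrace, strace and fanotify. Here sysdig is used to monitor the container. sysdig is a system visibility tool with native container support. Project: https://github.com/draios/sysdig. Detailed installation and usage instructions are available on GitHub, so only a brief introduction is given here.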
Once installed, running sysdig on the command line with no arguments captures every event and writes it to standard output:
    $ sysdig
    285304 01:21:51.270700399 7 sshd (50485) > select
    285306 01:21:51.270701716 7 sshd (50485) < select res=2
    285307 01:21:51.270701982 7 sshd (50485) > rt_sigprocmask
    285308 01:21:51.270702258 7 sshd (50485) < rt_sigprocmask
    285309 01:21:51.270702473 7 sshd (50485) > rt_sigprocmask
    285310 01:21:51.270702660 7 sshd (50485) < rt_sigprocmask
    285312 01:21:51.270702983 7 sshd (50485) > read fd=13(<f>/dev/ptmx) size=16384
    285313 01:21:51.270703971 1 sysdig (59131) > switch next=59095 pgft_maj=0 pgft_min=1759 vm_size=280112 vm_rss=18048 vm_swap=0
    ...
By default, sysdig prints each event on one line in the following format:
    %evt.num %evt.time %evt.cpu %proc.name (%thread.tid) %evt.dir %evt.type %evt.args
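where:
- evt.num is the incremental event number
- evt.time is the event timestamp
- evt.cpu is the number of the CPU on which the event was captured
- proc.name is the name of the process that generated the event
- thread.tid is the TID that generated the event, which equals the PID for single-threaded processes
- evt.dir is the event direction: > for enter events, < for exit events
- evt.type is the name of the event, e.g. "open" or "read"
- evt.args is the list of event arguments. For system calls these usually correspond to the system call arguments, but not always: some system call arguments are left out for simplicity or performance reasons.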
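The field layout above is regular enough to parse with one regular expression. The following Python sketch splits a line of default sysdig output into the eight fields; the function name parse_event and the regex are written for this illustration and assume the default format shown above (a custom -p format would need a different pattern).

```python
import re

# One named group per field of the default sysdig output format.
LINE_RE = re.compile(
    r"^(?P<num>\d+) (?P<time>\S+) (?P<cpu>\d+) (?P<proc>.+?) "
    r"\((?P<tid>\d+)\) (?P<dir>[<>]) (?P<type>\S+)(?: (?P<args>.*))?$"
)


def parse_event(line):
    """Parse one default-format sysdig line into a dict, or None on mismatch."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    for key in ("num", "cpu", "tid"):
        event[key] = int(event[key])
    event["args"] = event["args"] or ""  # events may carry no arguments
    return event


if __name__ == "__main__":
    sample = ("285312 01:21:51.270702983 7 sshd (50485) > read "
              "fd=13(<f>/dev/ptmx) size=16384")
    print(parse_event(sample))
```

Collecting event["type"] over all enter events gives the raw list of event names seen during a capture.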
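Open terminal A and start the capture with the command below. container.name restricts the capture to the container named ping, proc.name restricts it to processes named runc, and the events are saved to runc.scap. The two conditions are joined with the filter operator and, and the whole filter is quoted so that the shell does not interpret it.
    $ sysdig -w runc.scap "container.name=ping and proc.name=runc"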
Then start the container in another terminal, B:
    $ sudo docker run --rm -it --name=ping pingtest
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    64 bytes from 8.8.8.8: seq=0 ttl=127 time=44.032 ms
    64 bytes from 8.8.8.8: seq=1 ttl=127 time=42.069 ms
    64 bytes from 8.8.8.8: seq=2 ttl=127 time=42.066 ms
    64 bytes from 8.8.8.8: seq=3 ttl=127 time=42.073 ms
    64 bytes from 8.8.8.8: seq=4 ttl=127 time=42.112 ms

    --- 8.8.8.8 ping statistics ---
    5 packets transmitted, 5 packets received, 0% packet loss
    round-trip min/avg/max = 42.066/42.470/44.032 ms
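When it finishes, press Ctrl+C in terminal A to stop the capture. Then filter the capture so that only the system call names remain and redirect the result to runc_syscall.txt. This gives us the system calls runc used while starting the container.
    $ sysdig -p "%syscall.type" -r runc.scap > runc_syscall.txt
    $ cat -n runc_syscall.txt
    ...
      3437 rt_sigaction
      3438 exit_group
      3439 procexit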
The filtered output is still long and contains many repeated system calls. A simple script can deduplicate it; after filtering, 72 system calls remain.
    $ python analyse.py runc_syscall.txt
    Filter syscall num: 72
    filter syscall:['clone', 'close', 'prctl', 'getpid', 'write', 'unshare', 'read', 'exit_group', 'procexit', 'setsid', 'setuid', 'setgid', 'sched_getaffinity', 'openat', 'mmap', 'rt_sigprocmask', 'sigaltstack', 'gettid', 'rt_sigaction', 'mprotect', 'futex', 'set_robust_list', 'munmap', 'nanosleep', 'readlinkat', 'fcntl', 'epoll_create1', 'pipe', 'epoll_ctl', 'fstat', 'pread', 'getdents64', 'capget', 'epoll_pwait', 'newfstatat', 'statfs', 'getppid', 'keyctl', 'socket', 'bind', 'sendto', 'getsockname', 'recvfrom', 'mount', 'fchmodat', 'mkdirat', 'symlinkat', 'umask', 'mknodat', 'fchownat', 'unlinkat', 'chdir', 'fchdir', 'pivot_root', 'umount', 'dup', 'sethostname', 'fstatfs', 'seccomp', 'brk', 'fchown', 'setgroups', 'capset', 'execve', 'signaldeliver', 'access', 'arch_prctl', 'getuid', 'getgid', 'geteuid', 'getcwd', 'getegid']
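The article does not show analyse.py itself. A script with the same effect could look like the Python sketch below, which is an assumption about its behaviour rather than the author's code: it reads one name per line, drops blanks and repeats, and keeps the order of first appearance. The function name unique_syscalls is hypothetical, and the print format simply imitates the output shown above.

```python
import sys


def unique_syscalls(lines):
    """Return the distinct names in lines, in order of first appearance."""
    seen = set()
    ordered = []
    for raw in lines:
        name = raw.strip()
        if name and name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered


if __name__ == "__main__":
    # Usage: python analyse_sketch.py runc_syscall.txt
    if len(sys.argv) != 2:
        sys.exit("usage: analyse_sketch.py FILE")
    with open(sys.argv[1]) as capture:
        names = unique_syscalls(capture)
    print("Filter syscall num:", len(names))
    print("filter syscall:" + str(names))
```

Keeping first-appearance order, instead of sorting, preserves a rough picture of the sequence in which runc issued the calls.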
Merging the system calls generated by zaz with the ones we captured gives a list of 85 entries (socket is listed twice, so 84 names are unique). Note that procexit and signaldeliver are sysdig event names rather than system calls. The merged profile is as follows:
    {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [
            "SCMP_ARCH_X86_64",
            "SCMP_ARCH_X86",
            "SCMP_ARCH_X32"
        ],
        "syscalls": [
            {
                "names": [
                    "clone", "close", "prctl", "getpid", "write",
                    "unshare", "read", "exit_group", "procexit", "setsid",
                    "setuid", "setgid", "sched_getaffinity", "openat", "mmap",
                    "rt_sigprocmask", "sigaltstack", "gettid", "rt_sigaction", "mprotect",
                    "futex", "set_robust_list", "munmap", "nanosleep", "readlinkat",
                    "fcntl", "epoll_create1", "pipe", "epoll_ctl", "fstat",
                    "pread", "getdents64", "capget", "epoll_pwait", "newfstatat",
                    "statfs", "getppid", "keyctl", "socket", "bind",
                    "sendto", "getsockname", "recvfrom", "mount", "fchmodat",
                    "mkdirat", "symlinkat", "umask", "mknodat", "fchownat",
                    "unlinkat", "chdir", "fchdir", "pivot_root", "umount",
                    "dup", "sethostname", "fstatfs", "seccomp", "brk",
                    "fchown", "setgroups", "capset", "signaldeliver", "access",
                    "getuid", "getgid", "geteuid", "getcwd", "getegid",
                    "arch_prctl", "clock_gettime", "connect", "dup2", "execve",
                    "exit", "ioctl", "open", "poll", "rt_sigreturn",
                    "set_tid_address", "setitimer", "setsockopt", "socket", "writev"
                ],
                "action": "SCMP_ACT_ALLOW"
            }
        ]
    }
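Merging by hand is error-prone, so the step can also be scripted. The Python sketch below takes an order-preserving union of several name lists and wraps the result in a whitelist profile of the same shape as the one above. The helper names merge_names and build_profile are invented for this example, and the two short input lists are placeholders standing in for the zaz list and the captured runc list.

```python
import json

ARCHITECTURES = ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"]


def merge_names(*lists):
    """Union of several name lists, keeping first-appearance order."""
    seen = set()
    merged = []
    for names in lists:
        for name in names:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


def build_profile(names):
    """Build a deny-by-default profile that allows only the given names."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": list(ARCHITECTURES),
        "syscalls": [{"names": list(names), "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Placeholder inputs; the real lists are the ones shown in the text.
    runc_names = ["clone", "close", "socket", "openat"]
    zaz_names = ["socket", "open", "writev"]
    profile = build_profile(merge_names(runc_names, zaz_names))
    print(json.dumps(profile, indent=4))
```

Because a name is added only the first time it is seen, an entry that occurs in both inputs ends up in the profile once.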
Running the container again with this file succeeds:
    $ sudo docker run -it --rm --security-opt seccomp=seccomp_ping.json pingtest
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    64 bytes from 8.8.8.8: seq=0 ttl=127 time=43.424 ms
    64 bytes from 8.8.8.8: seq=1 ttl=127 time=42.873 ms
    64 bytes from 8.8.8.8: seq=2 ttl=127 time=42.336 ms
    64 bytes from 8.8.8.8: seq=3 ttl=127 time=48.164 ms
    64 bytes from 8.8.8.8: seq=4 ttl=127 time=42.260 ms

    --- 8.8.8.8 ping statistics ---
    5 packets transmitted, 5 packets received, 0% packet loss
    round-trip min/avg/max = 42.260/43.811/48.164 ms
Other commands, which need system calls that are missing from the whitelist, fail with Operation not permitted.
    $ sudo docker run -it --rm --security-opt seccomp=seccomp_ping.json pingtest ls
    ls: .: Operation not permitted
    $ sudo docker run -it --rm --security-opt seccomp=seccomp_ping.json pingtest mkdir test
    mkdir: can't create directory 'test': Operation not permitted
6 References
- BPF opcodes
- seccomp_rule_add
- seccomp and seccomp bpf
- seccomp overview
- The seccomp sandbox mechanism & 2019 ByteCTF VIP
- prctl(2) — Linux manual page
- seccomp-tools
- libseccomp
- docker seccomp
- Docker seccomp and OCI