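I'm not sure whether this is related to Kata. I'm new to both Kata and device passthrough with VFIO, so I'm having a hard time understanding the root cause.
I have a workstation with three Intel 82580 Gigabit network cards. Each card has 4 ports, giving 12 NICs/ports in total.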
pci@0000:01:00.0 enp1s0f0 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:01:00.1 enp1s0f1 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:01:00.2 enp1s0f2 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:01:00.3 enp1s0f3 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:02:00.0 enp2s0f0 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:02:00.1 enp2s0f1 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:02:00.2 enp2s0f2 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:02:00.3 enp2s0f3 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:03:00.0 enp3s0f0 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:03:00.1 enp3s0f1 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:03:00.2 enp3s0f2 network 82580 Gigabit Network Connection [8086:150E]
pci@0000:03:00.3 enp3s0f3 network 82580 Gigabit Network Connection [8086:150E]
I want to create 12 Kata containers, each of which gets one of the physical NICs through device passthrough.
This is how I configure device passthrough with VFIO:
$ NIC=enp1s0f0
$ BDF=$(sudo lshw -class network -businfo -numeric | grep ${NIC} | awk '{print $1;}' | cut -d@ -f2)
$ sudo echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
$ sudo lspci -n -s $BDF
0000:01:00.0 0200: 8086:150e (rev 01)
$ sudo echo 8086 150e | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
$ sudo echo 8086 150e | sudo tee /sys/bus/pci/drivers/vfio-pci/remove_id
Here is the resulting list of all the devices:
ll /dev/vfio/
total 0
drwxr-xr-x  2 root root      300 Mar 29 10:35 ./
drwxr-xr-x 18 root root     4340 Mar 29 10:35 ../
crw-------  1 root root 241,   0 Mar 29 10:35 10
crw-------  1 root root 241,   1 Mar 29 10:35 11
crw-------  1 root root 241,   2 Mar 29 10:35 12
crw-------  1 root root 241,   3 Mar 29 10:35 13
crw-------  1 root root 241,   4 Mar 29 10:35 14
crw-------  1 root root 241,   5 Mar 29 10:35 15
crw-------  1 root root 241,   6 Mar 29 10:35 16
crw-------  1 root root 241,   7 Mar 29 10:35 17
crw-------  1 root root 241,   8 Mar 29 10:35 18
crw-------  1 root root 241,   9 Mar 29 10:35 19
crw-------  1 root root 241,  10 Mar 29 10:35 20
crw-------  1 root root 241,  11 Mar 29 10:35 21
crw-rw-rw-  1 root root  10, 196 Mar 29 10:28 vfio
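Editor's note: the numbered entries under /dev/vfio/ are named after IOMMU groups, so it can help to confirm which group each NIC's BDF actually belongs to before passing a node to a container. The following is a minimal sketch, not part of the original post; the function names and the SYSFS_PCI override are hypothetical and exist only so the helpers can be tried against a fake directory tree.

```shell
# Sketch: map a PCI device (BDF) to its IOMMU group, i.e. to /dev/vfio/<group>.
# SYSFS_PCI is a hypothetical override for testing; by default the real sysfs path is used.

iommu_group_of() {
    bdf="$1"
    root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    link="$root/$bdf/iommu_group"
    # iommu_group is a symlink whose last path component is the group number.
    if [ ! -L "$link" ]; then
        echo "no iommu_group for $bdf (is the IOMMU enabled?)" >&2
        return 1
    fi
    basename "$(readlink "$link")"
}

# List every device that shares an IOMMU group with the given device.
iommu_group_members() {
    bdf="$1"
    root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    ls "$root/$bdf/iommu_group/devices"
}
```

For example, `iommu_group_of 0000:03:00.0` prints the group number for that port, and `iommu_group_members 0000:03:00.0` shows whether any other device sits in the same group.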
Then I start each Kata container (1-7):
sudo nerdctl run --cgroup-manager cgroupfs --runtime "io.containerd.kata.v2" --cap-add=CAP_NET_ADMIN -d --device /dev/vfio/11 --name tga1 ubuntu:latest sleep infinity
...
sudo nerdctl run --cgroup-manager cgroupfs --runtime "io.containerd.kata.v2" --cap-add=CAP_NET_ADMIN -d --device /dev/vfio/17 --name tga7 ubuntu:latest sleep infinity
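Editor's note: the commands elided above are not shown in the original post. The sketch below only illustrates one way such a sequence could be generated, under the assumption that container N uses /dev/vfio/(10+N) and is named tgaN, which matches the first and last commands shown. The function name kata_run_cmd is hypothetical, and the function prints the command rather than running it.

```shell
# Sketch: print (do not run) the nerdctl command for container index i.
# Assumption: VFIO group = first_group + i - 1 and container name = tga<i>.
kata_run_cmd() {
    i="$1"
    first_group="${2:-11}"
    group=$((first_group + i - 1))
    printf 'sudo nerdctl run --cgroup-manager cgroupfs --runtime "io.containerd.kata.v2" --cap-add=CAP_NET_ADMIN -d --device /dev/vfio/%s --name tga%s ubuntu:latest sleep infinity\n' "$group" "$i"
}
```

Running `for i in 1 2 3 4 5 6 7; do kata_run_cmd "$i"; done` prints the seven commands for review, so each one can be started and checked individually.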
So far everything works for Kata containers 1-7. But when I try to start Kata containers 8-12, it fails:
sudo nerdctl run --cgroup-manager cgroupfs --runtime "io.containerd.kata.v2" --cap-add=CAP_NET_ADMIN -d --device /dev/vfio/18 --name tga8 ubuntu:latest sleep infinity
FATA[0001] failed to create shim task: QMP command failed: Device 'vfio-638fa5a1eac4abed0' not found: not found
The containerd logs are as follows:
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.121740023Z" level=error msg="VFIO_MAP_DMA failed: Bad address" name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.182792456Z" level=error msg="failed to hotplug VFIO device" error="QMP command failed: Device 'vfio-638fa5a1eac4abed0' not found" name=containerd-shim-v2 pid=3629 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers subsystem=sandbox vfio-device-BDF="03:00.0" vfio-device-ID=vfio-638fa5a1eac4abed0
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.182952308Z" level=error msg="Failed to add device" error="QMP command failed: Device 'vfio-638fa5a1eac4abed0' not found" name=containerd-shim-v2 pid=3629 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers subsystem=device
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.183018247Z" level=error msg="container create failed" container=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 error="QMP command failed: Device 'vfio-638fa5a1eac4abed0' not found" name=containerd-shim-v2 pid=3629 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers subsystem=container
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.183155694Z" level=warning error="no such file or directory" name=containerd-shim-v2 pid=3629 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 share-dir=/run/kata-containers/shared/sandboxes/bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585/mounts/bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585/rootfs source=virtcontainers subsystem=mount
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.183248451Z" level=warning msg="Could not remove container share dir" error="no such file or directory" name=containerd-shim-v2 pid=3629 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 share-dir=/run/kata-containers/shared/sandboxes/bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585/mounts/bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers subsystem=fs_share
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.186362442Z" level=error msg="qemu-system-x86_64: Failed to write msg. Wrote -1 instead of 20." name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.187017225Z" level=error msg="qemu-system-x86_64: Failed to set msg fds." name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.187118664Z" level=error msg="qemu-system-x86_64: vhost VQ 0 ring restore failed: -22: Invalid argument (22)" name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.187205270Z" level=error msg="qemu-system-x86_64: Failed to set msg fds." name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.187560148Z" level=error msg="qemu-system-x86_64: vhost VQ 1 ring restore failed: -22: Invalid argument (22)" name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.187841618Z" level=error msg="qemu-system-x86_64: Failed to set msg fds." name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.187930071Z" level=error msg="qemu-system-x86_64: vhost_set_vring_call failed: Invalid argument (22)" name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.188174089Z" level=error msg="qemu-system-x86_64: Failed to set msg fds." name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.188280375Z" level=error msg="qemu-system-x86_64: vhost_set_vring_call failed: Invalid argument (22)" name=containerd-shim-v2 pid=3629 qemuPid=3640 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=virtcontainers/hypervisor subsystem=qemu
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.407737009Z" level=warning msg="failed to cleanup network" error="failed to get netns /var/run/netns/cnitest-c9473abe-8a0e-c3db-97fc-587148b1e378: failed to Statfs "/var/run/netns/cnitest-c9473abe-8a0e-c3db-97fc-587148b1e378": no such file or directory" id=/var/run/netns/cnitest-c9473abe-8a0e-c3db-97fc-587148b1e378 name=containerd-shim-v2 pid=3629 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=katautils
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.409331737Z" level=info msg="shim disconnected" id=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.409378160Z" level=warning msg="cleaning up after shim disconnected" id=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 namespace=default
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.409385164Z" level=info msg="cleaning up dead shim"
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.425715930Z" level=error msg="failed to delete" cmd="/usr/bin/containerd-shim-kata-v2 -namespace default -address /run/containerd/containerd.sock -publish-binary /usr/local/bin/containerd -id bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 -bundle /run/containerd/io.containerd.runtime.v2.task/default/bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 delete" error="exit status 1"
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.425779746Z" level=warning msg="failed to clean up after shim disconnected" error="time="2023-03-29T10:45:40Z" level=warning msg="failed to cleanup container" container=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 error="open /run/vc/sbs/bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585: no such file or directory" name=containerd-shim-v2 pid=3774 sandbox=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 source=containerd-kata-shim-v2\nio.containerd.kata.v2: open /run/vc/sbs/bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585: no such file or directory: exit status 1" id=bd409e48d98ff9817b98bf29244980b955305d5524abbf722d3cae7072682585 namespace=default
Mar 29 10:45:40 wse-c0260 containerd[874]: time="2023-03-29T10:45:40.425835086Z" level=error msg="copy shim log" error="read /proc/self/fd/45: file already closed"
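I really don't understand what is going on. I have searched the internet but haven't found anything that solves my problem. Since both areas are new to me, I can't tell whether this is related to VFIO and device passthrough or to Kata.
I would appreciate any input that helps me move forward.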
Versions:
Ubuntu 22.04
Kata 3.1.0
containerd 1.6.18
nerdctl 1.2.1
Original question by GitHub user tse77. For further feedback on the project, please open an issue on GitHub: https://github.com/kata-containers/kata-containers/issues
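I'm not sure what is happening, but a few remarks:
- You started with /dev/vfio/11, not /dev/vfio/10. That should not matter, since it fails at the 8th container rather than the 12th.
- It would be useful to see the actual qemu command line, to check whether we can spot anything obvious.
- The dmesg log might give us more clues.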
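Editor's note: the two diagnostics requested above could be gathered roughly as follows. This is a minimal sketch, not something from the original thread. It assumes the hypervisor process name starts with qemu-system (the logs above name qemu-system-x86_64), and the function names are hypothetical.

```shell
# Sketch: print the full command line of every running QEMU process.
# Assumes the process name starts with "qemu-system".
qemu_cmdlines() {
    for pid in $(pgrep qemu-system); do
        # /proc/<pid>/cmdline separates arguments with NUL bytes.
        tr '\0' ' ' < "/proc/$pid/cmdline"
        echo
    done
}

# Keep only kernel log lines that mention VFIO or the IOMMU.
# "DMAR" is the prefix the kernel uses for Intel VT-d messages.
vfio_kernel_lines() {
    grep -i -E 'vfio|iommu|dmar'
}
```

Typical use would be `sudo sh -c '. ./diag.sh; qemu_cmdlines'` while a working container is running (diag.sh being a hypothetical file holding the functions), and `sudo dmesg | vfio_kernel_lines` right after the failing start.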
Original answer by GitHub user c3d.