Running a Kata container with Kubernetes fails: Could not setup vhost fds

I am trying to run Kata containers with Kubernetes. A regular nginx container runs fine, but the pod that uses the Kata runtime fails with this error:

Events:
  Type     Reason                  Age                   From               Message
  Normal   Scheduled               12m                   default-scheduler  Successfully assigned default/nginx-kata-containers to epyc-maggie
  Warning  FailedCreatePodSandBox  2m24s (x48 over 12m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = CreateContainer failed: Could not setup vhost fds eth0 : open /dev/vhost-net: no such file or directory: unknown
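The message says that /dev/vhost-net could not be opened on the host. Below is a minimal sketch for checking that on the node. The function name `vhost_net_status` is made up for illustration, and the check assumes vhost-net is built as a loadable module named `vhost_net`. If it is compiled into the kernel it will not show up in /proc/modules.

```python
import os
import stat


def vhost_net_status(device_path="/dev/vhost-net", modules_path="/proc/modules"):
    """Report whether the vhost-net device node exists and the module is listed."""
    status = {"device_exists": False, "is_char_device": False, "module_loaded": False}

    # The runtime fails at open(), so first check that the node is there at all.
    try:
        mode = os.stat(device_path).st_mode
        status["device_exists"] = True
        status["is_char_device"] = stat.S_ISCHR(mode)
    except OSError:
        pass  # node missing or not accessible

    # /proc/modules lists loaded modules, one per line, module name first.
    try:
        with open(modules_path) as modules:
            status["module_loaded"] = any(
                line.split()[0] == "vhost_net" for line in modules if line.strip()
            )
    except OSError:
        pass  # /proc not available (for example, not on Linux)

    return status


if __name__ == "__main__":
    print(vhost_net_status())
```

If `device_exists` is false, loading the module on the host (typically `modprobe vhost_net`) would be the first thing to try.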

.yaml files

apiVersion: v1
kind: Pod
metadata:
  name: nginx-kata-containers
spec:
  runtimeClassName: kata
  containers:
  - name: nginx
    image: nginx

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata

Configuration

kata-runtime kata-env

[Kernel]
  Path = "/usr/share/kata-containers/vmlinuz-5.19.2-98-snp"
  Parameters = "scsi_mod.scan=none agent.log=debug agent.log=debug"

[Meta]
  Version = "1.0.26"

[Image]
  Path = ""

[Initrd]
  Path = "/usr/share/kata-containers/kata-containers-initrd-2023-01-15-23:33:02.561090892+0100-807eeaafd"

[Hypervisor]
  MachineType = "q35"
  Version = "QEMU emulator version 6.1.50\nCopyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers"
  Path = "/home/user/qemu/build/qemu-system-x86_64"
  BlockDeviceDriver = "virtio-scsi"
  EntropySource = "/dev/urandom"
  SharedFS = "virtio-9p"
  VirtioFSDaemon = ""
  SocketPath = "<>"
  Msize9p = 8192
  MemorySlots = 10
  PCIeRootPort = 0
  HotplugVFIOOnRootBus = false
  Debug = true

[Runtime]
  Path = "/usr/local/bin/kata-runtime"
  Debug = true
  Trace = false
  DisableGuestSeccomp = true
  DisableNewNetNs = false
  SandboxCgroupOnly = false
  [Runtime.Config]
    Path = "/usr/share/defaults/kata-containers/configuration.toml"
  [Runtime.Version]
    OCI = "1.0.2-dev"
    [Runtime.Version.Version]
      Semver = "3.1.0-alpha0"
      Commit = "8246de821ff327beb750b6239ac6c9cc6272f217"
      Major = 3
      Minor = 1
      Patch = 0

[Host]
  AvailableGuestProtections = ["snp"]
  Kernel = "5.19.0-rc6-snp-host-a7065246cf78"
  Architecture = "amd64"
  VMContainerCapable = false
  SupportVSocks = true
  [Host.Distro]
    Name = "Ubuntu"
    Version = "20.04"
  [Host.CPU]
    Vendor = "AuthenticAMD"
    Model = "AMD EPYC 7313 16-Core Processor"
    CPUs = 32
  [Host.Memory]
    Total = 114616484
    Free = 83142388
    Available = 111471836

[Agent]
  Debug = true
  Trace = false

/usr/share/defaults/kata-containers/configuration.toml

Copyright (c) 2017-2019 Intel Corporation

Copyright (c) 2021 Adobe Inc.

SPDX-License-Identifier: Apache-2.0

XXX: WARNING: this file is auto-generated.

XXX:

XXX: Source file: "config/configuration-qemu.toml.in"

XXX: Project:

XXX: Name: Kata Containers

XXX: Type: kata

[hypervisor.qemu]
path = "/home/user/qemu/build/qemu-system-x86_64"
kernel = "/usr/share/kata-containers/vmlinuz-snp.container"
initrd = "/usr/share/kata-containers/kata-containers-initrd.img"
machine_type = "q35"

Enable confidential guest support.

Toggling that setting may trigger different hardware features, ranging

from memory encryption to both memory and CPU-state encryption and integrity.

The Kata Containers runtime dynamically detects the available feature set and

aims at enabling the largest possible one, returning an error if none is

available, or none is supported by the hypervisor.

Known limitations:

* Does not work by design:

- CPU Hotplug

- Memory Hotplug

- NVDIMM devices

Default false

confidential_guest = true

Choose AMD SEV-SNP confidential guests

In case of using confidential guests on AMD hardware that supports both SEV

and SEV-SNP, the following enables SEV-SNP guests. SEV guests are default.

Default false

sev_snp_guest = true

Enable running QEMU VMM as a non-root user.

By default QEMU VMM runs as root. When this is set to true, QEMU VMM process runs as

a non-root random user. See documentation for the limitations of this mode.

rootless = true

List of valid annotation names for the hypervisor

Each member of the list is a regular expression, which is the base name

of the annotation, e.g. "path" for "io.katacontainers.config.hypervisor.path"

enable_annotations = ["enable_iommu"]

List of valid annotations values for the hypervisor

Each member of the list is a path pattern as described by glob(3).

The default if not set is empty (all annotations rejected.)

Your distribution recommends: ["/usr/bin/qemu-system-x86_64"]

valid_hypervisor_paths = ["/usr/bin/qemu-system-x86_64"]

Optional space-separated list of options to pass to the guest kernel.

For example, use kernel_params = "vsyscall=emulate" if you are having

trouble running pre-2.15 glibc.

WARNING: - any parameter specified here will take priority over the default

parameter value of the same name used to start the virtual machine.

Do not set values here unless you understand the impact of doing so as you

may stop the virtual machine from booting.

To see the list of default parameters, enable hypervisor debug, create a

container and look for 'default-kernel-parameters' log entries.

kernel_params = " agent.log=debug"

Path to the firmware.

If you want qemu to use the default firmware, leave this option empty

firmware = "/home/user/kata-containers/tools/packaging/static-build/ovmf/opt/kata/share/ovmf/OVMF.fd"

Path to the firmware volume.

firmware TDVF or OVMF can be split into FIRMWARE_VARS.fd (UEFI variables

as configuration) and FIRMWARE_CODE.fd (UEFI program image). UEFI variables

can be customized per each user while UEFI code is kept same.

firmware_volume = ""

Machine accelerators

comma-separated list of machine accelerators to pass to the hypervisor.

For example, machine_accelerators = "nosmm,nosmbus,nosata,nopit,static-prt,nofw"

machine_accelerators=""

Qemu seccomp sandbox feature

comma-separated list of seccomp sandbox features to control the syscall access.

For example, seccompsandbox= "on,obsolete=deny,spawn=deny,resourcecontrol=deny"

Note: "elevateprivileges=deny" doesn't work with daemonize option, so it's removed from the seccomp sandbox

Another note: enabling this feature may reduce performance, you may enable

/proc/sys/net/core/bpf_jit_enable to reduce the impact. see https://man7.org/linux/man-pages/man8/bpfc.8.html

#seccompsandbox="on,obsolete=deny,spawn=deny,resourcecontrol=deny"

CPU features

comma-separated list of cpu features to pass to the cpu

For example, cpu_features = "pmu=off,vmx=off"

cpu_features="-vmx-rdseed-exit,pmu=off"

vCPUs pinning settings

if enabled, each vCPU thread will be scheduled to a fixed CPU

qualified condition: num(vCPU threads) == num(CPUs in sandbox's CPUSet)

enable_vcpus_pinning = false

Default number of vCPUs per SB/VM:

unspecified or 0 --> will be set to 1

< 0 --> will be set to the actual number of physical cores

> 0 <= number of physical cores --> will be set to the specified number

> number of physical cores --> will be set to the actual number of physical cores

default_vcpus = 1

Default maximum number of vCPUs per SB/VM:

unspecified or == 0 --> will be set to the actual number of physical cores or to the maximum number

of vCPUs supported by KVM if that number is exceeded

> 0 <= number of physical cores --> will be set to the specified number

> number of physical cores --> will be set to the actual number of physical cores or to the maximum number

of vCPUs supported by KVM if that number is exceeded

WARNING: Depending on the architecture, the maximum number of vCPUs supported by KVM is used when

the actual number of physical cores is greater than it.

WARNING: Be aware that this value impacts the virtual machine's memory footprint and CPU

hotplug functionality. For example, default_maxvcpus = 240 specifies that up to 240 vCPUs

can be added to a SB/VM, but the memory footprint will be big. Another example, with

default_maxvcpus = 8 the memory footprint will be small, but 8 will be the maximum number of

vCPUs supported by the SB/VM. In general, we recommend that you do not edit this variable,

unless you know what you are doing.

NOTICE: on arm platform with gicv2 interrupt controller, set it to 8.

default_maxvcpus = 0

Bridges can be used to hot plug devices.

Limitations:

* Currently only pci bridges are supported

* Up to 30 devices per bridge can be hot plugged.

* Up to 5 PCI bridges can be cold plugged per VM.

This limitation could be a bug in qemu or in the kernel

Default number of bridges per SB/VM:

unspecified or 0 --> will be set to 1

> 1 <= 5 --> will be set to the specified number

> 5 --> will be set to 5

default_bridges = 1

Default memory size in MiB for SB/VM.

If unspecified then it will be set to 2048 MiB.

default_memory = 2048

Default memory slots per SB/VM.

If unspecified then it will be set to 10.

This determines how many times memory can be hot-added to the sandbox/VM.

#memory_slots = 10

Default maximum memory in MiB per SB / VM

unspecified or == 0 --> will be set to the actual amount of physical RAM

> 0 <= amount of physical RAM --> will be set to the specified number

> amount of physical RAM --> will be set to the actual amount of physical RAM

default_maxmemory = 0

This size in MiB will be added to the maximum memory of the hypervisor.

It is the memory address space for the NVDIMM device.

If the block storage driver (block_device_driver) is set to "nvdimm",

memory_offset should be set to the size of the block device.

Default 0

#memory_offset = 0

Specifies whether virtio-mem is enabled.

Please note that this option should be used with the command

"echo 1 > /proc/sys/vm/overcommit_memory".

Default false

#enable_virtio_mem = true

Disable block device from being used for a container's rootfs.

In case of a storage driver like devicemapper where a container's

root file system is backed by a block device, the block device is passed

directly to the hypervisor for performance reasons.

This flag prevents the block device from being passed to the hypervisor,

virtio-fs is used instead to pass the rootfs.

disable_block_device_use = false

Shared file system type:

- virtio-fs (default)

- virtio-9p

- virtio-fs-nydus

#shared_fs = "virtio-fs"
shared_fs = "virtio-9p"

Path to vhost-user-fs daemon.

#virtio_fs_daemon = "/usr/libexec/virtiofsd"

List of valid annotations values for the virtiofs daemon

The default if not set is empty (all annotations rejected.)

Your distribution recommends: ["/usr/libexec/virtiofsd"]

valid_virtio_fs_daemon_paths = ["/usr/libexec/virtiofsd"]

Default size of DAX cache in MiB

virtio_fs_cache_size = 0

Default size of virtqueues

virtio_fs_queue_size = 1024

Extra args for virtiofsd daemon

Format example:

["-o", "arg1=xxx,arg2", "-o", "hello world", "--arg3=yyy"]

Examples:

Set virtiofsd log level to debug : ["-o", "log_level=debug"] or ["-d"]

see virtiofsd -h for possible options.

virtio_fs_extra_args = ["--thread-pool-size=1", "-o", "announce_submounts"]

Cache mode:

- none

Metadata, data, and pathname lookup are not cached in guest. They are

always fetched from host and any changes are immediately pushed to host.

- auto

Metadata and pathname lookup cache expires after a configured amount of

time (default is 1 second). Data is cached while the file is open (close

to open consistency).

- always

Metadata, data, and pathname lookup are cached in guest and never expire.

virtio_fs_cache = "auto"

Block storage driver to be used for the hypervisor in case the container

rootfs is backed by a block device. This is virtio-scsi, virtio-blk

or nvdimm.

block_device_driver = "virtio-scsi"

aio is the I/O mechanism used by qemu

Options:

- threads

Pthread based disk I/O.

- native

Native Linux I/O.

- io_uring

Linux io_uring API. This provides the fastest I/O operations on Linux, requires kernel>5.1 and

qemu >=5.0.

block_device_aio = "io_uring"

Specifies whether cache-related options are set for block devices.

Default false

#block_device_cache_set = true

Specifies cache-related options for block devices.

Denotes whether use of O_DIRECT (bypass the host page cache) is enabled.

Default false

#block_device_cache_direct = true

Specifies cache-related options for block devices.

Denotes whether flush requests for the device are ignored.

Default false

#block_device_cache_noflush = true

Enable iothreads (data-plane) to be used. This causes IO to be

handled in a separate IO thread. This is currently only implemented

for SCSI.

enable_iothreads = false

Enable pre allocation of VM RAM, default false

Enabling this will result in lower container density

as all of the memory will be allocated and locked

This is useful when you want to reserve all the memory

upfront or in the cases where you want memory latencies

to be very predictable

Default false

#enable_mem_prealloc = true

Enable huge pages for VM RAM, default false

Enabling this will result in the VM memory

being allocated using huge pages.

This is useful when you want to use vhost-user network

stacks within the container. This will automatically

result in memory pre allocation

#enable_hugepages = true
enable_hugepages = true

Enable vhost-user storage device, default false

Enabling this will result in some Linux reserved block type

major range 240-254 being chosen to represent vhost-user devices.

#enable_vhost_user_store = false
enable_vhost_user_store = true

The base directory specifically used for vhost-user devices.

Its sub-path "block" is used for block devices; "block/sockets" is

where we expect vhost-user sockets to live; "block/devices" is where

simulated block device nodes for vhost-user devices to live.

vhost_user_store_path = "/var/run/kata-containers/vhost-user"

Enable vIOMMU, default false

Enabling this will result in the VM having a vIOMMU device

This will also add the following options to the kernel's

command line: intel_iommu=on,iommu=pt

#enable_iommu = true

Enable IOMMU_PLATFORM, default false

Enabling this will result in the VM device having iommu_platform=on set

#enable_iommu_platform = true

List of valid annotations values for the vhost user store path

The default if not set is empty (all annotations rejected.)

Your distribution recommends: ["/var/run/kata-containers/vhost-user"]

valid_vhost_user_store_paths = ["/var/run/kata-containers/vhost-user"]

Enable file based guest memory support. The default is an empty string which

will disable this feature. In the case of virtio-fs, this is enabled

automatically and '/dev/shm' is used as the backing folder.

This option will be ignored if VM templating is enabled.

#file_mem_backend = ""

List of valid annotations values for the file_mem_backend annotation

The default if not set is empty (all annotations rejected.)

Your distribution recommends: [""]

valid_file_mem_backends = [""]

-pflash can add image file to VM. The arguments of it should be in format

of ["/path/to/flash0.img", "/path/to/flash1.img"]

pflashes = []

This option changes the default hypervisor and kernel parameters

to enable debug output where available.

Default false

enable_debug = true

Disable the customizations done in the runtime when it detects

that it is running on top of a VMM. This will result in the runtime

behaving as it would when running on bare metal.

#disable_nesting_checks = true

This is the msize used for 9p shares. It is the number of bytes

used for 9p packet payload.

#msize_9p = 8192

If false and nvdimm is supported, use nvdimm device to plug guest image.

Otherwise virtio-block device is used.

nvdimm is not supported when confidential_guest = true.

Default is false

#disable_image_nvdimm = true

VFIO devices are hotplugged on a bridge by default.

Enable hotplugging on root bus. This may be required for devices with

a large PCI bar, as this is a current limitation with hotplugging on

a bridge.

Default false

#hotplug_vfio_on_root_bus = true

Before hot plugging a PCIe device, you need to add a pcie_root_port device.

Use this parameter when using some large PCI bar devices, such as Nvidia GPU

The value means the number of pcie_root_port

This value is valid when hotplug_vfio_on_root_bus is true and machine_type is "q35"

Default 0

#pcie_root_port = 2

If vhost-net backend for virtio-net is not desired, set to true. Default is false, which trades off

security (vhost-net runs ring0) for network I/O performance.

#disable_vhost_net = true

Default entropy source.

The path to a host source of entropy (including a real hardware RNG)

/dev/urandom and /dev/random are two main options.

Be aware that /dev/random is a blocking source of entropy. If the host

runs out of entropy, the VM's boot time will increase, leading to startup

timeouts.

The source of entropy /dev/urandom is non-blocking and provides a

generally acceptable source of entropy. It should work well for pretty much

all practical purposes.

#entropy_source= "/dev/urandom"

List of valid annotations values for entropy_source

The default if not set is empty (all annotations rejected.)

Your distribution recommends: ["/dev/urandom","/dev/random",""]

valid_entropy_sources = ["/dev/urandom","/dev/random",""]

Path to OCI hook binaries in the guest rootfs.

This does not affect host-side hooks which must instead be added to

the OCI spec passed to the runtime.

You can create a rootfs with hooks by customizing the osbuilder scripts:

https://github.com/kata-containers/kata-containers/tree/main/tools/osbuilder

Hooks must be stored in a subdirectory of guest_hook_path according to their

hook type, i.e. "guest_hook_path/{prestart,poststart,poststop}".

The agent will scan these directories for executable files and add them, in

lexicographical order, to the lifecycle of the guest container.

Hooks are executed in the runtime namespace of the guest. See the official documentation:

https://github.com/opencontainers/runtime-spec/blob/v1.0.1/config.md#posix-platform-hooks

Warnings will be logged if any error is encountered while scanning for hooks,

but it will not abort container execution.

#guest_hook_path = "/usr/share/oci/hooks"

Use rx Rate Limiter to control network I/O inbound bandwidth(size in bits/sec for SB/VM).

In Qemu, we use classful qdiscs HTB(Hierarchy Token Bucket) to discipline traffic.

Default 0-sized value means unlimited rate.

#rx_rate_limiter_max_rate = 0

Use tx Rate Limiter to control network I/O outbound bandwidth(size in bits/sec for SB/VM).

In Qemu, we use classful qdiscs HTB(Hierarchy Token Bucket) and ifb(Intermediate Functional Block)

to discipline traffic.

Default 0-sized value means unlimited rate.

#tx_rate_limiter_max_rate = 0

Set where to save the guest memory dump file.

If set, when a GUEST_PANICKED event occurs,

guest memory will be dumped to the host filesystem under guest_memory_dump_path.

This directory will be created automatically if it does not exist.

The dumped file(also called vmcore) can be processed with crash or gdb.

WARNING:

Dumping the guest's memory can take a very long time depending on the amount of guest memory

and can use a lot of disk space.

#guest_memory_dump_path="/var/crash/kata"

Whether to enable paging.

Basically, if you want to use "gdb" rather than "crash",

or need the guest-virtual addresses in the ELF vmcore,

then you should enable paging.

See: https://www.qemu.org/docs/master/qemu-qmp-ref.html#Dump-guest-memory for details

#guest_memory_dump_paging=false

Enable swap in the guest. Default false.

When enable_guest_swap is enabled, insert a raw file to the guest as the swap device

if the swappiness of a container (set by annotation "io.katacontainers.container.resource.swappiness")

is bigger than 0.

The size of the swap device should be

swap_in_bytes (set by annotation "io.katacontainers.container.resource.swap_in_bytes") - memory_limit_in_bytes.

If swap_in_bytes is not set, the size should be memory_limit_in_bytes.

If swap_in_bytes and memory_limit_in_bytes are not set, the size should

be default_memory.

#enable_guest_swap = true

use legacy serial for guest console if available and implemented for architecture. Default false

#use_legacy_serial = true

disable applying SELinux on the VMM process (default false)

disable_selinux=false

[factory]

VM templating support. Once enabled, new VMs are created from template

using vm cloning. They will share the same initial kernel, initramfs and

agent memory by mapping it readonly. It helps speeding up new container

creation and saves a lot of memory if there are many kata containers running

on the same host.

When disabled, new VMs are created from scratch.

Note: Requires "initrd=" to be set ("image=" is not supported).

Default false

#enable_template = true

Specifies the path of template.

Default "/run/vc/vm/template"

#template_path = "/run/vc/vm/template"

The number of caches of VMCache:

unspecified or == 0 --> VMCache is disabled

> 0 --> will be set to the specified number

VMCache is a function that creates VMs as caches before using it.

It helps speed up new container creation.

The function consists of a server and some clients communicating

through Unix socket. The protocol is gRPC in protocols/cache/cache.proto.

The VMCache server will create some VMs and cache them by factory cache.

It will convert the VM to gRPC format and transport it when it gets

a request from clients.

Factory grpccache is the VMCache client. It will request gRPC format

VM and convert it back to a VM. If VMCache function is enabled,

kata-runtime will request VM from factory grpccache when it creates

a new sandbox.

Default 0

#vm_cache_number = 0

Specify the address of the Unix socket that is used by VMCache.

Default /var/run/kata-containers/cache.sock

#vm_cache_endpoint = "/var/run/kata-containers/cache.sock"

[agent.kata]

If enabled, make the agent display debug-level messages.

(default: disabled)

enable_debug = true

Enable agent tracing.

If enabled, the agent will generate OpenTelemetry trace spans.

Notes:

- If the runtime also has tracing enabled, the agent spans will be

associated with the appropriate runtime parent span.

- If enabled, the runtime will wait for the container to shutdown,

increasing the container shutdown time slightly.

(default: disabled)

#enable_tracing = true

Comma separated list of kernel modules and their parameters.

These modules will be loaded in the guest kernel using modprobe(8).

The following example can be used to load two kernel modules with parameters

- kernel_modules=["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915 enable_ppgtt=0"]

The first word is considered as the module name and the rest as its parameters.

Container will not be started when:

* A kernel module is specified and the modprobe command is not installed in the guest

or it fails loading the module.

* The module is not available in the guest or it doesn't meet the guest kernel

requirements, like architecture and version.

kernel_modules=[]

Enable debug console.

If enabled, user can connect guest OS running inside hypervisor

through "kata-runtime exec " command

#debug_console_enabled = true

Agent connection dialing timeout value in seconds

(default: 30)

#dial_timeout = 30

[runtime]

If enabled, the runtime will log additional debug messages to the

system log

(default: disabled)

enable_debug = true

Internetworking model

Determines how the VM should be connected to

the container network interface

Options:

- macvtap

Used when the Container network interface can be bridged using

macvtap.

- none

Used when customizing the network. Only creates a tap device. No veth pair.

- tcfilter

Uses tc filter rules to redirect traffic from the network interface

provided by plugin to a tap interface connected to the VM.

internetworking_model="tcfilter"

disable guest seccomp

Determines whether container seccomp profiles are passed to the virtual

machine and applied by the kata agent. If set to true, seccomp is not applied

within the guest

(default: true)

disable_guest_seccomp=true

If enabled, the runtime will create opentracing.io traces and spans.

(See https://www.jaegertracing.io/docs/getting-started).

(default: disabled)

#enable_tracing = true

Set the full url to the Jaeger HTTP Thrift collector.

The default if not set will be "http://localhost:14268/api/traces"

#jaeger_endpoint = ""

Sets the username to be used if basic auth is required for Jaeger.

#jaeger_user = ""

Sets the password to be used if basic auth is required for Jaeger.

#jaeger_password = ""

If enabled, the runtime will not create a network namespace for shim and hypervisor processes.

This option may have some potential impacts to your host. It should only be used when you know what you're doing.

disable_new_netns conflicts with internetworking_model=tcfilter and internetworking_model=macvtap. It works only

with internetworking_model=none. The tap device will be in the host network namespace and can connect to a bridge

(like OVS) directly.

(default: false)

#disable_new_netns = true

if enabled, the runtime will add all the kata processes inside one dedicated cgroup.

The container cgroups in the host are not created, just one single cgroup per sandbox.

The runtime caller is free to restrict or collect cgroup stats of the overall Kata sandbox.

The sandbox cgroup path is the parent cgroup of a container with the PodSandbox annotation.

The sandbox cgroup is constrained if there is no container type annotation.

See: https://pkg.go.dev/github.com/kata-containers/kata-containers/src/runtime/virtcontainers#ContainerType

sandbox_cgroup_only=false

If enabled, the runtime will attempt to determine appropriate sandbox size (memory, CPU) before booting the virtual machine. In

this case, the runtime will not dynamically update the amount of memory and CPU in the virtual machine. This is generally helpful

when a hardware architecture or hypervisor solution is used that does not support CPU and/or memory hotplug.

Compatibility for determining appropriate sandbox (VM) size:

- When running with pods, sandbox sizing information will only be available if using Kubernetes >= 1.23 and containerd >= 1.6. CRI-O

does not yet support sandbox sizing annotations.

- When running single containers using a tool like ctr, container sizing information will be available.

static_sandbox_resource_mgmt=false

If specified, sandbox_bind_mounts identifies host paths to be mounted (ro) into the sandbox's shared path.

This is only valid if filesystem sharing is utilized. The provided path(s) will be bindmounted into the shared fs directory.

If defaults are utilized, these mounts should be available in the guest at /run/kata-containers/shared/containers/sandbox-mounts

These will not be exposed to the container workloads, and are only provided for potential guest services.

sandbox_bind_mounts=[]

VFIO Mode

Determines how VFIO devices should be presented to the container.

Options:

- vfio

Matches behaviour of OCI runtimes (e.g. runc) as much as

possible. VFIO devices will appear in the container as VFIO

character devices under /dev/vfio. The exact names may differ

from the host (they need to match the VM's IOMMU group numbers

rather than the host's)

- guest-kernel

This is a Kata-specific behaviour that's useful in certain cases.

The VFIO device is managed by whatever driver in the VM kernel

claims it. This means it will appear as one or more device nodes

or network interfaces depending on the nature of the device.

Using this mode requires specially built workloads that know how

to locate the relevant device interfaces within the VM.

vfio_mode="guest-kernel"

If enabled, the runtime will not create Kubernetes emptyDir mounts on the guest filesystem. Instead, emptyDir mounts will

be created on the host and shared via virtio-fs. This is potentially slower, but allows sharing of files from host to guest.

disable_guest_empty_dir=false

Enabled experimental feature list, format: ["a", "b"].

Experimental features are features not stable enough for production,

they may break compatibility, and are prepared for a big version bump.

Supported experimental features:

(default: [])

experimental=[]

If enabled, user can run pprof tools with shim v2 process through kata-monitor.

(default: false)

enable_pprof = true

WARNING: All the options in the following section have not been implemented yet.

This section was added as a placeholder. DO NOT USE IT!

[image]

Container image service.

Offload the CRI image management service to the Kata agent.

(default: false)

#service_offload = true

Container image decryption keys provisioning.

Applies only if service_offload is true.

Keys can be provisioned locally (e.g. through a special command or

a local file) or remotely (usually after the guest is remotely attested).

The provision setting is a complete URL that lets the Kata agent decide

which method to use in order to fetch the keys.

Keys can be stored in a local file, in a measured and attested initrd:

#provision=data:///local/key/file

Keys could be fetched through a special command or binary from the

initrd (guest) image, e.g. a firmware call:

#provision=file:///path/to/bin/fetcher/in/guest

Keys can be remotely provisioned. The Kata agent fetches them from e.g.

a HTTPS URL:

#provision=https://my-key-broker.foo/tenant/
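In the configuration above, disable_vhost_net is left commented out, so the documented default (false) presumably applies and the runtime tries to use the vhost-net backend. A rough sketch for reading the effective value of a key from a Kata configuration file follows. The helper name `effective_setting` is hypothetical. It does a plain line scan, not a full TOML parse, so it ignores section headers and inline comments.

```python
import re


def effective_setting(config_text, key, default=None):
    """Return the raw value of the last uncommented `key = value` line, else default."""
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*=\s*(.+?)\s*$")
    value = default
    for line in config_text.splitlines():
        # Lines starting with '#' are comments, including commented-out settings.
        if line.lstrip().startswith("#"):
            continue
        match = pattern.match(line)
        if match:
            value = match.group(1)
    return value


if __name__ == "__main__":
    # Excerpt mirroring the relevant lines of the configuration shown above.
    sample = '#disable_vhost_net = true\ninternetworking_model="tcfilter"\n'
    print(effective_setting(sample, "disable_vhost_net", default="false"))
    print(effective_setting(sample, "internetworking_model"))
```

Values come back as raw text (for example the string `true`, or a quoted string including its quotes), which is enough for a quick check but not for general TOML handling.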

/etc/containerd/config.toml

disabled_plugins = []
imports = []
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  format = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  tcp_address = ""
  tcp_tls_ca = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

[plugins."io.containerd.gc.v1.scheduler"]
  deletion_threshold = 0
  mutation_threshold = 100
  pause_threshold = 0.02
  schedule_delay = "0s"
  startup_delay = "100ms"

[plugins."io.containerd.grpc.v1.cri"]
  device_ownership_from_security_context = false
  disable_apparmor = false
  disable_cgroup = false
  disable_hugetlb_controller = true
  disable_proc_mount = false
  disable_tcp_service = true
  enable_selinux = false
  enable_tls_streaming = false
  enable_unprivileged_icmp = false
  enable_unprivileged_ports = false
  ignore_image_defined_volumes = false
  max_concurrent_downloads = 3
  max_container_log_line_size = 16384
  netns_mounts_under_state_dir = false
  restrict_oom_score_adj = false
  sandbox_image = "registry.k8s.io/pause:3.6"
  selinux_category_range = 1024
  stats_collect_period = 10
  stream_idle_timeout = "4h0m0s"
  stream_server_address = "127.0.0.1"
  stream_server_port = "0"
  systemd_cgroup = false
  tolerate_missing_hugetlb_controller = true
  unset_seccomp_profile = ""

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
  conf_template = ""
  ip_pref = ""
  max_conf_num = 1

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
  disable_snapshot_annotations = true
  discard_unpacked_layers = false
  ignore_rdt_not_enabled_errors = false
  no_pivot = false
  snapshotter = "overlayfs"

  [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
    base_runtime_spec = ""
    cni_conf_dir = ""
    cni_max_conf_num = 0
    container_annotations = []
    pod_annotations = []
    privileged_without_host_devices = false
    runtime_engine = ""
    runtime_path = ""
    runtime_root = ""
    runtime_type = ""

    [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata.options]
      ConfigPath = "/etc/kata-containers/configuration.toml"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      base_runtime_spec = ""
      cni_conf_dir = ""
      cni_max_conf_num = 0
      container_annotations = []
      pod_annotations = []
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_path = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v1"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        BinaryName = "runc"
        CriuImagePath = ""
        CriuPath = ""
        CriuWorkPath = ""
        IoGid = 0
        IoUid = 0
        NoNewKeyring = false
        NoPivotRoot = false
        Root = ""
        ShimCgroup = ""
        SystemdCgroup = false

  [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
    base_runtime_spec = ""
    cni_conf_dir = ""
    cni_max_conf_num = 0
    container_annotations = []
    pod_annotations = []
    privileged_without_host_devices = false
    runtime_engine = ""
    runtime_path = ""
    runtime_root = ""
    runtime_type = ""

    [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

[plugins."io.containerd.grpc.v1.cri".image_decryption]
  key_model = "node"

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = ""

  [plugins."io.containerd.grpc.v1.cri".registry.auths]

  [plugins."io.containerd.grpc.v1.cri".registry.configs]

  [plugins."io.containerd.grpc.v1.cri".registry.headers]

  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
  tls_cert_file = ""
  tls_key_file = ""

[plugins."io.containerd.internal.v1.opt"]
  path = "/opt/containerd"

[plugins."io.containerd.internal.v1.restart"]
  interval = "10s"

[plugins."io.containerd.internal.v1.tracing"]
  sampling_ratio = 1.0
  service_name = "containerd"

[plugins."io.containerd.metadata.v1.bolt"]
  content_sharing_policy = "shared"

[plugins."io.containerd.monitor.v1.cgroups"]
  no_prometheus = false

[plugins."io.containerd.runtime.v1.linux"]
  no_shim = false
  runtime = "runc"
  runtime_root = ""
  shim = "containerd-shim"
  shim_debug = false

[plugins."io.containerd.runtime.v2.task"]
  platforms = ["linux/amd64"]
  sched_core = false

[plugins."io.containerd.service.v1.diff-service"]
  default = ["walking"]

[plugins."io.containerd.service.v1.tasks-service"]
  rdt_config_file = ""

[plugins."io.containerd.snapshotter.v1.aufs"]
  root_path = ""

[plugins."io.containerd.snapshotter.v1.btrfs"]
  root_path = ""

[plugins."io.containerd.snapshotter.v1.devmapper"]
  async_remove = false
  base_image_size = ""
  discard_blocks = false
  fs_options = ""
  fs_type = ""
  pool_name = ""
  root_path = ""

[plugins."io.containerd.snapshotter.v1.native"]
  root_path = ""

[plugins."io.containerd.snapshotter.v1.overlayfs"]
  root_path = ""
  upperdir_label = false

[plugins."io.containerd.snapshotter.v1.zfs"]
  root_path = ""

[plugins."io.containerd.tracing.processor.v1.otlp"]
  endpoint = ""
  insecure = false
  protocol = ""

[proxy_plugins]

[stream_processors]

[stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
  accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
  args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
  env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
  path = "ctd-decoder"
  returns = "application/vnd.oci.image.layer.v1.tar"

[stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
  accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
  args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
  env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
  path = "ctd-decoder"
  returns = "application/vnd.oci.image.layer.v1.tar+gzip"

[timeouts]
  "io.containerd.timeout.bolt.open" = "0s"
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[ttrpc]
  address = ""
  gid = 0
  uid = 0
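
The RuntimeClass in the question uses `handler: kata`, and that handler has to match a runtime name registered under `plugins."io.containerd.grpc.v1.cri".containerd.runtimes` in the containerd configuration. As a minimal sketch for checking this on a node, the helper below (the function name `list_cri_runtimes` is made up for illustration) reads containerd TOML on stdin and prints the registered runtime names. It assumes the table headers are written one per line, as in the dump above.

```shell
#!/usr/bin/env bash
# List the runtime handler names registered under the CRI plugin.
# Input: containerd TOML on stdin, for example the output of
# `containerd config dump`.
list_cri_runtimes() {
  # Match headers of the form
  #   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.<name>]
  # and print only <name>. Nested tables such as "<name>.options" do not
  # match, because the closing bracket must follow the name directly.
  sed -n 's/^[[:space:]]*\[plugins\."io\.containerd\.grpc\.v1\.cri"\.containerd\.runtimes\.\([A-Za-z0-9_-]*\)][[:space:]]*$/\1/p'
}

# Usage on a node (assumes containerd is on PATH):
#   containerd config dump | list_cri_runtimes
```

For the configuration shown here, the output should include both `kata` and `runc`. That suggests the handler mapping itself is in place and the failure comes from a later step.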

How can I resolve this problem?

Originally asked by GitHub user immersommer. For further feedback on the project, please open an issue on GitHub: https://github.com/kata-containers/kata-containers/issues

码字王 2023-05-17 16:09:05
2 answers
  • 值得去的地方都没有捷径

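    According to the error message, the kubelet failed while creating the pod sandbox because /dev/vhost-net could not be opened. This device node is provided by the vhost-net driver and is needed for the vhost network mode.

    The suggested fix is to add --feature-gates=DevicePlugins=true to the kubelet startup arguments to enable the device plugin feature. Device plugins are a Kubernetes mechanism for managing device resources on a node, such as GPUs, FPGAs, and RDMA devices. Once device plugins are enabled, a container can request device resources on the node through its resource configuration.

    The steps are as follows.

    Edit the kubelet configuration file /etc/default/kubelet and add the following argument:

        KUBELET_EXTRA_ARGS="--feature-gates=DevicePlugins=true"

    Restart the kubelet service:

        systemctl restart kubelet

    Add the following configuration to the Pod spec to request the vhost-net device:

        apiVersion: v1
        kind: Pod
        metadata:
          name: nginx-kata-containers
        spec:
          runtimeClassName: kata
          containers:
          - name: nginx
            image: nginx
          nodeSelector:
            kubernetes.io/arch: amd64
          tolerations:
          - key: node.kubernetes.io/not-ready
            operator: Exists
            effect: NoExecute
            tolerationSeconds: 300
          - key: node.kubernetes.io/unreachable
            operator: Exists
            effect: NoExecute
            tolerationSeconds: 300
          volumes:
          - name: vhost-net
            hostPath:
              path: /dev/vhost-net
          securityContext:
            seLinuxOptions:
              type: spc_t
          # request the vhost-net device
          devices:
          - name: vhost-net
            devicePath: /dev/vhost-net

    This is intended to let the Pod request the vhost-net device. Note that this approach only applies to single-node clusters; a multi-node cluster needs a more complex network configuration.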

    2023-05-23 15:32:07
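
The event in the question reports `open /dev/vhost-net: no such file or directory`, so a reasonable first check is whether that device node exists on the node that runs the pod (`epyc-maggie` in the events). The sketch below is one way to do that check. The function names `check_vhost_dev` and `module_loaded` are made up for illustration, and the module check assumes the driver is built as a loadable module, since a driver compiled into the kernel does not appear in /proc/modules.

```shell
#!/usr/bin/env bash
# Report whether a vhost-net device node exists and is a character device.
# $1: device path (defaults to /dev/vhost-net, the path in the error message).
check_vhost_dev() {
  local dev="${1:-/dev/vhost-net}"
  if [ -c "$dev" ]; then
    echo "present: $dev"
  else
    echo "missing: $dev"
    return 1
  fi
}

# Report (via exit status) whether a kernel module is listed as loaded.
# $1: module name with underscores, as it appears in /proc/modules.
# $2: module list file (defaults to /proc/modules).
module_loaded() {
  local mod="$1" list="${2:-/proc/modules}"
  grep -q "^${mod} " "$list"
}

# Usage on the node:
#   check_vhost_dev
#   module_loaded vhost_net && echo "vhost_net is loaded"
```

If `check_vhost_dev` prints `missing: /dev/vhost-net`, the device node is absent on the host itself, which matches the error in the pod events.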
  • Try loading the following modules:

        modprobe kvm
        modprobe vhost
        modprobe vhost-vsock
        modprobe vhost-net

    Original answer by a GitHub user. For further feedback on the project, please open an issue on GitHub: https://github.com/kata-containers/kata-containers/issues

    2023-05-17 16:23:39
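
Modules loaded with `modprobe` are gone after a reboot, so the commands in the answer above may need to be repeated or made persistent. The following is a minimal sketch, assuming a systemd-based host (the question reports Ubuntu 20.04) where files under /etc/modules-load.d are read at boot. The function names and the file name `kata-containers.conf` are hypothetical choices, not something Kata Containers prescribes.

```shell
#!/usr/bin/env bash
# Modules named in the answer above.
KATA_MODULES="kvm vhost vhost-vsock vhost-net"

# Load each module now. Needs root; stops at the first failure.
load_modules() {
  local m
  for m in $KATA_MODULES; do
    modprobe "$m" || return 1
  done
}

# Write the module list, one name per line, to a modules-load.d style
# file so the modules are loaded again at boot.
# $1: target file (the default file name is a hypothetical choice).
persist_modules() {
  local conf="${1:-/etc/modules-load.d/kata-containers.conf}"
  # $KATA_MODULES is left unquoted on purpose so each name is one argument.
  printf '%s\n' $KATA_MODULES > "$conf"
}

# Usage on the node, as root:
#   load_modules && persist_modules
#   ls -l /dev/vhost-net
```

Once /dev/vhost-net exists, no further action should be needed for the pod: the events in the question show the kubelet retrying sandbox creation repeatedly (x48 over 12m).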
