Non-root Containers And Devices
Author: Mikko Ylinen (Intel)
The user/group ID-related security settings in a Pod's securityContext trigger a problem when users want to deploy containers that use accelerator devices (via Kubernetes Device Plugins) on Linux. In this blog post I talk about the problem and describe the work done so far to address it. It's not meant to be a long story about getting the k/k issue fixed. Instead, this post aims to raise awareness of the issue and to highlight important device use-cases, too. This is needed as Kubernetes works on new related features, such as support for user namespaces.
Why non-root containers can't use devices and why it matters
One of the key security principles for running containers in Kubernetes is the principle of least privilege. The Pod/container securityContext specifies the configuration options to set (e.g., Linux capabilities, MAC policies, and user/group ID values) to achieve this. Furthermore, cluster admins are supported with tools like PodSecurityPolicy (deprecated) or Pod Security Admission (alpha) to enforce the desired security settings for pods that are being deployed in the cluster. These settings could, for instance, require that containers must be runAsNonRoot, or that they are forbidden from running with root's group ID in runAsGroup or supplementalGroups.
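As a purely illustrative example, a Pod following these principles might carry a securityContext like the one below; the ID values, name, and image are hypothetical placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: least-privilege-example    # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true             # reject containers that would start as UID 0
    runAsUser: 1000                # arbitrary non-root user ID
    runAsGroup: 2000               # arbitrary non-root primary group ID
    supplementalGroups: [3000]     # extra non-root group memberships
  containers:
  - name: app
    image: example.com/app:latest  # hypothetical image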
In Kubernetes, the kubelet builds the list of Device resources to be made available to a container (based on inputs from the Device Plugins), and the list is included in the CreateContainer CRI message sent to the CRI container runtime. Each Device carries only a little information: the host/container device paths and the desired devices cgroup permissions.
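Roughly, each Device entry in that message carries no more than the following (field names from the CRI API; /dev/foo is a hypothetical device):

container_path: /dev/foo   # device path inside the container
host_path: /dev/foo        # device path on the host
permissions: rwm           # devices cgroup permissions: read, write, mknod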
The OCI Runtime Spec for Linux Container Configuration expects that, in addition to the devices cgroup fields, more detailed information about the devices is provided:
{
    "type": "<string>",
    "path": "<string>",
    "major": <int64>,
    "minor": <int64>,
    "fileMode": <uint32>,
    "uid": <uint32>,
    "gid": <uint32>
},
The CRI container runtimes (containerd, CRI-O) are responsible for obtaining this information from the host for each Device. By default, the runtimes copy the host device's user and group IDs:

uid (uint32, OPTIONAL) - id of device owner in the container namespace.
gid (uint32, OPTIONAL) - id of device group in the container namespace.
Similarly, the runtimes prepare other mandatory config.json sections based on the CRI fields, including the ones defined in securityContext: runAsUser/runAsGroup, which become part of the POSIX platforms' user structure via:

uid (int, REQUIRED) specifies the user ID in the container namespace.
gid (int, REQUIRED) specifies the group ID in the container namespace.
additionalGids (array of ints, OPTIONAL) specifies additional group IDs in the container namespace to be added to the process.
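For instance, a securityContext with runAsUser: 1000, runAsGroup: 2000 and supplementalGroups: [3000] (hypothetical values) ends up as roughly the following user entry under process in config.json:

"user": {
    "uid": 1000,
    "gid": 2000,
    "additionalGids": [3000]
}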
However, the resulting config.json triggers a problem when trying to run containers with both devices added and a non-root uid/gid set via runAsUser/runAsGroup: the container user process has no permission to use the device, even when the device's group ID (gid, copied from the host) would grant access to non-root groups. This is because the container user does not belong to that host group (e.g., via additionalGids).
Being able to run applications that use devices as a non-root user is normal and expected to work, so that the security principles can be met. Therefore, several alternatives were considered to fill the gap with what the Pod securityContext, CRI, and OCI support today.
What was done to solve the issue?
You might have noticed from the problem definition that it would at least be possible to work around the problem by manually adding the device gid(s) to supplementalGroups or, in the case of just one device, by setting runAsGroup to the device's group ID. However, this is problematic because the device gid(s) may have different values depending on the nodes' distro/version in the cluster. For example, with GPUs, the following commands return different gids on different distros and versions:
Fedora 33:
$ ls -l /dev/dri/
total 0
drwxr-xr-x. 2 root root 80 19.10. 10:21 by-path
crw-rw----+ 1 root video 226, 0 19.10. 10:42 card0
crw-rw-rw-. 1 root render 226, 128 19.10. 10:21 renderD128
$ grep -e video -e render /etc/group
video:x:39:
render:x:997:
Ubuntu 20.04:
$ ls -l /dev/dri/
total 0
drwxr-xr-x 2 root root 80 19.10. 17:36 by-path
crw-rw---- 1 root video 226, 0 19.10. 17:36 card0
crw-rw---- 1 root render 226, 128 19.10. 17:36 renderD128
$ grep -e video -e render /etc/group
video:x:44:
render:x:133:
Which number should you choose in your securityContext? And what if the runAsGroup/runAsUser values cannot be hard-coded because they are automatically assigned at pod admission time via external security policies?
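To make the fragility concrete, a hard-coded workaround for the GPU case above would look roughly like this; the supplemental group 44 matches the video group only on the Ubuntu 20.04 node and would have to be 39 on the Fedora 33 node:

spec:
  securityContext:
    runAsUser: 1000            # hypothetical non-root user
    runAsGroup: 2000           # hypothetical non-root group
    supplementalGroups: [44]   # host 'video' gid on Ubuntu 20.04; 39 on Fedora 33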
Unlike volumes with fsGroup, devices have no official notion of deviceGroup/deviceUser that the CRI runtimes (or the kubelet) would be able to use. We considered using container annotations set by the device plugins (e.g., io.kubernetes.cri.hostDeviceSupplementalGroup/) to get custom OCI config.json uid/gid values. This would have required changes to all existing device plugins, which was not ideal.
Instead, a solution that is seamless to end users, without getting the device plugin vendors involved, was preferred. The selected approach was to re-use the runAsUser and runAsGroup values in config.json for devices:
{
    "type": "c",
    "path": "/dev/foo",
    "major": 123,
    "minor": 4,
    "fileMode": 438,
    "uid": <runAsUser>,
    "gid": <runAsGroup>
},
With the runc OCI runtime (in non-rootless mode), the device is created (mknod(2)) in the container namespace and its ownership is changed to runAsUser/runAsGroup using chown(2). The runAsUser/runAsGroup values take precedence; for example, the USER setting in the container is currently ignored.
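Conceptually, for the hypothetical /dev/foo device from the config.json snippet above, this is roughly what the runtime ends up doing inside the container's mount namespace:

# create the character device node (major 123, minor 4, fileMode 438 = 0666)
mknod -m 0666 /dev/foo c 123 4
# hand the node over to the container's runAsUser/runAsGroup
chown 1000:2000 /dev/foo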
While it is likely that such "faulty" deployments (i.e., a non-root securityContext combined with devices) do not exist, to be absolutely sure no deployments break, an opt-in config entry was added to both containerd and CRI-O to enable the new behavior. The setting:

device_ownership_from_security_context (bool)

defaults to false and must be enabled to use the feature.
See non-root containers using devices after the fix
To demonstrate the new behavior, let's use a Data Plane Development Kit (DPDK) application that uses hardware accelerators, the Kubernetes CPU manager, and HugePages as an example. The cluster runs containerd with:
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    device_ownership_from_security_context = true
or CRI-O with:
[crio.runtime]
device_ownership_from_security_context = true
and a Guaranteed QoS class Pod that runs DPDK's crypto-perf test utility is deployed with this YAML:
...
metadata:
  name: qat-dpdk
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 2000
    fsGroup: 3000
  containers:
  - name: crypto-perf
    image: intel/crypto-perf:devel
    ...
    resources:
      requests:
        cpu: "3"
        memory: "128Mi"
        qat.intel.com/generic: '4'
        hugepages-2Mi: "128Mi"
      limits:
        cpu: "3"
        memory: "128Mi"
        qat.intel.com/generic: '4'
        hugepages-2Mi: "128Mi"
...
To verify the results, check the user and group ID that the container runs as:
$ kubectl exec -it qat-dpdk -c crypto-perf -- id
They are set to non-zero values as expected:
uid=1000 gid=2000 groups=2000,3000
Next, check that the device nodes (qat.intel.com/generic exposes /dev/vfio/ devices) are accessible to runAsUser/runAsGroup:
$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/vfio
total 0
drwxr-xr-x 2 root root 140 Sep 7 10:55 .
drwxr-xr-x 7 root root 380 Sep 7 10:55 ..
crw------- 1 1000 2000 241, 0 Sep 7 10:55 58
crw------- 1 1000 2000 241, 2 Sep 7 10:55 60
crw------- 1 1000 2000 241, 10 Sep 7 10:55 68
crw------- 1 1000 2000 241, 11 Sep 7 10:55 69
crw-rw-rw- 1 1000 2000 10, 196 Sep 7 10:55 vfio
Finally, check that the non-root container is also allowed to create HugePages:
$ kubectl exec -it qat-dpdk -c crypto-perf -- ls -la /dev/hugepages/
fsGroup gives a runAsUser-writable HugePages emptyDir mountpoint:
total 0
drwxrwsr-x 2 root 3000 0 Sep 7 10:55 .
drwxr-xr-x 7 root root 380 Sep 7 10:55 ..
Help us test it and provide feedback!
The functionality described here is expected to help with cluster security and the configurability of device permissions. Allowing non-root containers to use devices requires cluster admins to opt in to the functionality by setting device_ownership_from_security_context = true. To make it a default setting, please test it and provide your feedback (via SIG-Node meetings or issues)! The flag is available in the CRI-O v1.22 release and is queued for containerd v1.6.
More work is needed to get it properly supported. It is known to work with runc, but it also needs to be made to function with other OCI runtimes too, where applicable. For instance, Kata Containers supports device passthrough and makes devices available to containers in its VM sandboxes too. Moreover, an additional challenge comes with support for user namespaces and devices. This problem is still open and requires more brainstorming.
Finally, it needs to be understood whether runAsUser/runAsGroup are enough, or whether device-specific settings similar to fsGroup are needed in the PodSpec/CRI v2.
Thanks
My thanks go to Mike Brown (IBM, containerd), Peter Hunt (Red Hat, CRI-O), and Alexander Kanevskiy (Intel) for providing all the feedback and good conversations.