Using sysctls in a Kubernetes Cluster
Kubernetes v1.21 [stable]
This document describes how to configure and use kernel parameters within a Kubernetes cluster using the sysctl interface.
The kubelet supports the use of either / or . as separators for sysctl names.
For example, you can represent the same sysctl name as kernel.shm_rmid_forced using a period as the separator, or as kernel/shm_rmid_forced using a slash as the separator.
For details on how sysctl names are converted between the two forms, see the sysctl.d(5) page from the Linux man-pages project.
The features described under Setting Sysctls for a Pod and PodSecurityPolicy do not yet support sysctl names that use slashes.
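The separator rule can be sketched in a few lines of Python. This is a minimal illustration of the convention as described in sysctl.d(5) (if the first separator is a slash the name is left intact; if it is a dot, dots and slashes are interchanged), not the kubelet's own code. The function names are hypothetical.

```python
# Table that swaps "." and "/" in a single pass.
_SWAP = str.maketrans("./", "/.")


def sysctl_to_slash_form(name: str) -> str:
    """Return the slash-separated form of a sysctl name."""
    first_dot = name.find(".")
    first_slash = name.find("/")
    # No dot at all, or a slash comes first: already in slash form.
    if first_dot == -1 or (first_slash != -1 and first_slash < first_dot):
        return name
    # The first separator is a dot: interchange dots and slashes.
    return name.translate(_SWAP)


def sysctl_to_dot_form(name: str) -> str:
    """Return the dot-separated form of a sysctl name."""
    return sysctl_to_slash_form(name).translate(_SWAP)


if __name__ == "__main__":
    for example in ("kernel.shm_rmid_forced", "kernel/shm_rmid_forced"):
        print(example, "->", "/proc/sys/" + sysctl_to_slash_form(example))
```

Normalizing to the dot form before building a Pod spec avoids the slash limitation noted above.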
Before you begin
You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not already have a cluster, you can create one by using minikube or you can use a Kubernetes playground.
For some steps, you also need to be able to reconfigure the command line options for the kubelets running on your cluster.
Listing all Sysctl Parameters
In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems such as:
- kernel (common prefix: kernel.)
- networking (common prefix: net.)
- virtual memory (common prefix: vm.)
- MDADM (common prefix: dev.)
- More subsystems are described in Kernel docs.
To get a list of all parameters, you can run:
sudo sysctl -a
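Because the parameters are exposed as files under /proc/sys/, their names can also be collected by walking that directory. The following is a minimal sketch, assuming a Linux node; the function names are hypothetical, and the code lists names only, without reading values.

```python
import os

# Swap "/" and "." so path components containing dots stay unambiguous.
_SWAP = str.maketrans("/.", "./")


def list_sysctls(root="/proc/sys"):
    """Return the sorted dot-separated sysctl names found under root."""
    names = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            rel = os.path.relpath(os.path.join(dirpath, filename), root)
            names.append(rel.replace(os.sep, "/").translate(_SWAP))
    return sorted(names)


def group_by_subsystem(names):
    """Group sysctl names by their first component (kernel, net, vm, ...)."""
    groups = {}
    for name in names:
        groups.setdefault(name.split(".", 1)[0], []).append(name)
    return groups


if __name__ == "__main__":
    for subsystem, members in sorted(group_by_subsystem(list_sysctls()).items()):
        print(f"{subsystem}: {len(members)} parameters")
```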
Enabling Unsafe Sysctls
Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod:
- must not have any influence on any other pod on the node
- must not allow harm to the node's health
- must not allow the pod to gain CPU or memory resources outside of its resource limits.
Most namespaced sysctls are not considered safe. The following sysctls are supported in the safe set:
- kernel.shm_rmid_forced
- net.ipv4.ip_local_port_range
- net.ipv4.tcp_syncookies
- net.ipv4.ping_group_range (since Kubernetes 1.18)
- net.ipv4.ip_unprivileged_port_start (since Kubernetes 1.22)
net.ipv4.tcp_syncookies is not namespaced on Linux kernel version 4.4 or lower.
This list will be extended in future Kubernetes versions when the kubelet supports better isolation mechanisms.
All safe sysctls are enabled by default.
All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis. Pods with disabled unsafe sysctls will be scheduled, but will fail to launch.
With the warning above in mind, the cluster admin can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning. Unsafe sysctls are enabled on a node-by-node basis with a flag of the kubelet; for example:
kubelet --allowed-unsafe-sysctls \
'kernel.msg*,net.core.somaxconn' ...
For Minikube, this can be done via the extra-config flag:
minikube start --extra-config="kubelet.allowed-unsafe-sysctls=kernel.msg*,net.core.somaxconn"...
Only namespaced sysctls can be enabled this way.
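The rule described in this section can be sketched as a small check: a pod's sysctls are acceptable on a node if each one is either in the safe set or matched by that node's allowlist. This is an illustration only, not the kubelet's implementation. It assumes that allowlist entries ending in * act as prefix patterns, and the function names are hypothetical.

```python
# Safe set as listed in this document.
SAFE_SYSCTLS = frozenset({
    "kernel.shm_rmid_forced",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_syncookies",
    "net.ipv4.ping_group_range",
    "net.ipv4.ip_unprivileged_port_start",
})


def parse_allowlist(flag_value):
    """Split a comma-separated flag value such as 'kernel.msg*,net.core.somaxconn'."""
    return [entry.strip() for entry in flag_value.split(",") if entry.strip()]


def matches(name, patterns):
    """Return True if name equals an entry or starts with a 'prefix*' entry."""
    for pattern in patterns:
        if pattern.endswith("*"):
            if name.startswith(pattern[:-1]):
                return True
        elif name == pattern:
            return True
    return False


def pod_can_launch(sysctl_names, allowed_unsafe_flag=""):
    """Return True if every sysctl is safe or allowed on this node."""
    allowed = parse_allowlist(allowed_unsafe_flag)
    return all(n in SAFE_SYSCTLS or matches(n, allowed) for n in sysctl_names)


if __name__ == "__main__":
    node_flag = "kernel.msg*,net.core.somaxconn"
    print(pod_can_launch(["kernel.msgmax", "net.core.somaxconn"], node_flag))
    print(pod_can_launch(["net.core.somaxconn"]))
```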
Setting Sysctls for a Pod
A number of sysctls are namespaced in today's Linux kernels. This means that they can be set independently for each pod on a node. Only namespaced sysctls are configurable via the pod securityContext within Kubernetes.
The following sysctls are known to be namespaced. This list could change in future versions of the Linux kernel.
- kernel.shm*
- kernel.msg*
- kernel.sem
- fs.mqueue.*
- The parameters under net.* that can be set in container networking namespace. However, there are exceptions (e.g., net.netfilter.nf_conntrack_max and net.netfilter.nf_conntrack_expect_max can be set in container networking namespace but they are unnamespaced).
Sysctls with no namespace are called node-level sysctls. If you need to set them, you must either configure them manually on each node's operating system or use a DaemonSet with privileged containers.
Use the pod securityContext to configure namespaced sysctls. The securityContext applies to all containers in the same pod.
This example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two unsafe sysctls net.core.somaxconn and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.core.somaxconn
      value: "1024"
    - name: kernel.msgmax
      value: "65536"
  ...
It is good practice to consider nodes with special sysctl settings as tainted within a cluster, and only schedule pods onto them which need those sysctl settings. It is suggested to use the Kubernetes taints and tolerations feature to implement this.
A pod with the unsafe sysctls will fail to launch on any node which has not enabled those two unsafe sysctls explicitly. As with node-level sysctls, it is recommended to use the taints and tolerations feature, or taints on nodes, to schedule those pods onto the right nodes.
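To see which sysctls in a pod specification require a specially configured node, the manifest can be inspected programmatically. The sketch below represents the example pod as a plain Python dictionary (rather than parsing YAML) and derives the comma-separated value that a node's allowlist would need to contain. The helper name is hypothetical, and the safe set is the one listed in this document.

```python
# Safe set as listed in this document.
SAFE_SYSCTLS = frozenset({
    "kernel.shm_rmid_forced",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_syncookies",
    "net.ipv4.ping_group_range",
    "net.ipv4.ip_unprivileged_port_start",
})

# The example pod from this section, as a dictionary.
EXAMPLE_POD = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "sysctl-example"},
    "spec": {
        "securityContext": {
            "sysctls": [
                {"name": "kernel.shm_rmid_forced", "value": "0"},
                {"name": "net.core.somaxconn", "value": "1024"},
                {"name": "kernel.msgmax", "value": "65536"},
            ]
        }
    },
}


def unsafe_sysctls(pod):
    """Return the names of the pod's sysctls that are not in the safe set."""
    sysctls = pod.get("spec", {}).get("securityContext", {}).get("sysctls", [])
    return [entry["name"] for entry in sysctls if entry["name"] not in SAFE_SYSCTLS]


if __name__ == "__main__":
    # Value a node would need in its unsafe-sysctl allowlist for this pod.
    print(",".join(unsafe_sysctls(EXAMPLE_POD)))
```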
PodSecurityPolicy
Kubernetes v1.21 [deprecated]
You can further control which sysctls can be set in pods by specifying lists of sysctls or sysctl patterns in the forbiddenSysctls and/or allowedUnsafeSysctls fields of the PodSecurityPolicy. A sysctl pattern ends with a * character, such as kernel.*. A * character on its own matches all sysctls.
By default, all safe sysctls are allowed.
Both forbiddenSysctls and allowedUnsafeSysctls are lists of plain sysctl names or sysctl patterns (which end with *). The string * matches all sysctls.
The forbiddenSysctls field excludes specific sysctls. You can forbid a combination of safe and unsafe sysctls in the list. To forbid setting any sysctls, use * on its own.
If you specify any unsafe sysctl in the allowedUnsafeSysctls field and it is not present in the forbiddenSysctls field, that sysctl can be used in Pods using this PodSecurityPolicy. To allow all unsafe sysctls in the PodSecurityPolicy to be set, use * on its own.
Do not configure these two fields such that there is overlap, meaning that a given sysctl is both allowed and forbidden.
If you allow unsafe sysctls via the allowedUnsafeSysctls field in a PodSecurityPolicy, any pod using such a sysctl will fail to start if the sysctl is not also allowed via the --allowed-unsafe-sysctls kubelet flag on that node.
This example allows unsafe sysctls prefixed with kernel.msg to be set and disallows setting of the kernel.shm_rmid_forced sysctl.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: sysctl-psp
spec:
  allowedUnsafeSysctls:
  - kernel.msg*
  forbiddenSysctls:
  - kernel.shm_rmid_forced
  ...
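The way the two fields interact can be sketched as follows, using the values from the example policy above. This is an illustration of the rules described in this section, not the admission controller's code. It assumes that forbiddenSysctls is checked first, and the function names and result strings are hypothetical.

```python
# Safe set as listed in this document; allowed by default.
SAFE_SYSCTLS = frozenset({
    "kernel.shm_rmid_forced",
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_syncookies",
    "net.ipv4.ping_group_range",
    "net.ipv4.ip_unprivileged_port_start",
})


def _split(entry):
    """Return (prefix, is_pattern) for a plain name or a 'prefix*' pattern."""
    return (entry[:-1], True) if entry.endswith("*") else (entry, False)


def matches(name, entries):
    """Return True if name is matched by any plain name or pattern."""
    for entry in entries:
        prefix, is_pattern = _split(entry)
        if (is_pattern and name.startswith(prefix)) or name == entry:
            return True
    return False


def entries_overlap(a, b):
    """Return True if some sysctl name could match both entries."""
    (pa, wa), (pb, wb) = _split(a), _split(b)
    if wa and wb:
        return pa.startswith(pb) or pb.startswith(pa)
    if wa:
        return pb.startswith(pa)
    if wb:
        return pa.startswith(pb)
    return pa == pb


def find_overlaps(allowed_unsafe, forbidden):
    """List (allowed, forbidden) pairs that overlap and should be avoided."""
    return [(a, f) for a in allowed_unsafe for f in forbidden if entries_overlap(a, f)]


def psp_decision(name, allowed_unsafe, forbidden):
    """Return 'forbidden', 'allowed', or 'not allowed' for one sysctl name."""
    if matches(name, forbidden):
        return "forbidden"
    if name in SAFE_SYSCTLS or matches(name, allowed_unsafe):
        return "allowed"
    return "not allowed"


if __name__ == "__main__":
    allowed, forbidden = ["kernel.msg*"], ["kernel.shm_rmid_forced"]
    for name in ("kernel.msgmax", "kernel.shm_rmid_forced", "net.core.somaxconn"):
        print(name, "->", psp_decision(name, allowed, forbidden))
```

Running find_overlaps on a policy before applying it is one way to catch a sysctl that is both allowed and forbidden.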